Multimodal Learning

LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer

18 December 2024·3363 words·16 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University

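LLaVA-UHD v2 is a multimodal large language model that uses a hierarchical window transformer to integrate a high-resolution feature pyramid, letting it capture visual detail at multiple levels of granularity.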

GUI Agents: A Survey

18 December 2024·207 words·1 min

AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 University of Maryland

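A comprehensive survey of recent work on GUI agents built on large language models, presenting a unified framework that systematically categorizes benchmarks, evaluation metrics, architectures, and training methods.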

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

18 December 2024·2500 words·12 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology

Improves multimodal model performance by enhancing image captions with visual specialist models.

GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training

16 December 2024·2232 words·11 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University

GeoX: a geometric problem solver that outperforms general-purpose MLLMs.

Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

13 December 2024·2277 words·11 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 NVIDIA Research

ResGen: an efficient generative model based on residual vector quantization (RVQ) that achieves both high-quality generation and fast sampling.

Apollo: An Exploration of Video Understanding in Large Multimodal Models

13 December 2024·1707 words·9 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI

Apollo: an in-depth exploration of video understanding in large multimodal models.

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

12 December 2024·3268 words·16 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University

SynerGen-VL: an MLLM with a simple architecture that handles both image understanding and image generation.

Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

12 December 2024·2344 words·12 mins

AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Edinburgh

VMB presents a new, controllable framework for multimodal music generation that uses text and music as explicit bridges.

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

12 December 2024·3354 words·16 mins

AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 Shanghai Artificial Intelligence Laboratory

InternLM-XComposer2.5-OmniLive: a multimodal AI system, modeled on human cognition, for real-time streaming video and audio interaction.

GenEx: Generating an Explorable World

12 December 2024·2180 words·11 mins

AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Johns Hopkins University

GenEx: generates an explorable 3D world from a single image.

BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities

10 December 2024·2792 words·14 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Mohamed Bin Zayed University of Artificial Intelligence

BiMediX2: a newly released Arabic-English bilingual medical expert LMM.