Multimodal Learning
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3363 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
LLaVA-UHD v2 is a multimodal language model that uses a hierarchical window transformer to integrate a high-resolution feature pyramid and capture diverse visual details.
GUI Agents: A Survey
·207 words·1 min·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 University of Maryland
A comprehensive analysis of recent trends in large language model-based GUI agents, presenting a unified framework that systematically categorizes benchmarks, evaluation metrics, architectures, and training methods.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2500 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
Improving multimodal model performance by enhancing image captions with visual specialist models.
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
·2232 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
GeoX: a geometric problem solver that outperforms MLLMs.
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
·2277 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NVIDIA Research
ResGen: an efficient RVQ-based generative model that achieves both high-quality generation and fast sampling.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1707 words·9 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Meta GenAI
Apollo: an in-depth exploration of video understanding in large multimodal models.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
·3268 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
SynerGen-VL: a powerful MLLM that performs both image understanding and generation with a simple architecture.
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
·2344 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Generation
🏢 University of Edinburgh
VMB presents a new, controllable framework for multimodal music generation that uses text and music bridges.
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
·3354 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2.5-OmniLive: a multimodal AI system that mimics human cognitive abilities for real-time streaming video and audio interaction.
GenEx: Generating an Explorable World
·2180 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 Johns Hopkins University
GenEx: generating an explorable 3D world from a single image.
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
·2792 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Mohamed Bin Zayed University of Artificial Intelligence
BiMediX2: an Arabic-English bilingual medical expert LMM.