Multimodal Learning
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
·1981 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong
Dispider: Enables video LLMs for real-time interaction by using disentangled perception, decision, and reaction.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
·2176 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tencent Youtu Lab
VITA-1.5: A GPT-4o-level multimodal LLM for real-time vision and speech interaction.
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
·3242 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Gaoling School of Artificial Intelligence, Renmin University of China
Virgo: Uses text-based long-form thought data to achieve state-of-the-art performance on a range of multimodal benchmarks!
AutoPresent: Designing Structured Visuals from Scratch
·3831 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
AUTOPRESENT: Automatically generates polished presentation slides from natural-language instructions!
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
·3272 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 College of Computer Science and Technology, Zhejiang University
Uses 2.5 years' worth of instructional videos to build a high-quality multimodal textbook corpus and improve VLM pretraining performance.
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
·3245 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
VideoRefer Suite presents a new video LLM (VideoRefer) for fine-grained spatial-temporal object understanding, a large-scale high-quality dataset (VideoRefer-700K), and a comprehensive benchmark (VideoRefer-Bench).
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
·3155 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Integrated Vision Language Lab, KAIST
Presents a new benchmark (MS-PR) and a DNA optimization technique to improve VLMs' understanding of multi-vision sensor data.
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
·2961 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
OS-Genesis is an innovative pipeline that addresses the problem of automating GUI agent trajectory construction through reverse task synthesis.
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
·2870 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Xi'an Jiaotong University
LaDeCo: Automatic graphic design composition using a layered approach.
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
·3029 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai AI Laboratory
Task Preference Optimization (TPO), built on vision task alignment, markedly improves the performance of multimodal large language models.
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
·3101 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Bonn
Video-Panda: An ultra-lightweight, encoder-free video-language model that sharply reduces computational cost while achieving state-of-the-art performance!
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
·2002 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Mulberry is a study that uses Collective Monte Carlo Tree Search (CoMCTS) to develop a multimodal large language model (MLLM) capable of step-by-step reasoning and reflection.
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
·2158 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
Molar: An innovative framework that combines multimodal LLMs with collaborative filtering to markedly improve sequential recommendation performance!
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
·2306 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of British Columbia
MMFactory: A user-customized solution search engine for vision-language tasks.
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
·3159 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 Shanghai Jiao Tong University
PC Agent is an innovative system that transfers human cognitive processes to AI in order to automate complex digital work.
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
·1340 words·7 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Generation
🏢 University of Illinois Urbana-Champaign
Presents MMAudio, an innovative multimodal joint training framework for high-quality video-to-audio synthesis!
Progressive Multimodal Reasoning via Active Retrieval
·2635 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Gaoling School of Artificial Intelligence, Renmin University of China
AR-MCTS: Improves multimodal reasoning with active retrieval and Monte Carlo Tree Search.
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2165 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
MegaPairs uses VLMs and open-domain images to generate more than 26 million high-quality multimodal training samples, markedly improving universal multimodal retrieval performance.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
·2904 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 GenAI, Meta
CrossFlow: An innovative framework that enables direct transformation between modalities!
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
·2525 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Generation
🏢 Snap Inc
AV-Link: A significant advance in cross-modal audio-video generation through temporally aligned diffusion features!