Multimodal Learning

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
·1981 words·10 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Chinese University of Hong Kong
Dispider enables video LLMs with active real-time interaction through disentangled perception, decision, and reaction.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
·2176 words·11 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Tencent Youtu Lab
VITA-1.5: a multimodal LLM aiming at GPT-4o-level real-time vision and speech interaction.
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
·3242 words·16 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Multimodal Reasoning 🏒 Gaoling School of Artificial Intelligence, Renmin University of China
Virgo: ν…μŠ€νŠΈ 기반 μž₯λ¬Έ 사고 데이터λ₯Ό ν™œμš©, λ‹€μ–‘ν•œ λ©€ν‹°λͺ¨λ‹¬ λ²€μΉ˜λ§ˆν¬μ—μ„œ μ΅œμ²¨λ‹¨ μ„±λŠ₯ 달성!
AutoPresent: Designing Structured Visuals from Scratch
·3831 words·18 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Carnegie Mellon University
AUTOPRESENT: μžμ—°μ–΄ λͺ…λ Ήμ–΄λ‘œ μ™„λ²½ν•œ ν”„λ ˆμ  ν…Œμ΄μ…˜ μŠ¬λΌμ΄λ“œ μžλ™ 생성!
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
·3272 words·16 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 College of Computer Science and Technology, Zhejiang University
2.5λ…„ λΆ„λŸ‰μ˜ ꡐ윑 λΉ„λ””μ˜€λ₯Ό ν™œμš©, κ³ ν’ˆμ§ˆ 닀쀑 λͺ¨λ‹¬ ν…μŠ€νŠΈλΆ μ½”νΌμŠ€ ꡬ좕 및 VLMs 사전 ν•™μŠ΅ μ„±λŠ₯ ν–₯상
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
·3245 words·16 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Zhejiang University
The VideoRefer Suite presents a new video LLM for fine-grained spatial-temporal object understanding (VideoRefer), a large-scale high-quality dataset (VideoRefer-700K), and a comprehensive benchmark (VideoRefer-Bench).
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
·3155 words·15 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Integrated Vision Language Lab, KAIST
Introduces a new benchmark (MS-PR) and a DNA optimization technique to improve VLMs' understanding of multi-vision sensor data.
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
·2961 words·14 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Hong Kong University of Science and Technology
OS-Genesis is a novel pipeline that automates GUI agent trajectory construction through reverse task synthesis.
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
·2870 words·14 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Xi'an Jiaotong University
LaDeCo: automatic graphic design composition using a layered approach.
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
·3029 words·15 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Shanghai AI Laboratory
μ‹œκ°μ  과제 정렬을 ν†΅ν•œ μž‘μ—… μ„ ν˜Έλ„ μ΅œμ ν™”(TPO)둜 λ©€ν‹°λͺ¨λ‹¬ λŒ€κ·œλͺ¨ μ–Έμ–΄ λͺ¨λΈμ˜ μ„±λŠ₯을 획기적으둜 ν–₯μƒμ‹œμΌ°μŠ΅λ‹ˆλ‹€.
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
·3101 words·15 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 University of Bonn
Video-Panda: an ultra-lightweight, encoder-free video-language model that achieves state-of-the-art performance while dramatically cutting computational cost!
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
·2002 words·10 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Tsinghua University
Mulberry uses Collective Monte Carlo Tree Search (CoMCTS) to build a multimodal large language model (MLLM) with step-by-step reasoning and reflection capabilities.
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
·2158 words·11 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 University of Science and Technology of China
Molar: a novel framework that combines multimodal LLMs with collaborative filtering to substantially improve sequential recommendation performance!
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
·2306 words·11 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 University of British Columbia
MMFactory: a solution search engine for user-customized vision-language tasks.
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
·3159 words·15 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Human-AI Interaction 🏒 Shanghai Jiao Tong University
PC Agent is a novel system that transfers human cognitive processes to AI in order to automate complex digital work.
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
·1340 words·7 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Multimodal Generation 🏒 University of Illinois Urbana-Champaign
Proposes MMAudio, a novel multimodal joint-training framework for high-quality video-to-audio synthesis!
Progressive Multimodal Reasoning via Active Retrieval
·2635 words·13 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Multimodal Reasoning 🏒 Gaoling School of Artificial Intelligence, Renmin University of China
AR-MCTS: improving multimodal reasoning with active retrieval and Monte Carlo Tree Search.
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2165 words·11 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Hong Kong University of Science and Technology
MegaPairs leverages VLMs and open-domain images to synthesize over 26 million high-quality multimodal training pairs, substantially improving universal multimodal retrieval.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
·2904 words·14 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 GenAI, Meta
CrossFlow: a novel framework for direct transformation between modalities!
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
·2525 words·12 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Multimodal Generation 🏒 Snap Inc
AV-Link: a major step forward in cross-modal audio-video generation via temporally aligned diffusion features!