
Vision-Language Models

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
·1981 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong
Dispider enables video LLMs with active real-time interaction through disentangled perception, decision, and reaction.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
·2176 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tencent Youtu Lab
VITA-1.5: a GPT-4o-level multimodal LLM for real-time vision and speech interaction.
AutoPresent: Designing Structured Visuals from Scratch
·3831 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
AUTOPRESENT: automatic generation of complete presentation slides from natural-language instructions!
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
·3272 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 College of Computer Science and Technology, Zhejiang University
Builds a high-quality multimodal textbook corpus from 2.5 years' worth of instructional videos, improving VLM pretraining performance.
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
·3245 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Zhejiang University
VideoRefer Suite presents a new video LLM for fine-grained spatial-temporal object understanding (VideoRefer), a large-scale high-quality dataset (VideoRefer-700K), and a comprehensive benchmark (VideoRefer-Bench).
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
·3155 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Integrated Vision Language Lab, KAIST
Presents a new benchmark (MS-PR) and a DNA optimization technique to improve VLMs' understanding of multi-vision sensor data.
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
·2961 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
OS-Genesis is a novel pipeline that automates GUI agent trajectory construction through reverse task synthesis.
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
·2870 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Xi'an Jiaotong University
LaDeCo: automatic graphic design composition using a layered approach.
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
·3029 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai AI Laboratory
Task Preference Optimization (TPO) via vision task alignment markedly improves the performance of multimodal large language models.
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
·3101 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Bonn
Video-Panda: an ultra-lightweight encoder-free video-language model that achieves state-of-the-art performance while drastically reducing computational cost!
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
·2002 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Mulberry uses Collective Monte Carlo Tree Search (CoMCTS) to build a multimodal large language model (MLLM) with step-by-step reasoning and reflection capabilities.
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
·2158 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
Molar: an innovative framework combining multimodal LLMs with collaborative filtering to dramatically improve sequential recommendation!
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
·2306 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of British Columbia
MMFactory: a user-tailored solution search engine for vision-language tasks.
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2165 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
MegaPairs leverages VLMs and open-domain images to synthesize over 26 million high-quality multimodal training pairs, substantially improving universal multimodal retrieval.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
·2904 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 GenAI, Meta
CrossFlow: an innovative framework enabling direct transformation between modalities!
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3363 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
LLaVA-UHD v2 is an innovative multimodal language model that integrates a high-resolution feature pyramid via a hierarchical window transformer to capture diverse visual detail.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2500 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Improves multimodal model performance by enhancing image captions with visual specialist models.
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
·2232 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University
GeoX: a geometric problem solver that outperforms MLLMs!
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
·2277 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 NVIDIA Research
ResGen: an efficient RVQ-based generative model that achieves both high-quality generation and fast sampling.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1707 words·9 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
Apollo: an in-depth exploration of video understanding in large multimodal models.