Vision-Language Models
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
·1981 words·10 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Chinese University of Hong Kong
Dispider: ์ค์๊ฐ ์ํธ์์ฉ์ ์ํด ๋ถ๋ฆฌ๋ ์ธ์, ๊ฒฐ์ , ๋ฐ์์ ์ฌ์ฉํ๋ ๋น๋์ค LLM์ ๊ฐ๋ฅํ๊ฒ ํฉ๋๋ค.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
·2176 words·11 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Tencent Youtu Lab
VITA-1.5: ์ค์๊ฐ ์๊ฐ ๋ฐ ์์ฑ ์ํธ์์ฉ์ ์ํ GPT-40 ์์ค์ ๋ค์ค ๋ชจ๋ฌ LLM
AutoPresent: Designing Structured Visuals from Scratch
·3831 words·18 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Carnegie Mellon University
AUTOPRESENT: ์์ฐ์ด ๋ช
๋ น์ด๋ก ์๋ฒฝํ ํ๋ ์ ํ
์ด์
์ฌ๋ผ์ด๋ ์๋ ์์ฑ!
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
·3272 words·16 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข College of Computer Science and Technology, Zhejiang University
2.5๋
๋ถ๋์ ๊ต์ก ๋น๋์ค๋ฅผ ํ์ฉ, ๊ณ ํ์ง ๋ค์ค ๋ชจ๋ฌ ํ
์คํธ๋ถ ์ฝํผ์ค ๊ตฌ์ถ ๋ฐ VLMs ์ฌ์ ํ์ต ์ฑ๋ฅ ํฅ์
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
·3245 words·16 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Zhejiang University
VideoRefer Suite๋ ์ ๊ตํ ๊ณต๊ฐ-์๊ฐ์ ๊ฐ์ฒด ์ดํด๋ฅผ ์ํ ์๋ก์ด ๋น๋์ค LLM(VideoRefer)๊ณผ ๋๊ท๋ชจ ๊ณ ํ์ง ๋ฐ์ดํฐ์
(VideoRefer-700K), ์ข
ํฉ์ ์ธ ๋ฒค์น๋งํฌ(VideoRefer-Bench)๋ฅผ ์ ์ํฉ๋๋ค.
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
·3155 words·15 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Integrated Vision Language Lab, KAIST
๋ฉํฐ ๋น์ ์ผ์ ๋ฐ์ดํฐ์ ๋ํ VLMs์ ์ดํด๋ ํฅ์์ ์ํ ์๋ก์ด ๋ฒค์น๋งํฌ(MS-PR)์ DNA ์ต์ ํ ๊ธฐ๋ฒ ์ ์
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
·2961 words·14 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Hong Kong University of Science and Technology
OS-Genesis๋ ์ญ๋ฐฉํฅ ์์
ํฉ์ฑ์ ํตํด GUI ์์ด์ ํธ ๊ถค์ ์์ฑ ์๋ํ ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ๋ ํ์ ์ ์ธ ํ์ดํ๋ผ์ธ์
๋๋ค.
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
·2870 words·14 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Xi'an Jiaotong University
LaDeCo: ๊ณ์ธต์ ์ ๊ทผ ๋ฐฉ์์ ์ฌ์ฉํ ์๋ ๊ทธ๋ํฝ ๋์์ธ ํฉ์ฑ
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
·3029 words·15 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Shanghai AI Laboratory
์๊ฐ์ ๊ณผ์ ์ ๋ ฌ์ ํตํ ์์
์ ํธ๋ ์ต์ ํ(TPO)๋ก ๋ฉํฐ๋ชจ๋ฌ ๋๊ท๋ชจ ์ธ์ด ๋ชจ๋ธ์ ์ฑ๋ฅ์ ํ๊ธฐ์ ์ผ๋ก ํฅ์์์ผฐ์ต๋๋ค.
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
·3101 words·15 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข University of Bonn
Video-Panda: ์ด๊ฒฝ๋ ์ธ์ฝ๋ ์๋ ๋น๋์ค-์ธ์ด ๋ชจ๋ธ๋ก, ๊ณ์ฐ ๋น์ฉ์ ํ๊ธฐ์ ์ผ๋ก ์ค์ด๋ฉด์ ์ต์ฒจ๋จ ์ฑ๋ฅ์ ๋ฌ์ฑ!
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
·2002 words·10 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Tsinghua University
Mulberry๋ ์ง๋จ ๋ชฌํ
์นด๋ฅผ๋ก ํธ๋ฆฌ ํ์(CoMCTS)์ ์ด์ฉ, ๋จ๊ณ์ ์ถ๋ก ๋ฐ ๋ฐ์ฑ ๋ฅ๋ ฅ์ ๊ฐ์ถ ๋ค์ค ๋ชจ๋ ๋๊ท๋ชจ ์ธ์ด ๋ชจ๋ธ(MLLM)์ ๊ฐ๋ฐํ ์ฐ๊ตฌ์
๋๋ค.
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
·2158 words·11 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข University of Science and Technology of China
Molar: ๋ฉํฐ๋ชจ๋ฌ LLM๊ณผ ํ์
ํํฐ๋ง์ ๊ฒฐํฉํ์ฌ ์ํ์
์ถ์ฒ ์ฑ๋ฅ์ ํ๊ธฐ์ ์ผ๋ก ํฅ์์ํจ ํ์ ์ ์ธ ํ๋ ์์ํฌ!
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
·2306 words·11 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข University of British Columbia
MMFactory: ์ฌ์ฉ์ ๋ง์ถคํ ๋น์ -์ธ์ด ์์
์๋ฃจ์
๊ฒ์ ์์ง
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2165 words·11 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Hong Kong University of Science and Technology
MegaPairs๋ VLM๊ณผ ๊ณต๊ฐ ๋๋ฉ์ธ ์ด๋ฏธ์ง๋ฅผ ํ์ฉ, 2600๋ง ๊ฐ ์ด์์ ๊ณ ํ์ง ๋ค์ค ๋ชจ๋ฌ ํ์ต ๋ฐ์ดํฐ๋ฅผ ์์ฑํ์ฌ ๋ฒ์ฉ ๋ค์ค ๋ชจ๋ฌ ๊ฒ์ ์ฑ๋ฅ์ ํ๊ธฐ์ ์ผ๋ก ํฅ์์์ผฐ์ต๋๋ค.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
·2904 words·14 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข GenAI, Meta
CrossFlow: ๋ชจ๋ฌ๋ฆฌํฐ ๊ฐ ์ง์ ์ ๋ณํ ๊ฐ๋ฅํ ํ์ ์ ํ๋ ์์ํฌ!
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3363 words·16 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Tsinghua University
LLaVA-UHD v2๋ ๊ณ์ธต์ ์๋์ฐ ๋ณํ๊ธฐ๋ฅผ ์ด์ฉ, ๊ณ ํด์๋ ํน์ง ํผ๋ผ๋ฏธ๋๋ฅผ ํตํฉํ์ฌ ๋ค์ํ ์๊ฐ์ ์ธ๋ถ ์ ๋ณด๋ฅผ ํฌ์ฐฉํ๋ ํ์ ์ ์ธ ๋ค์ค ๋ชจ๋ฌ ์ธ์ด ๋ชจ๋ธ์
๋๋ค.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2500 words·12 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Hong Kong University of Science and Technology
์๊ฐ ์ ๋ฌธ๊ฐ ๋ชจ๋ธ์ ํ์ฉํ ์ด๋ฏธ์ง ์บก์
ํฅ์์ผ๋ก ๋ค์ค ๋ชจ๋ฌ ๋ชจ๋ธ ์ฑ๋ฅ ๊ฐ์
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
·2232 words·11 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Shanghai Jiao Tong University
GeoX: MLLM๋ณด๋ค ๋ฐ์ด๋ ๊ธฐํํ์ ๋ฌธ์ ํด๊ฒฐ์ฌ!
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
·2277 words·11 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข NVIDIA Research
ResGen, ๊ณ ํ์ง ์์ฑ๊ณผ ๋น ๋ฅธ ์ํ๋ง ์๋๋ฅผ ๋ชจ๋ ๋ฌ์ฑํ๋ ํจ์จ์ ์ธ RVQ ๊ธฐ๋ฐ ์์ฑ ๋ชจ๋ธ.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1707 words·9 mins·
loading
·
loading
AI Generated
๐ค Daily Papers
Multimodal Learning
Vision-Language Models
๐ข Meta GenAI
Apollo: ๋๊ท๋ชจ ๋ฉํฐ๋ชจ๋ฌ ๋ชจ๋ธ์ ๋น๋์ค ์ดํด๋ฅผ ์ํ ์ฌ์ธต ํ๊ตฌ.