Vision-Language Models
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2165 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
MegaPairs uses VLMs and open-domain images to generate more than 26 million high-quality multimodal training examples, substantially improving universal multimodal retrieval performance.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
·2904 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 GenAI, Meta
CrossFlow: an innovative framework that maps directly between modalities!
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3363 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
LLaVA-UHD v2 is an innovative multimodal language model that uses a hierarchical window transformer to integrate a high-resolution feature pyramid and capture diverse visual details.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2500 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
Improving multimodal model performance by enhancing image captions with visual specialist models.
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
·2232 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
GeoX: a geometry problem solver that outperforms MLLMs!
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
·2277 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NVIDIA Research
ResGen: an efficient RVQ-based generative model that achieves both high-quality generation and fast sampling.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1707 words·9 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Meta GenAI
Apollo: an in-depth exploration of video understanding in large multimodal models.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
·3268 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
SynerGen-VL: a powerful MLLM that handles both image understanding and generation with a simple architecture.
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
·2792 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Mohamed Bin Zayed University of Artificial Intelligence
BiMediX2: a bilingual Arabic-English medical expert LMM released!