Vision-Language Models

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2165 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
MegaPairs uses VLMs and open-domain images to generate more than 26 million high-quality multimodal training examples, substantially improving universal multimodal retrieval performance.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
·2904 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 GenAI, Meta
CrossFlow: an innovative framework that enables direct transformation between modalities!
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3363 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
LLaVA-UHD v2 is an innovative multimodal language model that uses a hierarchical window transformer to integrate a high-resolution feature pyramid and capture diverse visual details.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2500 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Improving multimodal model performance by enhancing image captions with visual specialist models.
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
·2232 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University
GeoX: a geometric problem solver that outperforms MLLMs!
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
·2277 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 NVIDIA Research
ResGen: an efficient RVQ-based generative model that achieves both high-quality generation and fast sampling.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1707 words·9 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
Apollo: an in-depth exploration of video understanding in large multimodal models.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
·3268 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
SynerGen-VL: a powerful MLLM that performs image understanding and generation simultaneously with a simple architecture.
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
·2792 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Mohamed Bin Zayed University of Artificial Intelligence
BiMediX2: an Arabic-English bilingual medical expert LMM, now released!