Multimodal Learning

LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3363 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
LLaVA-UHD v2 is a multimodal language model that uses a hierarchical window transformer to integrate a high-resolution feature pyramid and capture visual detail at multiple levels.
GUI Agents: A Survey
·207 words·1 min
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 University of Maryland
A comprehensive analysis of recent trends in LLM-based GUI agents that systematically categorizes benchmarks, evaluation metrics, architectures, and training methods into a unified framework.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2500 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Improves multimodal model performance by enhancing image captions with visual specialist models.
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
·2232 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University
GeoX: a geometric problem solver that outperforms MLLMs.
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
·2277 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 NVIDIA Research
ResGen: an efficient RVQ-based generative model that achieves both high-quality generation and fast sampling.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1707 words·9 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
Apollo: an in-depth exploration of video understanding in large multimodal models.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
·3268 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
SynerGen-VL: an MLLM that performs image understanding and generation together with a simple architecture.
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
·2344 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Edinburgh
VMB presents a new, controllable framework for multimodal music generation that uses text and music bridges.
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
·3354 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2.5-OmniLive: a multimodal AI system that mimics human cognition for real-time interaction with streaming video and audio.
GenEx: Generating an Explorable World
·2180 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Johns Hopkins University
GenEx: generating an explorable 3D world from a single image.
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
·2792 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Mohamed Bin Zayed University of Artificial Intelligence
BiMediX2: an Arabic-English bilingual medical expert LMM.