
Vision-Language Models

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
·1981 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong
Dispider enables video LLMs with active real-time interaction through disentangled perception, decision, and reaction.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
·2176 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tencent Youtu Lab
VITA-1.5: a GPT-4o-level multimodal LLM for real-time vision and speech interaction.
AutoPresent: Designing Structured Visuals from Scratch
·3831 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
AUTOPRESENT: automatic generation of complete presentation slides from natural-language instructions!
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
·3272 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 College of Computer Science and Technology, Zhejiang University
Builds a high-quality multimodal textbook corpus from 2.5 years' worth of instructional videos, improving VLM pretraining performance.
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
·3245 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Zhejiang University
VideoRefer Suite presents a new video LLM for fine-grained spatial-temporal object understanding (VideoRefer), a large-scale high-quality dataset (VideoRefer-700K), and a comprehensive benchmark (VideoRefer-Bench).
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
·3155 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Integrated Vision Language Lab, KAIST
Presents a new benchmark (MS-PR) and a DNA optimization technique to improve VLMs' understanding of multi-vision sensor data.
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
·2961 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
OS-Genesis is a novel pipeline that automates GUI agent trajectory construction through reverse task synthesis.
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
·2870 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Xi'an Jiaotong University
LaDeCo: automatic graphic design composition using a layered approach.
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
·3029 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai AI Laboratory
Task Preference Optimization (TPO) via vision task alignment markedly improves the performance of multimodal large language models.
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
·3101 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Bonn
Video-Panda: an ultra-lightweight encoder-free video-language model that achieves state-of-the-art performance while drastically reducing computational cost!
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
·2002 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Mulberry uses Collective Monte Carlo Tree Search (CoMCTS) to build a multimodal large language model (MLLM) with step-by-step reasoning and reflection capabilities.
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
·2158 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
Molar: an innovative framework combining multimodal LLMs with collaborative filtering to dramatically improve sequential recommendation!
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
·2306 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of British Columbia
MMFactory: a user-tailored solution search engine for vision-language tasks.
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2165 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
MegaPairs leverages VLMs and open-domain images to synthesize over 26 million high-quality multimodal training pairs, substantially improving universal multimodal retrieval.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
·2904 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 GenAI, Meta
CrossFlow: an innovative framework enabling direct transformation between modalities!
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3363 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
LLaVA-UHD v2 is an innovative multimodal language model that integrates a high-resolution feature pyramid via a hierarchical window transformer to capture diverse visual detail.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2500 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Improves multimodal model performance by enhancing image captions with visual specialist models.
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
·2232 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University
GeoX: a geometric problem solver that outperforms MLLMs!
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
·2277 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 NVIDIA Research
ResGen: an efficient RVQ-based generative model that achieves both high-quality generation and fast sampling.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1707 words·9 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
Apollo: an in-depth exploration of video understanding in large multimodal models.