Multimodal Learning

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

6 January 2025·1981 words·10 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong

Dispider: Enables video LLMs with active real-time interaction by disentangling perception, decision, and reaction.

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

3 January 2025·2176 words·11 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tencent Youtu Lab

VITA-1.5: A GPT-4o-level multimodal LLM for real-time vision and speech interaction

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

3 January 2025·3242 words·16 mins

AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Gaoling School of Artificial Intelligence, Renmin University of China

Virgo: Uses text-based long-form thought data to achieve state-of-the-art performance on a range of multimodal benchmarks!

AutoPresent: Designing Structured Visuals from Scratch

1 January 2025·3831 words·18 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University

AUTOPRESENT: Automatically generates complete presentation slides from natural-language instructions!

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

1 January 2025·3272 words·16 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 College of Computer Science and Technology, Zhejiang University

Uses 2.5 years' worth of instructional videos to build a high-quality multimodal textbook corpus and improve VLM pretraining performance

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

31 December 2024·3245 words·16 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Zhejiang University

VideoRefer Suite presents a new video LLM (VideoRefer) for fine-grained spatial-temporal object understanding, a large-scale, high-quality dataset (VideoRefer-700K), and a comprehensive benchmark (VideoRefer-Bench).

Are Vision-Language Models Truly Understanding Multi-vision Sensor?

30 December 2024·3155 words·15 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Integrated Vision Language Lab, KAIST

Presents a new benchmark (MS-PR) and a DNA optimization technique for improving VLMs' understanding of multi-vision sensor data

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

27 December 2024·2961 words·14 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology

OS-Genesis is a novel pipeline that automates GUI agent trajectory construction through reverse task synthesis.

From Elements to Design: A Layered Approach for Automatic Graphic Design Composition

27 December 2024·2870 words·14 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Xi'an Jiaotong University

LaDeCo: Automatic graphic design composition using a layered approach

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

26 December 2024·3029 words·15 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai AI Laboratory

Task Preference Optimization (TPO) significantly improves the performance of multimodal large language models through vision task alignment.

Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

24 December 2024·3101 words·15 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Bonn

Video-Panda: An ultra-lightweight, encoder-free video-language model that achieves state-of-the-art performance while drastically reducing computational cost!

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

24 December 2024·2002 words·10 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University

Mulberry uses Collective Monte Carlo Tree Search (CoMCTS) to develop a multimodal large language model (MLLM) capable of step-by-step reasoning and reflection.

Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation

24 December 2024·2158 words·11 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China

Molar: A novel framework that combines multimodal LLMs with collaborative filtering to significantly improve sequential recommendation performance!

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

24 December 2024·2306 words·11 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of British Columbia

MMFactory: A solution search engine for vision-language tasks, tailored to each user

PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

23 December 2024·3159 words·15 mins

AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 Shanghai Jiao Tong University

PC Agent is a novel system that automates complex digital tasks by transferring human cognitive processes to AI.

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

19 December 2024·1340 words·7 mins

AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Illinois Urbana-Champaign

Proposes MMAudio, a novel multimodal joint training framework for high-quality video-to-audio synthesis!

Progressive Multimodal Reasoning via Active Retrieval

19 December 2024·2635 words·13 mins

AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Gaoling School of Artificial Intelligence, Renmin University of China

AR-MCTS: Improving multimodal reasoning with active retrieval and Monte Carlo Tree Search

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

19 December 2024·2165 words·11 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology

MegaPairs uses VLMs and open-domain images to generate more than 26 million high-quality multimodal training examples, significantly improving universal multimodal retrieval performance.

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

19 December 2024·2904 words·14 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 GenAI, Meta

CrossFlow: A novel framework that enables direct transformation between modalities!

AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

19 December 2024·2525 words·12 mins

AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Snap Inc

AV-Link: A major advance in cross-modal audio-video generation through temporally aligned diffusion features!