🏢 Hong Kong University of Science and Technology

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

2 January 2025·2466 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

VideoAnydoor: high-fidelity video object insertion with precise motion control

A3: Android Agent Arena for Mobile GUI Agents

2 January 2025·1920 words·10 mins

AI Generated 🤗 Daily Papers AI Applications Human-AI Interaction 🏢 Hong Kong University of Science and Technology

Android Agent Arena (A3): a novel platform for dynamically evaluating AI agents on real mobile apps

Edicho: Consistent Image Editing in the Wild

30 December 2024·2213 words·11 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Hong Kong University of Science and Technology

Edicho: zero-shot image editing that stays consistent across images!

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

27 December 2024·2961 words·14 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology

OS-Genesis is a novel pipeline that automates GUI agent trajectory construction through reverse task synthesis.

Large Motion Video Autoencoding with Cross-modal Video VAE

23 December 2024·2098 words·10 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

An innovative cross-modal video VAE for high-quality video generation and efficient compression!

Diving into Self-Evolving Training for Multimodal Reasoning

23 December 2024·2584 words·13 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Hong Kong University of Science and Technology

M-STAR: a new framework for self-evolving training for multimodal reasoning!

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

23 December 2024·1797 words·9 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Hong Kong University of Science and Technology

B-STaR: a new framework that improves performance by monitoring and balancing exploration and exploitation in self-taught reasoners

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

19 December 2024·2165 words·11 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology

MegaPairs uses VLMs and open-domain images to generate more than 26 million high-quality multimodal training examples, substantially improving universal multimodal retrieval performance.

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

19 December 2024·2184 words·11 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Hong Kong University of Science and Technology

LeviTor: an innovative model that synthesizes realistic video from nothing more than a simple user-supplied 3D trajectory!

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

18 December 2024·2500 words·12 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology

Enhancing image captions with visual specialist models to improve multimodal model performance

AniDoc: Animation Creation Made Easier

18 December 2024·1844 words·9 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

AniDoc: an innovative AI model that uses sparse sketches and reference images to automate colorization and in-betweening for 2D animation!

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

15 December 2024·2657 words·13 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Hong Kong University of Science and Technology

GaussianProperty is a training-free framework that uses LMMs to integrate physical properties into 3D Gaussians, enabling downstream tasks such as physics-based simulation and robotic grasping.