Paper Reviews by AI

TransPixar: Advancing Text-to-Video Generation with Transparency

6 January 2025·2013 words·10 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Adobe Research

TransPixar: high-quality transparent video generation, even with limited data.

Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

6 January 2025·2799 words·14 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Meta

THROUGH-THE-MASK, a two-stage image-to-video generation framework that uses mask-based motion trajectories, enables accurate animation of multiple objects.

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

6 January 2025·3033 words·15 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanjing University

STAR: a real-world video super-resolution method built on T2V models that achieves realistic spatial detail and robust temporal consistency.

Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models

6 January 2025·1134 words·6 mins

AI Generated 🤗 Daily Papers Natural Language Processing Speech Recognition 🏢 SandLogic Technologies Pvt Ltd.

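Built on the Mamba architecture, Samba-ASR uses efficient state-space models to overcome the limitations of existing Transformer models and achieve state-of-the-art performance in speech recognition.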

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

6 January 2025·1981 words·10 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong

Dispider: enables video LLMs that interact in real time by disentangling perception, decision, and reaction.

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

6 January 2025·2104 words·10 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Shanghai AI Laboratory

BoostStep: boosting the mathematical ability of LLMs through step-level reasoning.

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

6 January 2025·4797 words·23 mins

AI Generated 🤗 Daily Papers Natural Language Processing Question Answering 🏢 Stanford University

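AutoConverter is a system that automatically converts open-ended VQA questions into multiple-choice questions, which can improve the objectivity and reproducibility of VLM (Vision Language Model) evaluation. The researchers used AutoConverter to build VMCBench, a new benchmark that unifies 20 existing VQA datasets. VMCBen…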

ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

5 January 2025·3178 words·15 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 ByteDance

ToolHop: a new benchmark that rigorously evaluates the multi-hop tool-use ability of large language models.

Test-time Computing: from System-1 Thinking to System-2 Thinking

5 January 2025·699 words·4 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Soochow University

A study presenting how test-time computing can raise the reasoning ability of large language models from System-1 thinking to the level of System-2 thinking.

Scaling Laws for Floating Point Quantization Training

5 January 2025·5642 words·27 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tencent AI Lab

New scaling laws for floating-point quantization training: quantifies how exponent bits, mantissa bits, and the calculation precision of the scaling factor affect LLM performance.

GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

5 January 2025·2321 words·11 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Multimedia Laboratory, the Chinese University of Hong Kong

GS-DiT: a video generation model with 4D video control that builds pseudo 4D Gaussian fields through efficient 3D point tracking.

DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

5 January 2025·2099 words·10 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Science and Technology of China (USTC)

DepthMaster uses a single-step diffusion model and exploits generative features to substantially improve the accuracy and speed of monocular depth estimation.

Personalized Graph-Based Retrieval for Large Language Models

4 January 2025·3060 words·15 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 UC Santa Cruz

The Personalized Graph-based Retrieval-Augmented Generation (PGraphRAG) framework addresses the sparse-data problem and substantially improves LLM personalization performance.

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

3 January 2025·2176 words·11 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tencent Youtu Lab

VITA-1.5: a GPT-4o-level multimodal LLM for real-time vision and speech interaction.

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

3 January 2025·3242 words·16 mins

AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Gaoling School of Artificial Intelligence, Renmin University of China

Virgo: uses text-based long-form thought data to achieve state-of-the-art performance on a range of multimodal benchmarks.

METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

3 January 2025·2684 words·13 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 University of Southern California

METAGENE-1, a 7-billion-parameter metagenomic large language model trained on wastewater data, achieves state-of-the-art performance on pathogen detection and genomic sequence embedding tasks.

Ingredients: Blending Custom Photos with Video Diffusion Transformers

3 January 2025·2088 words·10 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Kunlun Inc.

Introducing Ingredients, a framework for high-quality multi-ID customized video generation.

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

3 January 2025·2819 words·14 mins

AI Generated 🤗 Daily Papers AI Applications Robotics 🏢 AgiBot

EnerVerse: a future-space generation framework for robotic manipulation that improves performance on long-horizon tasks.

Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models

3 January 2025·3175 words·15 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Ant Group

AUTO-RT: efficiently uncovers LLM vulnerabilities through automated jailbreak strategy exploration.

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

2 January 2025·2466 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

VideoAnydoor: high-fidelity video object insertion with precise motion control.