Paper Reviews by AI
2025
TransPixar: Advancing Text-to-Video Generation with Transparency
·2013 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Adobe Research
TransPixar: high-quality transparent video generation, even with limited data
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
·2799 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Meta
THROUGH-THE-MASK, a two-stage image-to-video generation framework that uses mask-based motion trajectories, enables accurate animation of multiple objects.
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
·3033 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Nanjing University
STAR: a T2V-model-based real-world video super-resolution technique that achieves realistic spatial detail and robust temporal consistency!
Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models
·1134 words·6 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Speech Recognition
🏢 SandLogic Technologies Pvt Ltd.
Built on the Mamba architecture, Samba-ASR uses efficient state-space models to overcome the limitations of existing Transformer models and achieve state-of-the-art performance in speech recognition.
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
·1981 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong
Dispider: enables video LLMs that use disentangled perception, decision, and reaction for real-time interaction.
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
·2104 words·10 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Shanghai AI Laboratory
BoostStep: improving the mathematical ability of LLMs through step-by-step reasoning!
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
·4797 words·23 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Question Answering
🏢 Stanford University
AutoConverter is a system that automatically converts open-ended VQA questions into multiple-choice questions, which can improve the objectivity and reproducibility of VLM (Vision Language Model) evaluation. Using AutoConverter, the researchers built VMCBench, a new benchmark that unifies 20 existing VQA datasets. VMCBen…
ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
·3178 words·15 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 ByteDance
ToolHop: a new benchmark that rigorously evaluates the multi-step tool-use ability of large language models
Test-time Computing: from System-1 Thinking to System-2 Thinking
·699 words·4 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Soochow University
A groundbreaking study on how test-time computing can raise the reasoning ability of large language models from System-1 thinking to the level of System-2 thinking!
Scaling Laws for Floating Point Quantization Training
·5642 words·27 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Tencent AI Lab
A new scaling law for floating-point quantization training: quantifies how exponent bits, mantissa bits, and the calculation precision of the scaling factor affect LLM performance
GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
·2321 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Multimedia Laboratory, the Chinese University of Hong Kong
GS-DiT: an innovative video generation model that uses efficient 3D point tracking to build pseudo 4D Gaussian fields, enabling 4D video control
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
·2099 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 University of Science and Technology of China (USTC)
DepthMaster uses a single-step diffusion model and exploits generative features to markedly improve the accuracy and speed of monocular depth estimation.
Personalized Graph-Based Retrieval for Large Language Models
·3060 words·15 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 UC Santa Cruz
The personalized graph-based retrieval-augmented generation (PGraphRAG) framework addresses the sparse-data problem and substantially improves LLM personalization performance.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
·2176 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tencent Youtu Lab
VITA-1.5: a GPT-4o-level multimodal LLM for real-time vision and speech interaction
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
·3242 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Gaoling School of Artificial Intelligence, Renmin University of China
Virgo: uses text-based long-form thought data to achieve state-of-the-art performance on a range of multimodal benchmarks!
METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring
·2684 words·13 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 University of Southern California
METAGENE-1, a 7-billion-parameter metagenomic large language model trained on wastewater data, achieves state-of-the-art performance on pathogen detection and genomic sequence embedding tasks.
Ingredients: Blending Custom Photos with Video Diffusion Transformers
·2088 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Kunlun Inc.
Introducing Ingredients, an innovative framework for high-quality multi-ID customized video generation!
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
·2819 words·14 mins·
AI Generated
🤗 Daily Papers
AI Applications
Robotics
🏢 AgiBot
EnerVerse: a future-space generation framework for robotic manipulation that improves performance on long-horizon tasks.
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
·3175 words·15 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Ant Group
AUTO-RT: efficiently uncovers LLM vulnerabilities through automated jailbreak strategy exploration!
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
·2466 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
VideoAnydoor: high-quality video object insertion with precise motion control