Visual Question Answering
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage
·2414 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Visual Question Answering
🏢 Seoul National University
To address hallucination in hyper-detailed image captioning, the authors propose CapMAS, a multi-agent system built on LLM-MLLM collaboration, which improves both factuality and coverage.
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
·4794 words·23 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Visual Question Answering
🏢 Stanford University
Introducing VSI-Bench, a new video-based benchmark that helps improve the visual-spatial intelligence of MLLMs!