Joochan Kim

I'm a research intern at KIST in Seoul, mentored by Hwasup Lim and Taekgeun You. I work on Embodied AI, focusing mainly on Vision-Language-Action models.

I finished my master's degree at SNU, where I was advised by Byoung-Tak Zhang. During my master's, I did an internship at A*STAR, mentored by Haiyue Zhu. I received my bachelor's degree at Yonsei University.

Email  /  CV  /  Scholar  /  LinkedIn  /  GitHub

profile photo

Research

I'm interested in multimodal AI, generative AI, and data-centric AI. Most of my research is about understanding videos and images with language guidance, with the goal of moving from internet AI toward embodied AI. Highlighted papers are those to which I made main contributions.

Exploring Ordinal Bias in Action Recognition for Instructional Videos
Joochan Kim, Minjoon Jung, Byoung-Tak Zhang
ICLRW, 2025
poster / arXiv

Ordinal bias leads action recognition models to over-rely on dominant action pairs, inflating benchmark performance; when challenged with action masking and sequence shuffling, these models reveal a lack of true video comprehension.

Background-aware Moment Detection for Video Moment Retrieval
Minjoon Jung, Youwon Jang, Seongho Choi, Joochan Kim, Jin-Hwa Kim, Byoung-Tak Zhang
WACV, 2025
arXiv / Code

We propose Background-aware Moment Detection TRansformer (BM-DETR), which carefully adopts a contrastive approach for robust prediction. BM-DETR achieves state-of-the-art performance on various benchmarks while being highly efficient.

Zero-Shot Vision-and-Language Navigation with Collision Mitigation in Continuous Environment
Seongjun Jeong, Gi-Cheon Kang, Joochan Kim, Byoung-Tak Zhang
WACV, 2025
arXiv

We propose zero-shot Vision-and-Language Navigation with Collision Mitigation (VLN-CM), which outputs low-level actions while considering possible collisions.

Continual Vision-and-Language Navigation
Seongjun Jeong, Gi-Cheon Kang, Seongho Choi, Joochan Kim, Byoung-Tak Zhang
arXiv Preprint, 2024
arXiv

We propose the Continual Vision-and-Language Navigation (CVLN) paradigm, along with two methods for CVLN: Perplexity Replay (PerpR) and Episodic Self-Replay (ESR).

Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval
Minjoon Jung, Seongho Choi, Joochan Kim, Jin-Hwa Kim, Byoung-Tak Zhang
EMNLP, 2022
arXiv

We propose a self-supervised learning framework, the Modal-specific Pseudo Query Generation Network (MPGN). MPGN first selects candidate temporal moments via subtitle-based moment sampling, then generates pseudo queries by exploiting both visual and textual information from the selected moments.

Miscellanea

Teaching

Teaching Assistant, M1522.000300 Spring 2023

Feel free to steal this website's source code. Do not scrape the HTML from this page itself, as it includes analytics tags that you do not want on your own website; use the GitHub code instead. Also, consider using Leonid Keselman's Jekyll fork of this page.