Publications

You can also find my articles on my Google Scholar profile.

Conference

[C8] MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language

Seyoung Song, Seogyeong Jeong, Eunsu Kim, Jiho Jin, Dongkwan Kim, Jay Shin, Alice Oh. Findings of the Association for Computational Linguistics: EMNLP 2025 (EMNLP-Findings 2025, Long), 2025

One-sentence Summary: A language-agnostic framework that evaluates LLM multilingual generation by measuring task completion rates in self-communication scenarios, enabling objective assessment across 2,100+ languages without requiring language-specific tools or humans.

[C5] Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models

Chani Jung, Dongkwan Kim, Jiho Jin, Jiseon Kim, Yeon Seonwoo, Yejin Choi, Alice Oh, Hyunwoo Kim. Empirical Methods in Natural Language Processing (EMNLP), 2024

One-sentence Summary: We assess key precursors of Theory of Mind (ToM) in LLMs using perception-augmented ToM benchmarks and propose PercepToM, a ToM method motivated by our finding that models are strong at perception inference but weak at perception-to-belief inference.

Preprint

Workshop

[W6] PertReasonQA: A Knowledge-Grounded Benchmark and Framework for Cell-State–Conditioned Mechanistic Reasoning of Perturbation Effects

Dongkwan Kim, Yiming Gao, Yining Yang, Yang Shen. Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences at ICML (ICML FM4LS), 2026

One-sentence Summary: We propose PertReasonQA, a knowledge-grounded QA benchmark designed to evaluate how models mechanistically reason through cell-state-specific perturbation effects by leveraging causal pathways.