Research · Engineering

Li Yantao's Homepage

I study multimodal LLMs (MLLMs), agents, and diffusion language models. This site collects my research, projects, and working notes. Have a good day.

About

Hi, I’m Li Yantao. This is my homepage for sharing research directions, publications, projects, and occasional notes.

I am a Ph.D. candidate in the NLP Group at Nanjing University (NJU NLP), where I entered the direct doctorate program in 2023. I am currently interning at China Unicom. My advisor is Assoc. Prof. Jian-Bing Zhang.

Research interests. Reinforcement learning for LLMs, especially multimodal reasoning and reward modeling, along with exploratory work on diffusion language models (Diffusion LLMs).

Contact. Reach me anytime at li_yantao@smail.nju.edu.cn. I’m always happy to connect.

Publications

Selected papers and research projects.

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

CVPR · 2026 · First Author

Abstract: Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations: cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not...

arXiv · Code · Project

Vision-Language Models Can Self-Improve Reasoning via Reflection

NAACL · 2025 · First Author

Abstract: Chain-of-thought (CoT) has proven to improve the reasoning capability of large language models (LLMs). However, due to the complexity of multimodal scenarios and the difficulty in collecting high-quality CoT data, CoT reasoning in multimodal LLMs has been largely overlooked. To this end, we propose a simple yet effective self-training...

arXiv · Code

MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

ICLR · 2026

Abstract: In Flow Matching inference, existing caching methods primarily rely on reusing Instantaneous Velocity or its feature-level proxies. However, we observe that instantaneous velocity often exhibits sharp fluctuations across timesteps. This leads to severe trajectory deviations and cumulative errors, especially as the cache interval increases. Inspired by MeanFlow, we propose...

arXiv · Code

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

ACL · 2024

Abstract: Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we...

arXiv · Code

Latest Notes

Research thoughts, reading notes, and technical essays.

Welcome

January 1, 2024 · 2 min read

A short note about what this research blog will collect.
