# About Me
I'm a PhD student at The Chinese University of Hong Kong, supervised by Professor JIA Jiaya and Professor YU Bei. Before that, I obtained my master's degree at the AIM3 Lab, Renmin University of China, under the supervision of Professor JIN Qin. I received my Bachelor's degree in 2021 from South China University of Technology.
My research interests include Computer Vision and Multimodal Large Language Models. Here is my Google Scholar page.
# News
- 2025.10: We are excited to release ViSurf!
- 2025.06: Lyra is accepted by ICCV 2025!
- 2025.05: We are excited to release VisionReasoner!
- 2025.03: We are excited to release Seg-Zero!
- 2024.07: One paper is accepted by ACMMM 2024!
- 2022.11: One paper is accepted by AAAI 2023!
- 2022.10: Our team ranked 1st in the TRECVID 2022 VTT task!
- 2022.05: One paper is accepted by ECCV 2022!
# Publications

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia
- ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning) is a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage.

VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning
Yuqi Liu*, Tianyuan Qu*, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, Jiaya Jia
- VisionReasoner is a unified framework for visual perception tasks.
- Through carefully crafted rewards and training strategies, VisionReasoner exhibits strong multi-task capability, addressing diverse visual perception tasks within a shared model.

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Yuqi Liu*, Bohao Peng*, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia
- Seg-Zero exhibits emergent test-time reasoning ability. It generates a reasoning chain before producing the final segmentation mask.
- Seg-Zero is trained exclusively using reinforcement learning, without any explicit supervised reasoning data.
- Compared to supervised fine-tuning, our Seg-Zero achieves superior performance on both in-domain and out-of-domain data.

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Zhisheng Zhong*, Chengyao Wang*, Yuqi Liu*, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia
- Stronger performance: Achieves state-of-the-art results across a variety of speech-centric tasks.
- More versatile: Supports image, video, speech/long-speech, and sound understanding, as well as speech generation.
- More efficient: Requires less training data and supports faster training and inference.

Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval
Yang Du*, Yuqi Liu*, Qin Jin
- A benchmark designed to evaluate the temporal understanding of video retrieval models.

Token Mixing: Parameter-Efficient Transfer Learning from Image-Language to Video-Language
Yuqi Liu, Luhui Xu, Pengfei Xiong, Qin Jin
- We study how to transfer knowledge from image-language models to video-language tasks.
- We also implement several components proposed in recent works.

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin
- TS2-Net is a text-video retrieval model based on CLIP.
- We propose a token shift transformer and a token selection transformer.
# Education
- 2024.08 - 2028.06 (expected), Ph.D., Department of Computer Science and Engineering, The Chinese University of Hong Kong.
- 2021.09 - 2024.06, M.Phil., School of Information, Renmin University of China.
- 2017.09 - 2021.06, B.E., School of Software Engineering, South China University of Technology.
# Teaching
- 2025 Fall, CSCI1580
- 2025 Spring, ENGG2020
- 2024 Fall, CSCI3170