Yize Cheng

I am a second-year Ph.D. student in Computer Science at the University of Maryland (UMD), advised by Prof. Soheil Feizi. Broadly speaking, my research interests span problems related to the reliability of the evaluation and deployment of LLMs and LLM agents. More specifically, I have worked on reliable LLM evaluation (EMNLP’25), reliable agentic tool call in LLM agents (EMNLP’25), aligning LLM tool call behavior with human in multi-turn settings (arXiv’25), and red-teaming AI text detection (NeurIPS’25). I’ve also worked on LLM reasoning, including interpretable reasoning (arXiv’25) and explorative reasoning (arXiv’26).

Prior to joining UMD, I earned my bachelor’s degree in Computer Science and Electrical Engineering from the Hong Kong University of Science and Technology (HKUST). During my undergraduate studies, I had the opportunity to work with Prof. Minhao Cheng on backdoor attacks and machine learning watermarks. I also spent a semester on exchange at ETH Zurich, where I was fortunate to have worked with Prof. Florian Tramèr on adversarial examples and diffusion models.

News

Sep 18, 2025	One paper accepted at NeurIPS 2025: Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text. See you in San Diego!
Aug 20, 2025	Two papers accepted at EMNLP 2025 Main Conference: DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors Tool Preferences in Agentic LLMs are Unreliable
Jun 08, 2025	New paper on arXiv: Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text. Many thanks to all my collaborators. Code released here.
May 29, 2025	New paper on arXiv: DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors . Many thanks to all my collaborators. Code released here.
May 23, 2025	New paper on arXiv: Gaming Tool Preferences in Agentic LLMs. Many thanks to all my collaborators. Code released here.
Aug 26, 2024	I’ve started my PhD in Computer Science at the University of Maryland, where I am advised by Prof. Soheil Feizi.
May 31, 2024	I’ve graduated from HKUST with the Academic Achievement Medal! Thank you to my parents, friends, and all my mentors!
Sep 16, 2023	I have completed my first research paper, where we propose a simple yet effective clean-label backdoor attack for object detectors. Many thanks to Prof. Minhao Cheng for his invaluable support.

Publications

(*) denotes equal contribution

NeurIPS

Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text

Yize Cheng^*, Vinu Sankar Sadasivan^*, Mehrdad Saberi, Shoumik Saha, and Soheil Feizi

In Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS), 2025

Abs PDF Code

The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to simple paraphrasing attack–which, ironically, increases the true positive at 1% false positive (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT–adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors–including neural network-based, watermark-based, and zero-shot approaches–our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success to find that our method can significantly reduce detection rates, with mostly a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in the light of increasingly sophisticated evasion techniques.
EMNLP

DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Yize Cheng^*, Wenxiao Wang^*, Mazda Moayeri, and Soheil Feizi

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

Abs PDF Video Code Poster

Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
EMNLP

Tool Preferences in Agentic LLMs are Unreliable

Kazem Faghih^*, Wenxiao Wang^*, Yize Cheng^*, Siddhant Bharti, Gaurang Sriramanan, Sriram Balasubramanian, Parsa Hosseini, and Soheil Feizi

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

Abs PDF Video Code Poster

Large language models (LLMs) can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their abilities as various agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use—a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool’s usage from LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive over 10 times more usage from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 17 different models. These phenomena, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources. Our code is publicly available at https://github.com/kazemf78/llm-unreliable-tool-preferences.

Preprints

(*) denotes equal contribution

arXiv

Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception

Yize Cheng^*, Arshia Soltani Moakhar^*, Chenrui Fan^*, Parsa Hosseini, Kazem Faghih, Zahra Sodagar, Wenxiao Wang, and Soheil Feizi

arXiv preprint, 2026

Abs PDF Code

Large language model (LLM) agents are increasingly used to interact with and execute tasks in dynamic environments. However, a critical yet overlooked limitation of these agents is that they, by default, assume a stationary context, failing to account for the real-world time elapsed between messages. We refer to this as "temporal blindness". This limitation hinders decisions about when to invoke tools, leading agents to either over-rely on stale context and skip needed tool calls, or under-rely on it and redundantly repeat tool calls. To study this challenge, we constructed TicToc, a diverse dataset of multi-turn user-agent message trajectories across 76 scenarios, spanning dynamic environments with high, medium, and low time sensitivity. We collected human preferences between "calling a tool" and "directly answering" on each sample, and evaluated how well LLM tool-calling decisions align with human preferences under varying amounts of elapsed time. Our analysis reveals that existing models display poor alignment with human temporal perception, with no model achieving a normalized alignment rate better than 65% when given time stamp information. We also show that naive, prompt-based alignment techniques have limited effectiveness for most models, but specific post-training alignment can be a viable way to align multi-turn LLM tool use with human temporal perception. Our data and findings provide a first step toward understanding and mitigating temporal blindness, offering insights to foster the development of more time-aware and human-aligned agents.
arXiv

Failing to Explore: Language Models on Interactive Tasks

Mahdi JafariRaviz^*, Keivan Rezaei^*, Arshia Soltani Moakhar^*, Zahra Sodagar, Yize Cheng, and Soheil Feizi

arXiv preprint, 2026

Abs PDF Code

We evaluate language models on their ability to explore interactive environments under a limited interaction budget. We introduce three parametric tasks with controllable exploration difficulty, spanning continuous and discrete environments. Across state-of-the-art models, we find systematic under-exploration and suboptimal solutions, with performance often significantly worse than simple explore–exploit heuristic baselines and scaling weakly as the budget increases. Finally, we study two lightweight interventions: splitting a fixed budget into parallel executions, which surprisingly improves performance despite a no-gain theoretical result for our tasks, and periodically summarizing the interaction history, which preserves key discoveries and further improves exploration.
arXiv

Schoenfeld’s Anatomy of Mathematical Reasoning by Language Models

Ming Li^*, Chenrui Fan^*, Yize Cheng^*, Soheil Feizi, and Tianyi Zhou

arXiv preprint, 2025

Abs PDF Code

Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld’s Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
arXiv

Attacking by Aligning: Clean-Label Backdoor Attacks on Object Detection

Yize Cheng^*, Wenbin Hu^*, and Minhao Cheng

arXiv preprint, 2023

Abs PDF Code

Deep neural networks (DNNs) have shown unprecedented success in object detection tasks. However, it was also discovered that DNNs are vulnerable to multiple kinds of attacks, including Backdoor Attacks. Through the attack, the attacker manages to embed a hidden backdoor into the DNN such that the model behaves normally on benign data samples, but makes attacker-specified judgments given the occurrence of a predefined trigger. Although numerous backdoor attacks have been experimented on image classification, backdoor attacks on object detection tasks have not been properly investigated and explored. As object detection has been adopted as an important module in multiple security-sensitive applications such as autonomous driving, backdoor attacks on object detection could pose even more severe threats. Inspired by the inherent property of deep learning-based object detectors, we propose a simple yet effective backdoor attack method against object detection without modifying the ground truth annotations, specifically focusing on the object disappearance attack and object generation attack. Extensive experiments and ablation studies prove the effectiveness of our attack on the benchmark object detection dataset MSCOCO2017, on which we achieve an attack success rate of more than 92% with a poison rate of only 5%.