publications | Xiyang Hu

2026

preprint

When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

Xiaolin Zhou, Aojie Yuan, Zheng Luo, and 12 more authors

2026

preprint

Counterfactual Trace Auditing of LLM Agent Skills

Xiaolin Zhou, Jinbo Liu, Li Li, and 2 more authors

2026

ICLR

TrustGen: A Platform of Dynamic Benchmarking on the Trustworthiness of Generative Foundation Models

Yue Huang, Chujie Gao, Siyuan Wu, and 61 more authors

In The Fourteenth International Conference on Learning Representations (ICLR), 2026

ACL

Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models

Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, and 3 more authors

In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL), 2026

ICML

"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval

Jiate Li, Defu Cao, Li Li, and 8 more authors

In Proceedings of International Conference on Machine Learning (ICML), 2026

ACL

Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization

Tiancheng Xing, Jerry Li, Yixuan Du, and 1 more author

In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL), 2026

ACL

Value-Action Alignment in Large Language Models under Privacy-Prosocial Conflict

Guanyu Chen, Chenxiao Yu, and Xiyang Hu

In Findings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL), 2026

ACL

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Jinbo Liu, Defu Cao, Yifei Wei, and 5 more authors

In Findings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL), 2026

AAAI

Mitigating Hallucinations in Large Language Models via Causal Reasoning

Yuangang Li, Yiqing Shen, Yi Nian, and 7 more authors

In Proceedings of the AAAI Conference on Artificial Intelligence, 2026

MS

Human-Algorithmic Bias: Source, Evolution, and Impact

Xiyang Hu, Yan Huang, Beibei Li, and 1 more author

Management Science, 2026

Prior work on human-algorithmic bias has seen difficulty in empirically identifying the underlying mechanisms of bias, because in a typical "one-time" decision-making scenario, different mechanisms tend to generate the same patterns of observable decisions. In this study, leveraging a unique repeat decision-making setting in a high-stakes micro-lending context, we aim to uncover the underlying source, evolution dynamics, and associated impacts of bias. We first develop and estimate a structural econometric model of the decision dynamics to understand the source and evolution of potential bias in human evaluators in microloan granting. We find that both preference-based bias and belief-based bias are present in human evaluators’ decisions and are in favor of female applicants. Through counterfactual simulations, we quantify the effects of the two types of bias on both fairness and profits. The results show that the elimination of either of the two biases improves the fairness in financial resource allocation, as well as the platform profits. The profit improvement mainly stems from the increase in the approval probability for male borrowers, especially those who would eventually pay back loans. Furthermore, to examine how human biases evolve when being inherited by machine learning (ML) algorithms, we then train a set of state-of-the-art ML algorithms for default risk prediction on both real-world datasets with human biases encoded within and counterfactual datasets with human biases partially or fully removed. By comparing the decision outcomes in different counterfactual settings, we find that even fairness-unaware ML algorithms can reduce bias present in human loan-granting decisions. Interestingly, while removing both types of human biases from the training data can further improve ML fairness, the fairness-enhancing effects vary significantly between new and repeat applicants. Based on our findings, we discuss how to reduce decision bias most effectively in a human-machine learning pipeline.

2025

IJCNLP-AACL

AD-AGENT: A Multi-agent Framework for End-to-end Anomaly Detection

Tiankai Yang, Junjun Liu, Wingchun Siu, and 6 more authors

In Findings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL), 2025

preprint

StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization

Yiming Tang, Yi Fan, Chenxiao Yu, and 3 more authors

2025

preprint

Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines

Xiyang Hu

2025

preprint

A Large-Scale Simulation on Large Language Models for Decision-Making in Political Science

Chenxiao Yu, Jinyi Ye, Yuangang Li, and 5 more authors

2025

preprint

DrugAgent: A Theory-Driven LLM Multi-Agent System for Automating Machine Learning Programming in Drug Discovery

Jiate Li, Sizhe Liu, Xiyang Hu, and 2 more authors

Available at SSRN 5746063, 2025

preprint

DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration

Sizhe Liu, Yizhou Lu, Siyu Chen, and 4 more authors

2025

ICCV

Secure On-Device Video OOD Detection Without Backpropagation

Shawn Li, Peilin Cai, Yuxiao Zhou, and 7 more authors

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2025
preprint

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

Yue Huang, Chujie Gao, Siyuan Wu, and 63 more authors

Oct 2025

AD-LLM: Benchmarking Large Language Models for Anomaly Detection

Tiankai Yang, Yi Nian, Li Li, and 9 more authors

In Findings of the Association for Computational Linguistics: ACL 2025, Jul 2025

Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs’ pre-trained knowledge to perform AD without task-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.
NLP-ADBench: NLP Anomaly Detection Benchmark

Yuangang Li, Jiaqi Li, Zhuo Xiao, and 4 more authors

In Findings of the Association for Computational Linguistics: EMNLP 2025, Nov 2025

2024

preprint

Political-LLM: Large Language Models in Political Science

Lincan Li, Jiaqi Li, Catherine Chen, and 44 more authors

Nov 2024

preprint

Towards More Accurate US Presidential Election via Multi-step Reasoning with Large Language Models

Chenxiao Yu, Zhaotian Weng, Yuangang Li, and 3 more authors

Nov 2024

COOD: Concept-based Zero-shot OOD Detection

Zhendong Liu, Yi Nian, Henry Peng Zou, and 3 more authors

Nov 2024

JMLR

PyGOD: A Python Library for Graph Outlier Detection

Kay Liu, Yingtong Dou, Xueying Ding, and 5 more authors

Journal of Machine Learning Research, Nov 2024

2023

NeurIPS

ADGym: Design Choices for Deep Anomaly Detection

Minqi Jiang, Chaochuan Hou, Ao Zheng, and 5 more authors

Advances in Neural Information Processing Systems, Nov 2023

Deep learning (DL) techniques have recently been applied to anomaly detection (AD), with numerous successful applications in finance, medical service, cloud computing, etc. However, current research often directly evaluates a deep AD algorithm as a whole, which cannot disentangle the contribution of each design choice (e.g., loss functions and network architectures). Meanwhile, we may neglect the contribution of other meaningful prerequisite steps like preprocessing by giving all credit to newly designed loss functions and/or architectures. In this paper, we address the above gaps by answering: (i) which components (i.e., design choices) of deep AD methods play crucial roles in detecting anomalies? (ii) how can we build more effective AD algorithms given different datasets by automatically choosing the optimal design choices rather than using existing off-the-shelf ones? To this end, we propose ADGym, the first and fully open-source design platform for large-scale evaluation and automatic selection of AD design choices. Extensive experiments show that directly using existing leading methods is not optimal, and the AD models composed by ADGym significantly outperform state-of-the-art methods.
ACL

Language Agnostic Multilingual Information Retrieval with Contrastive Learning

Xiyang Hu, Xinchi Chen, Peng Qi, and 4 more authors

Annual Meeting of the Association for Computational Linguistics - Findings of ACL, Nov 2023

Multilingual information retrieval (IR) is challenging since annotated training data is costly to obtain in many languages. We present an effective method to train multilingual IR systems when only English IR training data is available, by leveraging parallel and non-parallel corpora to improve the pretrained multilingual language models’ cross-lingual transfer ability. We design a semantic contrastive loss to align representations of parallel sentences that share the same semantics in different languages, and a new language contrastive loss to leverage parallel sentence pairs to remove language-specific information in sentence representations from non-parallel corpora. When trained on English IR data with these losses and evaluated zero-shot on non-English data, our model demonstrates significant improvement over prior work on retrieval performance, while requiring much less computational effort. Our model can work well even with a small number of parallel sentences, and it can be used as an add-on module to any backbone and other tasks.
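To make the idea of aligning parallel sentences concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss with in-batch negatives. It is not necessarily the loss used in the paper: the function names `cosine` and `info_nce`, the temperature value, and the toy embeddings are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(src, tgt, temperature=0.1):
    """Mean InfoNCE loss; src[i] and tgt[i] are assumed to embed a parallel pair.

    Every other tgt[j] in the batch acts as a negative for src[i].
    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    losses = []
    for i, anchor in enumerate(src):
        logits = [cosine(anchor, t) / temperature for t in tgt]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy check: correctly paired embeddings give a lower loss than mismatched ones.
english = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(english, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(english, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

Minimizing a loss of this shape pulls the two sides of each parallel pair together and pushes apart sentences that are not translations of each other.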
preprint

Inclusive Decision Making via Contrastive Learning and Domain Adaptation

Xiyang Hu, Yan Huang, Beibei Li, and 1 more author

MIS Quarterly (Under Major Revision), Nov 2023

preprint

Weakly Supervised Anomaly Detection: A Survey

Minqi Jiang, Chaochuan Hou, Ao Zheng, and 6 more authors

arXiv preprint arXiv:2302.04549, Nov 2023

2022

NeurIPS

ADBench: Anomaly Detection Benchmark

Xiyang Hu, Songqiao Han, Hailiang Huang, and 2 more authors

Advances in Neural Information Processing Systems, Nov 2022

Given a long list of anomaly detection algorithms developed in the last few decades, how do they perform with regard to (i) varying levels of supervision, (ii) different types of anomalies, and (iii) noisy and corrupted data? In this work, we answer these key questions by conducting (to the best of our knowledge) the most comprehensive anomaly detection benchmark with 30 algorithms on 55 benchmark datasets, named ADBench. Our extensive experiments (93,654 in total) identify meaningful insights into the role of supervision and anomaly types, and unlock future directions for researchers in algorithm selection and design. With ADBench, researchers can easily conduct comprehensive and fair evaluations for newly proposed methods on the datasets (including our contributed ones from natural language and computer vision domains) against the existing baselines. To foster accessibility and reproducibility, we fully open-source ADBench and the corresponding results.
NeurIPS

Benchmarking Node Outlier Detection on Graphs

Kay Liu, Yingtong Dou, Yue Zhao, and 8 more authors

Advances in Neural Information Processing Systems, Nov 2022

Graph outlier detection is an emerging but crucial machine learning task with numerous applications. Despite the proliferation of algorithms developed in recent years, the lack of a standard and unified setting for performance evaluation limits their advancement and usage in real-world applications. To fill this gap, we present (to the best of our knowledge) the first comprehensive unsupervised node outlier detection benchmark for graphs, called UNOD, with the following highlights: (1) evaluating fourteen methods with backbones spanning from classical matrix factorization to the latest graph neural networks; (2) benchmarking the method performance with different types of injected outliers and organic outliers on real-world datasets; (3) comparing the efficiency and scalability of the algorithms by runtime and GPU memory usage on synthetic graphs at different scales. Based on the analyses of extensive experimental results, we discuss the pros and cons of current UNOD methods and point out multiple crucial and promising future research directions.
ICIS

Credit Risk Modeling without Sensitive Features: An Adversarial Deep Learning Model for Fairness and Profit

Xiyang Hu, Yan Huang, Beibei Li, and 1 more author

International Conference on Information Systems, Nov 2022

We propose an adversarial deep learning model for credit risk modeling. We make use of a sophisticated machine learning model’s ability to triangulate (i.e., infer the sensitive group affiliation by using only permissible features), which is often deemed “troublesome” in fair machine learning research, in a positive way to increase both borrower welfare and lender profits while improving fairness. We train and test our model on a dataset from a real-world microloan company. Our model significantly outperforms regular deep neural networks without adversaries and the most popular credit risk model XGBoost, in terms of both improving borrowers’ welfare and lenders’ profits. Our empirical findings also suggest that the traditional AUC metric cannot reflect a model’s performance on the borrowers’ welfare and lenders’ profits. Our framework is ready to be customized for other microloan firms, and can be easily adapted to many other decision-making scenarios.
TKDE

ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions

Zheng Li, Yue Zhao, Xiyang Hu, and 3 more authors

IEEE Transactions on Knowledge and Data Engineering, Nov 2022

Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
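The three steps described in the abstract (per-dimension empirical CDF, per-dimension tail probabilities, aggregation across dimensions) can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the released implementation: the function name `ecdf_outlier_scores` is hypothetical, and the aggregation simply takes the larger of the summed left-tail and right-tail negative log probabilities, which leaves out any refinements of the published method.

```python
import math


def ecdf_outlier_scores(data):
    """Return one outlier score per row of `data` (a list of equal-length rows).

    For each dimension, the empirical left-tail and right-tail probabilities of
    a value are the fractions of samples that are <= and >= that value.
    Scores aggregate negative log tail probabilities across dimensions.
    """
    n = len(data)
    dims = len(data[0])
    left_total = [0.0] * n
    right_total = [0.0] * n
    for j in range(dims):
        column = [row[j] for row in data]
        for i, value in enumerate(column):
            # Both fractions are at least 1/n, so the logarithm is defined.
            left = sum(1 for v in column if v <= value) / n
            right = sum(1 for v in column if v >= value) / n
            left_total[i] += -math.log(left)
            right_total[i] += -math.log(right)
    # Simplifying assumption: a point is as extreme as its more extreme tail.
    return [max(l, r) for l, r in zip(left_total, right_total)]


# Toy data: the last row sits far in the right tail of both dimensions.
points = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0], [8.0, 9.0]]
scores = ecdf_outlier_scores(points)
print(max(range(len(scores)), key=scores.__getitem__))
```

Because the score is a sum over dimensions, the approach needs no hyperparameters, and each dimension's term shows how much that feature contributes to a point's score.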

2021

ICIS

Uncovering the Source of Evaluation Bias in Micro-Lending

Xiyang Hu, Yan Huang, Beibei Li, and 1 more author

International Conference on Information Systems, Nov 2021

We develop a structural econometric model to capture the decision dynamics of human evaluators on an online micro-lending platform, and estimate the model parameters using real-world data. We find that two types of gender bias, i.e., preference-based bias and belief-based bias, are present in human evaluators’ decisions. Both types of biases are in favor of female applicants. Through counterfactual simulations, we quantify the effect of gender bias on loan granting outcomes and the welfare of the company and the borrowers. Our results imply that both the existence of the preference-based bias and that of the belief-based bias reduce the company’s profits. When the preference-based bias is removed, the company earns more profits. When the belief-based bias is removed, the company’s profits also increase. Both increases result from lowering the approval probability for borrowers, especially female borrowers, who eventually default on loans. For borrowers, the elimination of either bias decreases the gender gap in the credit risk evaluation.
MLSys

SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection

Xiyang Hu, Yue Zhao, Cheng Cheng, and 11 more authors

Conference on Machine Learning and Systems, Nov 2021

Outlier detection (OD) is a key data mining task for identifying abnormal objects from general samples with numerous high-stakes applications including fraud detection and intrusion detection. Due to the lack of ground truth labels, practitioners often have to build a large number of unsupervised models that are heterogeneous (i.e., different algorithms and hyperparameters) for further combination and analysis with ensemble learning, rather than relying on a single model. However, this yields severe scalability issues on high-dimensional, large datasets. How can the training and prediction of a large number of heterogeneous unsupervised OD models be accelerated? How can we ensure that the acceleration does not degrade the detection models’ accuracy? How can the acceleration accommodate both a single-worker setting and a distributed system with multiple workers? In this study, we propose a three-module acceleration system called SUOD (scalable unsupervised outlier detection) to address these questions. It focuses on three complementary aspects to accelerate (dimensionality reduction for high-dimensional data, model approximation for complex models, and execution efficiency improvement for taskload imbalance within distributed systems), while controlling detection performance degradation. Extensive experiments on more than 20 benchmark datasets demonstrate SUOD’s effectiveness in heterogeneous OD acceleration. As of submission, the released open-source system had been downloaded more than 700,000 times. A real-world deployment case on fraudulent claim analysis at IQVIA, a leading healthcare firm, is also provided.
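The taskload-imbalance problem mentioned above can be illustrated with a small scheduling sketch. The abstract does not describe SUOD's scheduler, so the code below swaps in a standard greedy longest-processing-time-first heuristic; the function name `balance_tasks` and the per-model cost estimates are hypothetical.

```python
import heapq


def balance_tasks(costs, num_workers):
    """Assign task indices to workers, always giving the next most expensive
    task to the currently least-loaded worker (greedy LPT heuristic)."""
    # Min-heap of (current load, worker id).
    heap = [(0.0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]
    by_cost = sorted(range(len(costs)), key=lambda t: costs[t], reverse=True)
    for task in by_cost:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + costs[task], worker))
    return assignment


# Hypothetical estimated training costs for five heterogeneous detectors.
estimated_costs = [8.0, 7.0, 6.0, 5.0, 4.0]
plan = balance_tasks(estimated_costs, num_workers=2)
loads = [sum(estimated_costs[t] for t in tasks) for tasks in plan]
print(plan, loads)
```

Splitting models evenly by count can leave one worker with all the slow models; assigning by estimated cost keeps the per-worker totals closer together.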
KDD

Uncovering the Source of Machine Bias

Xiyang Hu, Yan Huang, Beibei Li, and 1 more author

27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Machine Learning for Consumers and Markets Workshop, Nov 2021

We develop a structural econometric model to capture the decision dynamics of human evaluators on an online micro-lending platform, and estimate the model parameters using a real-world dataset. We find that two types of gender bias, preference-based bias and belief-based bias, are present in human evaluators’ decisions. Both types of biases are in favor of female applicants. Through counterfactual simulations, we quantify the effect of gender bias on loan granting outcomes and the welfare of the company and the borrowers. Our results imply that both the existence of the preference-based bias and that of the belief-based bias reduce the company’s profits. When the preference-based bias is removed, the company earns more profits. When the belief-based bias is removed, the company’s profits also increase. Both increases result from raising the approval probability for borrowers, especially male borrowers, who eventually pay back loans. For borrowers, the elimination of either bias decreases the gender gap of the true positive rates in the credit risk evaluation. We also train machine learning algorithms on both the real-world data and the data from the counterfactual simulations. We compare the decisions made by those algorithms to see how evaluators’ biases are inherited by the algorithms and reflected in machine-based decisions. We find that machine learning algorithms can mitigate both the preference-based bias and the belief-based bias.
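The "gender gap of the true positive rates" can be computed directly from decisions and outcomes. The sketch below is illustrative only: it assumes that a positive label means the borrower repaid and a positive decision means the loan was approved, and the function names, group codes, and toy records are hypothetical rather than taken from the paper.

```python
def true_positive_rate(labels, decisions):
    """Share of actual positives (label 1) that received a positive decision (1).

    Assumes at least one positive label is present.
    """
    approved_among_positives = [d for y, d in zip(labels, decisions) if y == 1]
    return sum(approved_among_positives) / len(approved_among_positives)


def tpr_gap(labels, decisions, groups, group_a, group_b):
    """Difference in true positive rate between two groups (a minus b)."""
    def rate(group):
        idx = [i for i, g in enumerate(groups) if g == group]
        return true_positive_rate(
            [labels[i] for i in idx], [decisions[i] for i in idx]
        )
    return rate(group_a) - rate(group_b)


# Toy records: label 1 = repaid, decision 1 = approved (assumed encoding).
labels = [1, 1, 1, 1, 0, 0]
decisions = [1, 1, 1, 0, 1, 0]
groups = ["F", "F", "M", "M", "F", "M"]
print(tpr_gap(labels, decisions, groups, "F", "M"))
```

A gap of zero means creditworthy applicants in both groups are approved at the same rate; a positive value here means creditworthy applicants in the first group are approved more often.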
JAH

开化寺大雄宝殿彩画测绘复原 (Survey and Restoration of the Painted Decoration of the Main Hall of Kaihua Temple)

Xiyang Hu, and others

建筑史学刊(Journal of Architecture History), Nov 2021

2020

ICDM

COPOD: Copula-Based Outlier Detection

Zheng Li, Yue Zhao, Nicola Botta, and 2 more authors

IEEE International Conference on Data Mining, Nov 2020

Outlier detection refers to the identification of rare items that deviate from the general data distribution. Existing approaches suffer from high computational complexity, low predictive capability, and limited interpretability. As a remedy, we present a novel outlier detection algorithm called COPOD, which is inspired by copulas for modeling multivariate data distribution. COPOD first constructs an empirical copula, and then uses it to predict tail probabilities of each given data point to determine its level of "extremeness". Intuitively, we think of this as calculating an anomalous p-value. This makes COPOD parameter-free, highly interpretable, and computationally efficient. In this work, we make three key contributions: (1) we propose a novel, parameter-free outlier detection algorithm with both great performance and interpretability; (2) we perform extensive experiments on 30 benchmark datasets to show that COPOD outperforms competing detectors in most cases and is also one of the fastest algorithms; and (3) we release an easy-to-use Python implementation for reproducibility.
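To show what "constructing an empirical copula" and reading off tail probabilities might look like, here is a small plain-Python sketch. It covers only the left tail and omits everything else in the published method, so it should be read as an illustration of the idea rather than as COPOD itself; both function names are hypothetical.

```python
import math


def empirical_copula_observations(data):
    """Map each value to its per-dimension empirical CDF value in (0, 1]."""
    n = len(data)
    columns = list(zip(*data))
    return [
        [sum(1 for v in columns[j] if v <= row[j]) / n for j in range(len(row))]
        for row in data
    ]


def left_tail_contributions(u_row):
    """Per-dimension negative log tail probabilities for one point.

    Their sum acts as a left-tail outlier score, and the individual terms show
    which dimensions make the point look extreme.
    """
    return [-math.log(u) for u in u_row]


# Toy data: the first row is unusually small in dimension 0 only.
rows = [[0.1, 5.0], [4.0, 4.0], [5.0, 6.0], [6.0, 7.0]]
u = empirical_copula_observations(rows)
print(u[0], left_tail_contributions(u[0]))
```

The per-dimension terms are what make a score of this kind easy to interpret: a large term points at the feature responsible for a point being flagged.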

2019

NeurIPS

Optimal Sparse Decision Trees

Xiyang Hu, Cynthia Rudin, and Margo Seltzer

Advances in Neural Information Processing Systems, Nov 2019

Decision tree algorithms have been among the most popular algorithms for interpretable (transparent) machine learning since the early 1980s. The problem that has plagued decision tree algorithms since their inception is their lack of optimality, or lack of guarantees of closeness to optimality: decision tree algorithms are often greedy or myopic, and sometimes produce unquestionably suboptimal models. Hardness of decision tree optimization is both a theoretical and practical obstacle, and even careful mathematical programming approaches have not been able to solve these problems efficiently. This work introduces the first practical algorithm for optimal decision trees for binary variables. The algorithm is a co-design of analytical bounds that reduce the search space and modern systems techniques, including data structures and a custom bit-vector library. Our experiments highlight advantages in scalability, speed, and proof of optimality.
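One way to see how analytical bounds can shrink the search space is to look at a regularized objective that trades accuracy against tree size. The abstract does not spell out the objective, so the form below (misclassification rate plus a per-leaf penalty `lam`) and the pruning rule are assumptions made for illustration, and both function names are hypothetical.

```python
def sparse_tree_objective(num_misclassified, num_samples, num_leaves, lam):
    """Misclassification rate plus a penalty of `lam` per leaf (assumed form)."""
    return num_misclassified / num_samples + lam * num_leaves


def can_prune(num_leaves, lam, best_objective):
    """A tree with this many leaves costs at least lam * num_leaves, even with
    zero training error, so it cannot beat an objective already at or below that."""
    return lam * num_leaves >= best_objective


# A 3-leaf tree with 5 errors on 100 samples, penalty 0.01 per leaf.
best = sparse_tree_objective(5, 100, 3, 0.01)
print(best)
print(can_prune(10, 0.01, best), can_prune(3, 0.01, best))
```

Under this assumed objective, any candidate with ten or more leaves can be discarded without evaluating it, which is the flavor of argument that lets an exact search skip large parts of the space.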