This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
Although conversational agents based on large language models (LLMs) demonstrate strong fluency and coherence, they continue to exhibit behavioral errors, such as inconsistencies and factual inaccuracies. Detecting and mitigating these errors is critical for developing trustworthy systems. However, current response correction methods rely heavily on LLMs, which require information about the nature of an error or hints about its occurrence for accurate detection. This limits their ability to identify errors not defined in their instructions or covered by external tools, such as those arising from updates to the response-generation model or shifts in user behavior.
Figure 1: Feedback-guided response generation: (1) The response-generation model produces an initial response. (2) The feedback LLM, or in self-correcting systems the response-generation model itself, evaluates the response for errors, often using external tools. Recent work shows that LLMs require information about the nature of an error or hints about its occurrence for accurate detection. (3) The feedback LLM provides guidance (feedback) to the response-generation model to refine its output. (4) The final response is presented to the user.
In this work, we introduce Automated Error Discovery, a framework for detecting and defining behavioral errors in conversational AI, and propose SEEED (Soft-clustering Extended Encoder-Based Error Detection), an encoder-based alternative to LLMs for error detection. We enhance the Soft Nearest Neighbor Loss by amplifying distance weighting for negative samples and introduce Label-Based Sample Ranking to select highly contrastive examples for better representation learning.
Figure 2: Schematic overview of SEEED, comprising three components: Summary Generation, Error Detection, and Error Definition Generation (e denotes the identified error type). In practical applications (see Figure 1 for an example), the feedback LLM may be used for generating summaries and error definitions (if necessary) to reduce deployment costs, as both are summarization tasks typically covered during LLM pre-training. Newly defined error types are added to the pool of known types, and their dialogue contexts could be used to enhance error detection.
SEEED outperforms adapted baselines across multiple error-annotated dialogue datasets, improving the accuracy for detecting novel behavioral errors by up to 8 points and demonstrating strong generalization to unknown intent detection.
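As described above, SEEED builds on the Soft Nearest Neighbor Loss and amplifies the distance weighting for negative samples. The snippet below is only a minimal PyTorch sketch of that idea, not the implementation used in this repository: it assumes the amplification is a multiplicative factor (`negative_scale`, a name chosen here for illustration) applied to the distances of negative pairs; the exact formulation used by SEEED is given in the paper.

```python
import torch

def soft_nearest_neighbor_loss(embeddings, labels, temperature=0.1, negative_scale=2.0):
    """Soft Nearest Neighbor Loss with amplified distance weighting for negatives.

    embeddings: (N, D) float tensor of dialogue-context representations.
    labels:     (N,)   integer tensor of error-type labels.
    Distances between samples of different error types are multiplied by
    `negative_scale` (> 1), which increases their weight in the denominator
    and pushes representations of different error types further apart.
    """
    n = embeddings.size(0)
    dists = torch.cdist(embeddings, embeddings).pow(2)            # pairwise squared distances
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)       # (N, N) positive-pair mask
    not_self = ~torch.eye(n, dtype=torch.bool, device=embeddings.device)

    # Amplify distances of negative pairs before the softmax-style weighting.
    scaled = torch.where(same_label, dists, dists * negative_scale)
    sims = torch.exp(-scaled / temperature) * not_self            # zero out self-similarities

    numerator = (sims * same_label).sum(dim=1)                    # same-class neighborhood mass
    denominator = sims.sum(dim=1)                                 # full neighborhood mass
    valid = numerator > 0                                         # skip samples without in-batch positives
    return -torch.log(numerator[valid] / denominator[valid]).mean()
```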
In this repository, we provide the source code for SEEED and the baselines used in the paper. In addition, we provide example shell scripts for running our experiments. For experiments with FEDI, we used the error-annotated subset of FEDI v2, available at TU datalib. For experiments with Soda-Eval, we used the dataset as published in the Huggingface Datahub, and for experiments with ABCEval, we used the dataset provided on GitHub.
1. Installation
We provide a `pyproject.toml` file for installing our code as a package in your Python >=3.12 environment. Just run `pip install -e .` to install our code and all required dependencies.
2. Running Our Code
Navigate to `scripts`. There you will find one `.sh` file for each approach: SEEED, SynCID, LOOP, KNN-Contrastive, and our fine-tuning experiments with Phi-4. Remove the SLURM-specific parameters if you are not running the scripts in a SLURM environment, set the required parameters according to your environment, create the required directories, and run the script (or copy the Python command to your shell). Here is a list of all possible parameters:
Argument | Description | Approach |
---|---|---|
--dataset | The path or name of the dataset to use (from a local directory or Huggingface). | |
--token | The token to use for downloading the dataset from Huggingface. | |
--novelty | The ratio of classes to randomly sample as novel. | |
--n_labels | The list of classes to treat as novel (alternative to 'novelty'), e.g., for a subsequent run in a multi-stage training approach. Will always override 'novelty'. | |
--error_turn | Whether or not to use the whole error turn instead of just the error utterance. | |
--model_path | The path to a pretrained huggingface model (the foundation model to be used). | |
--pretrained | The path to a model pretrained using this framework. | |
--batch_size | The number of samples per batch. | |
--epochs | The number of epochs for the main training. | |
--epochs_2 | The number of epochs for the second training stage (if required). | SynCID, LOOP |
--device | The device to use for training ('cuda' or 'cpu'). | |
--save_dir | The directory where to save the trained model. | |
--experiment | The name of the experiment (for MLFlow). | |
--visualize | Whether or not to visualize the representation space after evaluation. | |
--contrastive_weighting | Weighting factor for contrastive learning. | SEEED, KNNContrastive |
--topk | The number of top-k samples to consider for calculating inconsistency. | SEEED |
--positives | The number of positive samples per error type. | SEEED, KNNContrastive |
--num_negatives | The number of negative samples per error type. | SEEED |
--resample_rate | The number of epochs after which the dataset is resampled for contrastive learning. | SEEED |
--error_types | The data type for loading the correct error definitions and templates. | Phi-4, LOOP |
--alpha | The weighting factor for the unsupervised contrastive loss. | SynCID |
--beta | The weighting factor for the supervised contrastive loss. | SynCID |
In our experiments with SEEED, we used only one positive and one negative counterpart per sample, but theoretically there is no upper limit.
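For intuition only, the snippet below sketches one plausible way such counterparts could be selected: in the spirit of Label-Based Sample Ranking, it takes the most distant same-label samples as positives and the closest different-label samples as negatives. The actual ranking criterion used by SEEED is described in the paper, and all names here (`sample_counterparts`, `num_positives`, `num_negatives`) are illustrative, not part of this repository's API.

```python
import torch

def sample_counterparts(embeddings, labels, num_positives=1, num_negatives=1):
    """Illustrative selection of highly contrastive counterparts per sample.

    Assumes each error type appears at least num_positives + 1 times in the batch.
    Returns index tensors of shape (N, num_positives) and (N, num_negatives).
    """
    dists = torch.cdist(embeddings, embeddings)                   # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)             # same-error-type mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)

    # Hard positives: same error type, but far away in the representation space.
    pos_dists = dists.masked_fill(~same | eye, float("-inf"))
    positives = pos_dists.topk(num_positives, dim=1).indices

    # Hard negatives: different error type, but close in the representation space.
    neg_dists = dists.masked_fill(same, float("inf"))
    negatives = (-neg_dists).topk(num_negatives, dim=1).indices
    return positives, negatives
```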
3. Formal Setup

We distinguish two sub-tasks, Error Detection and Error Definition Generation, and define the following formal setup:
- $E = E^K \cup E^U$ is the set of all behavioral error types. $E^K = \{(e_i, d_i)\}_{i=1}^m$ is the set of known error types, with $e_i$ as the error identifier and $d_i$ as its definition. $E^U$ denotes the set of unknown error types. $E^K \cap E^U = \emptyset$.
- $C = C^K \cup C^U$ denotes the set of all dialogue contexts $T$, with $C^K$ as the set of all $T$ associated with a behavioral error $e$ from $E^K$. $C^U$ is the set of dialogues associated with unknown behavioral errors. $C^K \cap C^U = \emptyset$.
- We define a dialogue context $T$ as a sequence of user-agent utterances (turns). Depending on the use case, $T$ may be associated with additional features, such as external knowledge documents in knowledge-grounded dialogues. We refer to these additional features as $W_T$. In this work, $W$ is relevant only as external knowledge in the knowledge-grounded subset of FEDI.
Given an error detection function that maps each dialogue context $T \in C$ to an error type $e \in E$, the goal of Error Detection is to correctly identify known error types from $E^K$ and to recognize contexts whose error type is not yet part of $E^K$ as novel.

When the detected error type is novel, i.e., not part of $E^K$, Error Definition Generation is the task of generating a definition $d$ for it, so that the new pair $(e, d)$ can be added to the set of known error types.
In practical implementations, this new data can be used to enhance error detection: newly defined error types are added to the pool of known types, and their dialogue contexts can serve as additional training examples.
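To make the interplay of the two sub-tasks concrete, the following is a hypothetical sketch of the loop described above; all class and method names are illustrative and not part of this repository's API. A detector assigns a context an error type, and a definition is generated only when that type is not yet known, after which the new type joins the pool of known types.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorType:
    identifier: str   # e_i
    definition: str   # d_i

@dataclass
class ErrorDiscovery:
    known_types: dict[str, ErrorType] = field(default_factory=dict)  # E^K

    def detect(self, dialogue_context: list[str]) -> str:
        """Error Detection: assign the dialogue context an error type (placeholder)."""
        raise NotImplementedError  # e.g., an encoder-based detector such as SEEED

    def define(self, dialogue_context: list[str]) -> str:
        """Error Definition Generation for a novel error type (placeholder)."""
        raise NotImplementedError  # e.g., a feedback LLM summarizing the error

    def process(self, dialogue_context: list[str]) -> ErrorType:
        identifier = self.detect(dialogue_context)
        if identifier in self.known_types:           # known error type from E^K
            return self.known_types[identifier]
        # Novel error type: generate a definition and add it to the known pool.
        new_type = ErrorType(identifier, self.define(dialogue_context))
        self.known_types[identifier] = new_type
        return new_type
```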
4. Citation

Please cite our work as follows:
@article{petrak2025towards,
title={Towards Automated Error Detection: A Study in Conversational AI},
author={Petrak, Dominic and Tran, Thy and Gurevych, Iryna},
journal={arXiv preprint arXiv:2509.10833},
year={2025},
url={http://arxiv.org/abs/2509.10833}
}
5. Contact

Dominic Petrak ([email protected])