This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
Although conversational agents based on large language models (LLMs) demonstrate strong fluency and coherence, they continue to exhibit behavioral errors, such as inconsistencies and factual inaccuracies. Detecting and mitigating these errors is critical for developing trustworthy systems. However, current response correction methods rely heavily on LLMs, which require information about the nature of an error or hints about its occurrence for accurate detection. This limits their ability to identify errors not defined in their instructions or covered by external tools, such as those arising from updates to the response-generation model or shifts in user behavior.
Figure 1: Feedback-guided response generation: (1) The response-generation model produces an initial response. (2) The feedback LLM, or in self-correcting systems the response-generation model itself, evaluates the response for errors, often using external tools. Recent work shows that LLMs require information about the nature of an error or hints about its occurrence for accurate detection. (3) The feedback LLM provides guidance (feedback) to the response-generation model to refine its output. (4) The final response is presented to the user.
In this work, we introduce Automated Error Discovery, a framework for detecting and defining behavioral errors in conversational AI, and propose SEEED (Soft-clustering Extended Encoder-Based Error Detection), an encoder-based alternative to LLMs for error detection. We enhance the Soft Nearest Neighbor Loss by amplifying distance weighting for negative samples and introduce Label-Based Sample Ranking to select highly contrastive examples for better representation learning.
Figure 2: Schematic overview of SEEED, comprising three components: Summary Generation, Error Detection, and Error Definition Generation (e denotes the identified error type). In practical applications (see Figure 1 for an example), the feedback LLM may be used for generating summaries and error definitions (if necessary) to reduce deployment costs, as both are summarization tasks typically covered during LLM pre-training. Newly defined error types are added to the pool of known types, and their dialogue contexts could be used to enhance error detection.
SEEED outperforms adapted baselines across multiple error-annotated dialogue datasets, improving the accuracy for detecting novel behavioral errors by up to 8 points and demonstrating strong generalization to unknown intent detection.
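As described above, SEEED builds on the Soft Nearest Neighbor Loss and amplifies the distance weighting for negative samples. The snippet below is only a minimal PyTorch sketch of that idea, not the implementation used in this repository: it assumes the amplification is a multiplicative factor (`negative_scale`, a name chosen here for illustration) applied to the distances of negative pairs; the exact formulation used by SEEED is given in the paper.

```python
import torch

def soft_nearest_neighbor_loss(embeddings, labels, temperature=0.1, negative_scale=2.0):
    """Soft Nearest Neighbor Loss with amplified distance weighting for negatives.

    embeddings: (N, D) float tensor of dialogue-context representations.
    labels:     (N,)   integer tensor of error-type labels.
    Distances between samples of different error types are multiplied by
    `negative_scale` (> 1), which increases their weight in the denominator
    and pushes representations of different error types further apart.
    """
    n = embeddings.size(0)
    dists = torch.cdist(embeddings, embeddings).pow(2)            # pairwise squared distances
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)       # (N, N) positive-pair mask
    not_self = ~torch.eye(n, dtype=torch.bool, device=embeddings.device)

    # Amplify distances of negative pairs before the softmax-style weighting.
    scaled = torch.where(same_label, dists, dists * negative_scale)
    sims = torch.exp(-scaled / temperature) * not_self            # zero out self-similarities

    numerator = (sims * same_label).sum(dim=1)                    # same-class neighborhood mass
    denominator = sims.sum(dim=1)                                 # full neighborhood mass
    valid = numerator > 0                                         # skip samples without in-batch positives
    return -torch.log(numerator[valid] / denominator[valid]).mean()
```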
In this repository, we provide the source code for SEEED and the baselines used in the paper. In addition, we provide example shell scripts for running our experiments. For experiments with FEDI, we used the error-annotated subset of FEDI v2, available at TU datalib. For experiments with Soda-Eval, we used the dataset as published in the Huggingface Datahub, and for experiments with ABCEval, we used the dataset provided on GitHub.
1. Installation
We provide a `pyproject.toml` file for installing our code as a package in your Python >=3.12 environment. Just run `pip install -e .` to install our code and all required dependencies.
2. Running Our Code
Navigate to `scripts`. There you will find one `.sh` file for each approach: SEEED, SynCID, LOOP, KNN-Contrastive, and our fine-tuning experiments with Phi-4. Remove the SLURM-specific parameters if you are not running the scripts in a SLURM environment, set the required parameters according to your environment, create the required directories, and run the script (or copy the Python command to your shell). Here is a list of all possible parameters:
Argument | Description | Approach |
---|---|---|
--dataset | The path or name of the dataset to use (from a local directory or Huggingface). | |
--token | The token to use for downloading the dataset from Huggingface. | |
--novelty | The ratio of classes to randomly sample as novel. | |
--n_labels | The list of classes to treat as novel (alternative to 'novelty'), e.g., for a subsequent run in a multi-stage training approach. Will always override 'novelty'. | |
--error_turn | Whether or not to use the whole error turn instead of just the error utterance. | |
--model_path | The path to a pretrained huggingface model (the foundation model to be used). | |
--pretrained | The path to a model pretrained using this framework. | |
--batch_size | The number of samples per batch. | |
--epochs | The number of epochs for the main training. | |
--epochs_2 | The number of epochs for the second training stage (if required). | SynCID, LOOP |
--device | The device to use for training ('cuda' or 'cpu'). | |
--save_dir | The directory where to save the trained model. | |
--experiment | The name of the experiment (for MLFlow). | |
--visualize | Whether or not to visualize the representation space after evaluation. | |
--contrastive_weighting | Weighting factor for contrastive learning. | SEEED, KNNContrastive |
--topk | The number of top-k samples to consider for calculating inconsistency. | SEEED |
--positives | The number of positive samples per error type. | SEEED, KNNContrastive |
--num_negatives | The number of negative samples per error type. | SEEED |
--resample_rate | The number of epochs after which the dataset is resampled for contrastive learning. | SEEED |
--error_types | The data type for loading the correct error definitions and templates. | Phi-4, LOOP |
--alpha | The weighting factor for the unsupervised contrastive loss. | SynCID |
--beta | The weighting factor for the supervised contrastive loss. | SynCID |
In our experiments with SEEED, we used only one positive and one negative counterpart per sample, but theoretically there is no upper limit.
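For intuition only, the snippet below sketches one plausible way such counterparts could be selected: in the spirit of Label-Based Sample Ranking, it takes the most distant same-label samples as positives and the closest different-label samples as negatives. The actual ranking criterion used by SEEED is described in the paper, and all names here (`sample_counterparts`, `num_positives`, `num_negatives`) are illustrative, not part of this repository's API.

```python
import torch

def sample_counterparts(embeddings, labels, num_positives=1, num_negatives=1):
    """Illustrative selection of highly contrastive counterparts per sample.

    Assumes each error type appears at least num_positives + 1 times in the batch.
    Returns index tensors of shape (N, num_positives) and (N, num_negatives).
    """
    dists = torch.cdist(embeddings, embeddings)                   # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)             # same-error-type mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)

    # Hard positives: same error type, but far away in the representation space.
    pos_dists = dists.masked_fill(~same | eye, float("-inf"))
    positives = pos_dists.topk(num_positives, dim=1).indices

    # Hard negatives: different error type, but close in the representation space.
    neg_dists = dists.masked_fill(same, float("inf"))
    negatives = (-neg_dists).topk(num_negatives, dim=1).indices
    return positives, negatives
```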
3. Formal Setup

We distinguish two sub-tasks, Error Detection and Error Definition Generation, and define the following formal setup:
- $E = E^K \cup E^U$ is the set of all behavioral error types. $E^K = \{(e_i, d_i)\}_{i=1}^m$ is the set of known error types, with $e_i$ as the error identifier and $d_i$ as its definition. $E^U$ denotes the set of unknown error types. $E^K \cap E^U = \emptyset$.
- $C = C^K \cup C^U$ denotes the set of all dialogue contexts $T$, with $C^K$ as the set of all $T$ associated with a behavioral error $e$ from $E^K$. $C^U$ is the set of dialogues associated with unknown behavioral errors. $C^K \cap C^U = \emptyset$.
- We define a dialogue context $T$ as a sequence of user-agent utterances (turns). Depending on the use case, $T$ may be associated with additional features, such as external knowledge documents in knowledge-grounded dialogues. We refer to these additional features as $W_T$. In this work, $W$ is relevant only as external knowledge in the knowledge-grounded subset of FEDI.
Given an error detection function that maps each dialogue context $T \in C$ to an error type $e \in E$, the goal of Error Detection is to correctly identify known error types from $E^K$ and to recognize contexts whose error type is not yet part of $E^K$ as novel.

When the detected error type is novel, i.e., not part of $E^K$, Error Definition Generation is the task of generating a definition $d$ for it, so that the new pair $(e, d)$ can be added to the set of known error types.
In practical implementations, this new data can be used to enhance error detection: newly defined error types are added to the pool of known types, and their dialogue contexts can serve as additional training examples.
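To make the interplay of the two sub-tasks concrete, the following is a hypothetical sketch of the loop described above; all class and method names are illustrative and not part of this repository's API. A detector assigns a context an error type, and a definition is generated only when that type is not yet known, after which the new type joins the pool of known types.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorType:
    identifier: str   # e_i
    definition: str   # d_i

@dataclass
class ErrorDiscovery:
    known_types: dict[str, ErrorType] = field(default_factory=dict)  # E^K

    def detect(self, dialogue_context: list[str]) -> str:
        """Error Detection: assign the dialogue context an error type (placeholder)."""
        raise NotImplementedError  # e.g., an encoder-based detector such as SEEED

    def define(self, dialogue_context: list[str]) -> str:
        """Error Definition Generation for a novel error type (placeholder)."""
        raise NotImplementedError  # e.g., a feedback LLM summarizing the error

    def process(self, dialogue_context: list[str]) -> ErrorType:
        identifier = self.detect(dialogue_context)
        if identifier in self.known_types:           # known error type from E^K
            return self.known_types[identifier]
        # Novel error type: generate a definition and add it to the known pool.
        new_type = ErrorType(identifier, self.define(dialogue_context))
        self.known_types[identifier] = new_type
        return new_type
```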
4. Citation

Please cite our work as follows:
@article{petrak2025towards,
title={Towards Automated Error Detection: A Study in Conversational AI},
author={Petrak, Dominic and Tran, Thy and Gurevych, Iryna},
journal={arXiv preprint arXiv:2509.10833},
year={2025},
url={http://arxiv.org/abs/2509.10833}
}
5. Contact

Dominic Petrak ([email protected])