Hugging Face and ServiceNow expand EVA-Bench voice benchmark to three enterprise domains

Hugging Face and ServiceNow have released a new version of EVA-Bench, broadening the voice-agent evaluation dataset from a single enterprise setting to three distinct domains. The updated benchmark now covers airline customer service, enterprise IT service management and healthcare HR service delivery, giving developers a wider range of scenarios to test enterprise voice systems.

The organizations said EVA-Bench 2.0 includes 213 evaluation scenarios and 121 tools, a substantial increase from the original release. The new version is intended to better reflect the variety of real call-center and service-desk interactions that voice agents may face in production.

Broader coverage for enterprise voice agents

The three domains are designed to stress different kinds of conversational workflows. Airline customer service covers travel-related calls such as rebooking and other passenger support tasks. Enterprise IT service management focuses on workplace support and troubleshooting. Healthcare HR service delivery brings in benefits, policy and administration workflows grounded in U.S. healthcare processes.

According to the release, the dataset is meant to expose voice systems to realistic domain-specific language, workflow complexity and user expectations. The creators noted that a system performing well in one environment may struggle in another, especially when policies, terminology and authentication requirements change.

The new dataset is open source and can be downloaded from Hugging Face. The release also points to associated project resources, including a website, paper, GitHub repository, demo and dataset page.

Built around realism and reproducibility

The EVA-Bench team said the benchmark was designed around five core principles: voice-first scope, realism, variety, authentication and reproducibility.

Voice-first scope means the scenarios were selected only from tasks that are typically handled over the phone. Realism was emphasized through tool schemas modeled after production APIs and policy choices based on enterprise constraints. In the healthcare domain, that included references to U.S. systems and concepts such as NPI numbers, FMLA and insurance coverage.

Variety is another major design goal. Instead of repeating similar requests, the benchmark mixes single-intent and multi-intent calls, including conversations with up to four goals in a single interaction. It also includes adversarial scenarios where callers try to bypass troubleshooting, misstate urgency or access data they should not see. Some scenarios are intentionally unsatisfiable to better reflect real support traffic.

Authentication is built into every domain because it is a common point of failure for voice agents. The team said the exact authentication method changes depending on the workflow, rather than being applied uniformly across all tasks.

Reproducibility was treated as essential for reliable scoring. The benchmark was built so that each scenario has one correct resolution path, reducing ambiguity in evaluation results.

Synthetic generation with manual checks

The scenarios were generated with SyGra, a graph-based synthetic data pipeline using GPT-5.4. The release says each scenario is built from three linked components: the user goal, the initial scenario database and the expected final database state. These pieces are generated together so they remain consistent.
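
The three-part structure described in the release can be illustrated with a minimal Python sketch. The class name, field names and exact-match comparison below are assumptions made for illustration; they are not the EVA-Bench schema or its scoring code, and the airline data is invented.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Any


# Hypothetical scenario record; field names are illustrative, not the EVA-Bench schema.
@dataclass(frozen=True)
class Scenario:
    user_goal: str                      # what the caller wants to achieve
    initial_db: dict[str, Any]          # database state before the call
    expected_final_db: dict[str, Any]   # database state after a correct resolution


def final_state_matches(scenario: Scenario, actual_final_db: dict[str, Any]) -> bool:
    """Return True when the database left by the agent equals the expected state."""
    return actual_final_db == scenario.expected_final_db


# Toy airline example with invented data.
scenario = Scenario(
    user_goal="Move booking ABC123 from flight F100 to flight F200.",
    initial_db={"bookings": {"ABC123": {"flight": "F100"}}},
    expected_final_db={"bookings": {"ABC123": {"flight": "F200"}}},
)

# Simulate an agent's tool call by editing a copy of the initial database.
agent_db = deepcopy(scenario.initial_db)
agent_db["bookings"]["ABC123"]["flight"] = "F200"

print(final_state_matches(scenario, agent_db))             # True
print(final_state_matches(scenario, scenario.initial_db))  # False
```

Comparing end states rather than transcripts is one way a benchmark with a single valid outcome per scenario could be scored deterministically, though the release does not confirm that this is how EVA-Bench computes its results.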

The team said the system then runs several validation steps, including structural checks and LLM-based reviews to confirm internal consistency and correct action sequences. After synthetic generation, scenarios also underwent manual review to make sure policies were applied consistently and that each user goal had a single valid outcome.
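
The release does not spell out what the structural checks look for. As a rough sketch only, a check of this kind might confirm that all three components are present and that the two database states describe the same tables. The function name, the rules and the dictionary layout are assumptions, not the project's actual validation code.

```python
from typing import Any


def structural_issues(scenario: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a hypothetical scenario record."""
    issues: list[str] = []

    # Check 1: all three linked components are present.
    for key in ("user_goal", "initial_db", "expected_final_db"):
        if key not in scenario:
            issues.append(f"missing field: {key}")
    if issues:
        return issues

    # Check 2: the user goal is non-empty text.
    goal = scenario["user_goal"]
    if not isinstance(goal, str) or not goal.strip():
        issues.append("user_goal must be non-empty text")

    # Check 3: both database states are mappings that list the same tables.
    initial, final = scenario["initial_db"], scenario["expected_final_db"]
    if not isinstance(initial, dict) or not isinstance(final, dict):
        issues.append("database states must be mappings")
    elif set(initial) != set(final):
        issues.append("initial and expected final databases list different tables")

    return issues


# Invented examples: one well-formed record and one with two problems.
good = {
    "user_goal": "Reset the password for employee E42.",
    "initial_db": {"accounts": {"E42": {"locked": True}}},
    "expected_final_db": {"accounts": {"E42": {"locked": False}}},
}
bad = {
    "user_goal": "",
    "initial_db": {"bookings": {}},
    "expected_final_db": {"tickets": {}},
}

print(structural_issues(good))  # []
print(structural_issues(bad))
```

Checks like these only catch malformed records. Confirming that the action sequence is correct is the part the team said it handles through LLM-based and manual review.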

The authors said EVA-Bench is intended both for people evaluating voice agents and for teams building their own benchmarking datasets. They added that a multilingual extension is planned, which would expand the benchmark beyond English-only enterprise use cases.

For now, the release marks a significant expansion of the project’s scope, turning EVA-Bench into a broader testbed for enterprise voice automation across multiple industries and workflows.