Over half of the world's population speaks more than one language. And for many bilingual speakers, code-switching — seamlessly switching between languages, even mid-sentence — is a natural part of everyday communication. Whether in casual conversations, contact centers, or IT helpdesks, speakers fluidly adapt to whichever language feels most natural in the moment.
Despite the prevalence of bilingual speakers across the world, there has been little work focused on how voice agents handle code-switched speech in enterprise settings. So, when a customer asked us how our voice agents would perform for their largely bilingual customer base who routinely code-switched, we decided to build our own benchmark and dataset to evaluate models. We focused on automatic speech recognition (ASR) — the first step in any voice agent pipeline — because transcription errors propagate forward into every downstream component. In enterprise settings, where a misrouted ticket or misunderstood policy question has real operational consequences, getting the transcript right is an especially important step of the voice agent pipeline.
Our benchmark covers the four language pairs most relevant to our customer base: Spanish-English, French-English, Canadian French-English, and German-English. In each pair, the non-English language serves as the matrix language, with English embedded in spans of varying length. The data covers a wide range of Human Resources (HR) and IT Service Management (ITSM) scenarios, including employee inquiries about benefits or payroll, and support requests such as password resets, VPN access, or device troubleshooting. To measure how various models perform, we report three metrics: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). We chose these metrics to capture both (1) the models' exact transcription accuracy and (2) their ability to preserve the meaning of the utterance for downstream tasks.
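Of the three metrics, WER has a standard definition: the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. The sketch below illustrates that definition only. It is not the benchmark's scoring code, and the normalization it applies (lowercasing and whitespace tokenization) is an assumption; the function names are hypothetical.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = edit distance / number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `wer("necesito reset mi password", "necesito resetear mi password")` has one substitution across four reference words, giving 0.25.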
We release our benchmark and data through AU-Harness, our harness for evaluating voice models. We also provide results from seven ASR systems, including some Large Audio Language Models (LALMs), frontier ASRs, and open-source ASRs. Our main finding is that the cost of code-switching varies depending on the language pair and model tested. ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro emerge as the top models across metrics for the task.
We start with an internal corpus of IT support and HR interactions. To create each code-switched utterance, we begin with parallel user utterances in English and one of our four non-English languages, then filter for good code-switching candidates. We keep utterances between 12 and 40 words — short enough to be natural spoken turns, long enough to contain real switching opportunities. We also exclude utterances where entities dominate — emails, phone numbers, IDs, or URLs that make text half-English by necessity rather than bilingual choice. Finally, we require at least three switchable content words — nouns, verbs, or adjectives that are not entities or product names — to give the generation model enough material to produce a meaningful code-switched version.
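The three filtering rules above can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the pipeline we ran: the `Candidate` type and its fields are hypothetical, the entity and content-word counts are assumed to come from an upstream tagger, and `MAX_ENTITY_SHARE` is an assumed threshold, since the text does not state what counts as entities dominating.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    entity_tokens: int     # emails, phone numbers, IDs, URLs (from an upstream tagger)
    switchable_words: int  # non-entity nouns, verbs, or adjectives

MIN_WORDS, MAX_WORDS = 12, 40  # length bounds stated in the text
MIN_SWITCHABLE = 3             # minimum switchable content words stated in the text
MAX_ENTITY_SHARE = 0.3         # ASSUMPTION: the text gives no numeric cutoff


def is_good_candidate(c: Candidate) -> bool:
    """Return True if the utterance passes all three filters."""
    n_words = len(c.text.split())
    if not (MIN_WORDS <= n_words <= MAX_WORDS):
        return False
    if c.entity_tokens / n_words > MAX_ENTITY_SHARE:
        return False
    return c.switchable_words >= MIN_SWITCHABLE
```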
From here, we tested various strategies for combining languages in a realistic way and ultimately selected a simple persona prompt sent to an LLM (OpenAI/GPT-5) to produce the code-switched text. We then used an LLM verbalization pass to convert the text into its spoken form and used ElevenLabs Multilingual V2 to synthesize the audio. Every utterance is then reviewed by an AI/NLP linguist who is a native speaker of the matrix language; flagged utterances are excluded or regenerated and re-reviewed. The final dataset has 259 Spanish-English records, 298 French-English records, 188 Canadian French-English records, and 173 German-English records.

We report three metrics per model per language pair, chosen to capture transcription accuracy, meaning preservation, and downstream task performance:

We evaluated the following models:
We analyzed errors along two dimensions:
The differences between metrics become most meaningful when models diverge across them.
The semantic metrics tell a broadly similar story to the WER, with a few inversions.
The semantic results also reveal notable consistency between SWER and AER. The two metrics operate at different granularities — SWER aggregates error across every word, while AER measures whether three comprehension questions per utterance can be answered correctly — so differences in scale are expected. What's notable is how stable the relative model rankings are across both. The one clear outlier is Deepgram Nova-3, which sits mid-tier on SWER but ranks last or second-to-last on AER across all language pairs. The gap is most pronounced on Spanish-English: Nova-3's overall rate of semantic errors is lower than its error rate specifically on the details that matter most.
While these results provide a clear picture of relative model performance on code-switched speech, they do not reveal whether the errors stem from the inherent difficulty of transcription itself, or from the additional challenge introduced by language switching.
To isolate the cost of code-switching, we ran every utterance through our evaluation pipeline under three audio conditions: the code-switched audio, a monolingual matrix-language rendering of the same content, and a monolingual English rendering. For each utterance, we measured the difference in WER between the code-switched and monolingual conditions and aggregated the deltas across the benchmark. Below are the results.
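As a rough illustration of the paired comparison, the sketch below takes per-utterance WER values for the three conditions and averages the differences. The dictionary keys and the function name are hypothetical, and using the mean as the aggregate is an assumption; the text does not specify how the deltas were aggregated.

```python
from statistics import mean


def code_switching_cost(rows: list[dict[str, float]]) -> dict[str, float]:
    """Mean per-utterance WER delta between code-switched and monolingual audio.

    Each row holds the WER of one utterance under three conditions:
    'cs' (code-switched), 'matrix' (monolingual matrix language),
    and 'english' (monolingual English). A positive delta means the
    code-switched version was harder to transcribe.
    """
    return {
        "vs_matrix": mean(r["cs"] - r["matrix"] for r in rows),
        "vs_english": mean(r["cs"] - r["english"] for r in rows),
    }
```

Pairing by utterance keeps the content constant across conditions, so the delta reflects the change in language mix rather than differences in what was said.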

Now that we know code-switching can cause models to make mistakes, we turn to investigating the specific conditions associated with those mistakes. To address this question, we fit a two-part model:
This two-part approach lets us distinguish between factors that make an error more likely to occur and factors that influence how large the error becomes once it has. Both steps include the same predictors: (1) the number of language switches in the utterance, and (2) the utterance's Code-Mixing Index (CMI) — the proportion of words drawn from a secondary language relative to the matrix language, following Gambäck and Das. We also include utterance length as a control, since longer utterances provide more opportunities for error.
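Both predictors can be derived from a per-word language tag sequence. The sketch below is a simplified illustration: the function names are hypothetical, and the index here is just the share of words outside the dominant language. The published Code-Mixing Index also accounts for language-independent tokens, which this version ignores.

```python
from collections import Counter


def switch_count(tags: list[str]) -> int:
    """Number of positions where the language tag differs from the previous word's."""
    return sum(1 for prev, curr in zip(tags, tags[1:]) if prev != curr)


def code_mixing_index(tags: list[str]) -> float:
    """Simplified mixing index: share of words NOT in the dominant language.

    Returns 0.0 for a monolingual utterance and approaches 0.5 when two
    languages are evenly mixed.
    """
    if not tags:
        return 0.0
    dominant = max(Counter(tags).values())
    return 1.0 - dominant / len(tags)


# Illustrative tag sequence: a Spanish-matrix utterance with two English spans.
tags = ["es", "es", "en", "en", "es", "es", "es", "en"]
print(switch_count(tags))       # 3
print(code_mixing_index(tags))  # 0.375
```

The two measures are related but distinct: a single long English span and several short ones can yield the same index with very different switch counts, which is why both enter the model as separate predictors.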
From the first part of our model, we find that the number of language switches within an utterance is the predictor most consistently associated with whether a transcription error occurs. Each language change appears to introduce an additional opportunity for the transcription process to fail. This relationship was significant in the French-English language pair in particular, where six out of seven models exhibited it. The other predictors (CMI and utterance length) showed few significant relationships with error occurrence.
When the question shifts to error magnitude, a different pattern emerges. Rather than switch count, CMI surfaces as the stronger predictor. In the German-English language pair specifically, four out of seven models showed a significant positive relationship between CMI and WER. This suggests that once errors occur, their severity is shaped not by how often the speaker switches languages but by the overall density of mixing: the more thoroughly an utterance interweaves the two languages, the larger the resulting transcription errors tend to be.
The two-part model explains what factors are associated with errors occurring and worsening. Our final experiment examines which portions of a code-switched utterance contribute disproportionately to those errors. To test whether errors distribute differently across the English and non-English parts of an utterance, we used GPT-5 to tag each word by language, then attributed each transcription error to the language of the word on which it occurred, computing a per-language WER. The heatmap below shows the results.
The pattern is consistent across all models and language pairs: errors concentrate on the English portions of utterances rather than the matrix-language portions. This is counterintuitive — English is the language these models tend to handle best in monolingual settings. One explanation is that English segments in code-switched speech may disproportionately contain technical vocabulary or named entities that are harder to transcribe. Another is that embedded-language segments create a challenging context regardless of which language is embedded: when a model transitions into a stretch of non-matrix speech, it must adapt to a different phonological and lexical register mid-utterance, increasing the likelihood of error at exactly that span.
This result suggests that transcription difficulty in code-switched ASR is not concentrated at switch points alone, but extends across embedded-language spans more broadly. Disentangling whether this pattern reflects the lexical characteristics of English segments, their structural role as embedded language, or current models' limited ability to adapt mid-utterance is a promising direction for future work.
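The per-language attribution used in this experiment can be sketched as follows, assuming each reference word already carries a language tag. This is an illustration rather than our scoring code: the function names are hypothetical, and it counts only substitutions and deletions, since an inserted word has no reference word (and therefore no language tag) to attach to.

```python
from collections import Counter


def reference_ops(ref: list[str], hyp: list[str]) -> list[str]:
    """Label each reference word 'ok', 'sub', or 'del' via edit-distance alignment."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Walk back from the end of both sequences to recover one optimal alignment.
    ops = [""] * n
    i, j = n, m
    while i > 0:
        cost = 0 if j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            ops[i - 1] = "ok" if cost == 0 else "sub"
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            ops[i - 1] = "del"
            i -= 1
        else:
            j -= 1  # insertion in the hypothesis: not attributed to any language
    return ops


def per_language_error_rate(ref: list[str], langs: list[str],
                            hyp: list[str]) -> dict[str, float]:
    """Errors on reference words of each language / reference words of that language."""
    totals = Counter(langs)
    errors = Counter(lang for lang, op in zip(langs, reference_ops(ref, hyp))
                     if op != "ok")
    return {lang: errors[lang] / totals[lang] for lang in totals}
```

For a six-word French-English reference in which only one of the two English words is mistranscribed, this returns an error rate of 0.5 for English and 0.0 for French.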
Several limitations are worth acknowledging:
Code-switching has long been a stress test for voice models. Our results suggest that for the best frontier ASR systems, it is increasingly becoming a normal condition.
When enterprises choose their ASR systems carefully, bilingual customers can speak naturally — switching languages mid-sentence as the conversation demands — without sacrificing transcription quality or downstream task performance. The top models in our benchmark handle code-switched speech with surprisingly small penalties relative to their monolingual baselines, and the semantic metrics tell an even more encouraging story.
But the picture is not uniformly positive. Before making production decisions, you must benchmark the languages your customers actually speak: performance varies substantially across models and language pairs, and the best choice for Spanish-English speakers is not necessarily the best choice for German-English speakers.