Model Overview

"Betraying" Scottish Accents

Note: "Betraying" is a metaphor: these systems generate hallucinated accents, misrepresenting the voice identities of Scottish and English users.

We asked two current TTS systems and our AccentBox to mimic a Scottish speaker's voice.

Mimic the voice/accent in the following audio:

Transcription: This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.

Generated Audios:
Text | Current TTS 1 | Current TTS 2 | AccentBox
Well, here's a story for you.
Sarah Perry was a veterinary nurse who had been working daily at an old zoo in a deserted district of the territory.
So, she was very happy to start a new job at a superb private practice in North Square near the Duke Street Tower.
That area was much nearer for her and more to her liking.
Even so, on her first morning, she felt stressed.

"Betraying" English Accents

We asked two current TTS systems and our AccentBox to mimic an English speaker's voice.

Mimic the voice/accent in the following audio:

Transcription: This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.

Generated Audios:
Text | Current TTS 1 | Current TTS 2 | AccentBox
Well, here's a story for you.
Sarah Perry was a veterinary nurse who had been working daily at an old zoo in a deserted district of the territory.
So, she was very happy to start a new job at a superb private practice in North Square near the Duke Street Tower.
That area was much nearer for her and more to her liking.
Even so, on her first morning, she felt stressed.

Accent Conversion

We asked our AccentBox to generate speech in the English speaker's voice in various accents.

Mimic the speaker's voice in the following audio:

Transcription: This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.

Then convert to various accents!

Generated Audios:
Text Original English American Australian Irish Scottish
Once Sarah had managed to bathe the goose, she wiped her off with a cloth and laid her on her right side.
Then Sarah confirmed the vet’s diagnosis.
Almost immediately, she remembered an effective treatment that required her to measure out a lot of medicine.
Sarah warned that this course of treatment might be expensive - either five or six times the cost of penicillin.
I can't imagine paying so much, but Mrs. Harrison - a millionaire lawyer - thought it was a fair price for a cure.

Difficulty in Evaluation

1. Why Is Evaluation Needed?

“In machine learning, it's not just what you build - it's how well you measure it. Without rigorous evaluation, a model is just a guess.”

We rely heavily on human listeners to give us high-quality, accurate, and meaningful feedback so that we can build better TTS systems.

2. Vague Feedback on Accent

We ran a naive listening test and asked listeners to describe the cues they used to decide which candidate was better.

A sample listening test question

Instruction: Listen carefully to all speech recordings below in full. Then pick the candidate speech recording that is more similar in terms of accent to the reference speech recording. Please disregard the mismatch in voice, gender, and audio quality.
Here is the reference speech recording:



These are some of the comments we received. It was unclear how we could use them to meaningfully improve our system.

"very close but flow on accent slightly better here"
"flow more natural like the reference (speech)"
"close but slightly better flow"
"Mixture between American and English accent"
"The speed overall and tone sounds the same (as the reference speech)"
"More accurate speech pattern (than the other candidate speech)"
...



3. A Novel Listening Test Design

Rather than an open-ended text entry box, we gave listeners this highlighting task.

This is what we got in the end. The darker the color, the more listeners selected that part as a cue for evaluating accent similarity.
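
The aggregation behind such a heatmap can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the study: it assumes listeners highlight whole words of the transcript, and the function name `highlight_intensity` and the example responses are hypothetical.

```python
from collections import Counter


def highlight_intensity(words, selections):
    """Return, for each word, the fraction of listeners who highlighted it.

    words:      list of transcript words (assumed unit of highlighting).
    selections: one list of highlighted word indices per listener.
    """
    if not selections:
        return [0.0] * len(words)
    counts = Counter()
    for selected in selections:
        # Use a set so each listener counts at most once per word.
        for idx in set(selected):
            if 0 <= idx < len(words):
                counts[idx] += 1
    return [counts[i] / len(selections) for i in range(len(words))]


words = "Even so, on her first morning, she felt stressed.".split()
# Hypothetical responses: word indices highlighted by three listeners.
selections = [[4, 5], [5, 8], [5]]
intensity = highlight_intensity(words, selections)

# A higher fraction corresponds to a darker color in the heatmap.
for word, value in zip(words, intensity):
    print(f"{word:10s} {value:.2f}")
```

Normalizing by the number of listeners keeps the color scale comparable across sentences rated by different numbers of people.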

Academic Outputs

[1] J. Zhong, K. Richmond, Z. Su, and S. Sun, "AccentBox: Towards High-Fidelity Zero-Shot Accent Generation," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025, pp. 1-5, doi: 10.1109/ICASSP49660.2025.10888332.
Paper, arXiv, Code, Video, Slides, Poster
[2] J. Zhong, S. Liu, D. Wells, and K. Richmond, "Pairwise Evaluation of Accent Similarity in Speech Synthesis," Submitted to Proc. Interspeech 2025
To be released after the anonymity period ends.