The demos on this page are NOT cherry-picked: we simply take the first five stimuli for demonstration.
Abstract
While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation, which unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS in a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) performance on Accent Identification (AID), with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent and cross-accent generation, and enables unseen accent generation.
I. Problem Identification
Accent Hallucination/Mismatch in ZS-TTS
In each of the following groups, we feed a fixed reference text-speech pair to the open-source VALL-E X together with different target texts. Each reference pair is the 24th utterance of a VCTK test speaker, and each group uses a speaker with a different accent. The target texts are the first five stimuli in the elicitation passage Comma Gets a Cure. We also provide the accent predicted by GenAID for each generation. Note that GenAID is not trained on any synthesised speech, so its predictions may be inaccurate, but they do demonstrate the inconsistency of accents in the generated speech.
Group I: Prompting VALL-E X with a Scottish Speaker (p234 in VCTK)
Reference Speech
Reference Text
This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.
| Target Text | VALL-E X Generation | GenAID Prediction |
| --- | --- | --- |
| 1. Well, here's a story for you. | (audio) | Australian |
| 2. Sarah Perry was a veterinary nurse who had been working daily at an old zoo in a deserted district of the territory. | (audio) | Australian |
| 3. So, she was very happy to start a new job at a superb private practice in North Square near the Duke Street Tower. | (audio) | American |
| 4. That area was much nearer for her and more to her liking. | (audio) | English |
| 5. Even so, on her first morning, she felt stressed. | (audio) | English |
Group II: Prompting VALL-E X with an Irish Speaker (p245 in VCTK)
Reference Speech
Reference Text
This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.
| Target Text | VALL-E X Generation | GenAID Prediction |
| --- | --- | --- |
| 1. Well, here's a story for you. | (audio) | Australian |
| 2. Sarah Perry was a veterinary nurse who had been working daily at an old zoo in a deserted district of the territory. | (audio) | Canadian |
| 3. So, she was very happy to start a new job at a superb private practice in North Square near the Duke Street Tower. | (audio) | American |
| 4. That area was much nearer for her and more to her liking. | (audio) | American |
| 5. Even so, on her first morning, she felt stressed. | (audio) | Canadian |
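The accent inconsistency in Groups I and II can be made concrete by tallying the GenAID predictions listed above. The following is a minimal sketch, not part of the original evaluation: the `summarise` helper and the two summary statistics are our own illustrative choices, and the labels are simply copied from this page. As noted earlier, GenAID is not trained on synthesised speech, so these counts are only indicative.

```python
from collections import Counter

# GenAID predictions copied from Groups I and II above,
# keyed by the accent of the reference speaker.
predictions = {
    "Scottish": ["Australian", "Australian", "American", "English", "English"],
    "Irish": ["Australian", "Canadian", "American", "American", "Canadian"],
}


def summarise(reference_accent, predicted):
    """Hypothetical helper: summarise how consistent the predicted accents are.

    match_rate       -- fraction of generations predicted as the reference accent
    distinct_accents -- number of different accents predicted within the group
    """
    counts = Counter(predicted)
    return {
        "match_rate": counts[reference_accent] / len(predicted),
        "distinct_accents": len(counts),
    }


for accent, predicted in predictions.items():
    print(accent, summarise(accent, predicted))
```

In both groups, none of the five generations is predicted as the reference accent, and three different accents are predicted for a single reference speaker.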
II. Model Architecture
Stage I: GenAID - Generalisable Accent Identification across Speakers