AccentBox: Towards High-Fidelity Zero-Shot Accent Generation

arXiv

Demo

None of the demos on this page are cherry-picked. We arbitrarily choose the first five stimuli for demonstration.

Abstract

While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation, which unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS through a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) Accent Identification (AID) performance, with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent and cross-accent generation, and enables unseen accent generation.
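For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score can be computed for a multi-class accent classifier. The abstract does not state which averaging is used, so macro-averaging here is an assumption, and the function names and toy labels are hypothetical illustrations rather than data or results from the paper.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels, invented purely for illustration.
toy_true = ["English", "English", "Irish", "Irish", "American", "American"]
toy_pred = ["English", "American", "Irish", "Irish", "American", "American"]

print(f1_per_class(toy_true, toy_pred))
print(round(macro_f1(toy_true, toy_pred), 4))
```

Macro-averaging weights every accent equally regardless of how many utterances it has, which is one common choice when accent classes are imbalanced.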

I. Problem Identification

Accent Hallucination/Mismatch in ZS-TTS

In each of the following groups, we input a fixed reference text-speech pair to the open-source VALL-E X with different target texts. The reference text-speech pairs are the 24th utterance of different test speakers with different accents in VCTK. The target texts are the first five stimuli in the elicitation passage Comma Gets a Cure. We also provide the accent predictions made by GenAID. Note that GenAID is not trained on any synthesised speech, so its predictions may be inaccurate; they nevertheless demonstrate the inconsistency of accents in the generated speech.

Group I: Prompting VALL-E X with a Scottish Speaker (p234 in VCTK)

Reference Speech Reference Text
This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.

Target Text VALL-E X Generation GenAID Prediction
1. Well, here's a story for you. Australian
2. Sarah Perry was a veterinary nurse who had been working daily at an old zoo in a deserted district of the territory. Australian
3. So, she was very happy to start a new job at a superb private practice in North Square near the Duke Street Tower. American
4. That area was much nearer for her and more to her liking. English
5. Even so, on her first morning, she felt stressed. English

Group II: Prompting VALL-E X with an Irish Speaker (p245 in VCTK)

Reference Speech Reference Text
This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.

Target Text VALL-E X Generation GenAID Prediction
1. Well, here's a story for you. Australian
2. Sarah Perry was a veterinary nurse who had been working daily at an old zoo in a deserted district of the territory. Canadian
3. So, she was very happy to start a new job at a superb private practice in North Square near the Duke Street Tower. American
4. That area was much nearer for her and more to her liking. American
5. Even so, on her first morning, she felt stressed. Canadian
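The inconsistency in the two tables above can be tallied with a short script. This is only an illustrative sketch: the prediction labels are transcribed from the tables, while the `summarise` helper and the two summary measures (number of distinct predicted accents, and the fraction of predictions matching the reference accent) are our own hypothetical choices, not metrics used by the paper.

```python
from collections import Counter

# GenAID predictions transcribed from the Group I and Group II tables,
# keyed by the accent of the reference speaker.
predictions = {
    "Scottish": ["Australian", "Australian", "American", "English", "English"],
    "Irish": ["Australian", "Canadian", "American", "American", "Canadian"],
}


def summarise(reference_accent, predicted):
    """Summarise how consistent the predicted accents are (hypothetical helper)."""
    counts = Counter(predicted)
    matches = sum(1 for label in predicted if label == reference_accent)
    return {
        "distinct_accents": len(counts),
        "match_rate": matches / len(predicted),
    }


for accent, predicted in predictions.items():
    print(accent, summarise(accent, predicted))
```

In both groups the five generations are assigned three different accents, and none is predicted as the accent of the reference speaker, keeping in mind the caveat that GenAID was not trained on synthesised speech.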

II. Model Architecture

Stage I: GenAID - Generalisable Accent Identification across Speakers

[Figure: GenAID architecture]

Stage II: AccentBox - High-Fidelity Zero-Shot Accent Generation

[Figure: AccentBox architecture]

III. Results

Accent Identification Results

[Figure: accent identification results]

Zero-shot Accent Generation Results

[Figures: zero-shot accent generation results]

IV. Synthesis Samples

Inherent Accent Generation

Reference Speech (Speaker & Accent) Note
speaker: p225
accent: English
Stimuli Baseline Accent_ID Proposed
#1
#2
#3
#4
#5

Reference Speech (Speaker & Accent) Note
speaker: p294
accent: American
Stimuli Baseline Accent_ID Proposed
#1
#2
#3
#4
#5

Reference Speech (Speaker & Accent) Note
speaker: p245
accent: Irish
Stimuli Baseline Accent_ID Proposed
#1
#2
#3
#4
#5

Cross Accent Generation

Note that the Baseline system cannot perform cross-accent generation.

All of the following cross-accent generation samples use this reference speech for speaker information.
Reference Speech (Speaker) Note
speaker: p225
accent: English

Conversion to American accent:
Reference Speech (Accent) Note
speaker: p294
accent: American
Stimuli Accent_ID Proposed
#1
#2
#3
#4
#5

Conversion to Irish accent:
Reference Speech (Accent) Note
speaker: p245
accent: Irish
Stimuli Accent_ID Proposed
#1
#2
#3
#4
#5

Unseen Accent Generation

Note that the Accent_ID system cannot perform unseen accent generation.

Zero-shot New Zealand accent (unseen by AccentBox, seen by GenAID):
Reference Speech (Speaker & Accent) Note
speaker: p335
accent: New Zealand
Stimuli Baseline Proposed
#1
#2
#3
#4
#5

Zero-shot Welsh accent (unseen by both AccentBox and GenAID):
Reference Speech (Speaker & Accent) Note
speaker: p253
accent: Welsh
Stimuli Baseline Proposed
#1
#2
#3
#4
#5

Appendix

Appendix I: Composition of the AID data.

[Figure: composition of the AID data]