Multi-modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

Abstract

In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, this paper proposes a novel two-stage automatic annotation pipeline. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries respectively, while remaining remarkably robust to data scarcity.
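To make the first stage more concrete, the sketch below shows one common form of contrastive objective over paired embeddings. It is an illustration only: the abstract does not specify the loss, so the symmetric cross-entropy form, the `temperature` value, and all function names here are assumptions rather than the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Guard against a zero vector to avoid division by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """speech_embs[i] and text_embs[i] are a matched pair (hypothetically, a
    speech-silence segment and its word-punctuation counterpart); every other
    combination in the batch acts as a negative. temperature is an assumed value."""
    s = [l2_normalize(v) for v in speech_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(s)
    # Cosine similarities scaled by the temperature.
    logits = [[dot(s[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Speech-to-text direction: the correct text for row i is column i.
    loss_s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-speech direction uses the transposed similarity matrix.
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2s = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)
```

Under this formulation, a batch whose speech and text embeddings are correctly aligned yields a loss near zero, while a batch with swapped pairs is penalised heavily.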

The Architecture of Our Proposed Work

[Figure: architecture of the proposed work]

Objective Evaluation

The results of our proposed work, compared with previous benchmarks, are shown below.

[Figure: objective evaluation results]
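For readers unfamiliar with the metric, per-boundary F1 can be computed from aligned reference and predicted label sequences roughly as follows. This is a minimal sketch: the "NB" (no boundary) label and the toy sequences are assumptions for illustration and are not data from the paper.

```python
def boundary_f1(reference, predicted, label):
    """F1 score for one boundary label over two aligned label sequences."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == label and p == label)
    fp = sum(1 for r, p in pairs if r != label and p == label)
    fn = sum(1 for r, p in pairs if r == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one label per word position ("NB" is a hypothetical no-boundary tag).
reference = ["PW", "NB", "PPH", "PW", "IPH"]
predicted = ["PW", "PW", "PPH", "NB", "IPH"]
pw_f1 = boundary_f1(reference, predicted, "PW")
pph_f1 = boundary_f1(reference, predicted, "PPH")
```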

Ablation Studies

We conduct extensive ablation studies to prove the efficacy of each module in our proposed work, as shown below.

[Figure: ablation study results]

Subjective Evaluation

We conduct an AB Preference evaluation on LJSpeech, an open-source corpus of an unseen speaker, using the Multi-modal Baseline and the Proposed Work to automatically annotate the prosodic boundaries of the entire corpus. A TTS system is then trained on each set of annotations using Conformer-FastSpeech2 and HiFi-GAN, with a 16-dimensional embedding of the prosodic boundary labels concatenated to the phone embedding as input. Thirty native speakers are asked to compare 30 utterances from each system and choose the one they prefer. Our proposed work achieves a 7.58% improvement in AB Preference over the previous benchmark.

[Figure: AB Preference results]
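The way prosodic boundary labels enter the TTS input can be sketched as a per-phone concatenation. Only the 16-dimensional label embedding comes from the description above; the label inventory (including a "NONE" class), the random initialisation, and the function names are assumptions, and in a real system the table would be learned jointly with the acoustic model.

```python
import random

LABELS = ["NONE", "PW", "PPH", "IPH"]  # "NONE" is an assumed no-boundary class
LABEL_DIM = 16                          # dimension stated in the evaluation setup

def make_label_table(seed=0):
    """Stand-in for a learned embedding table: one 16-dim vector per label."""
    rng = random.Random(seed)
    return {label: [rng.uniform(-0.1, 0.1) for _ in range(LABEL_DIM)]
            for label in LABELS}

def concat_prosody(phone_embs, labels, table):
    """Append the label embedding to each phone embedding."""
    if len(phone_embs) != len(labels):
        raise ValueError("need exactly one boundary label per phone")
    return [list(phone) + table[label] for phone, label in zip(phone_embs, labels)]

# Toy input: three phones with 4-dim embeddings (the real phone dimension is not stated).
table = make_label_table()
phones = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
model_input = concat_prosody(phones, ["NONE", "PW", "IPH"], table)
```

Each row of `model_input` is the original phone embedding followed by the 16 label dimensions, so the encoder sees the boundary type at every phone position.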

Below are some of the speech samples used in the MOS test. Besides the speech generated by the two TTS systems mentioned above, we also provide the raw speech recorded by a human speaker, the text, and the predicted prosody for reference.

Samples Demo

Demo 1: LJ047-0165
Text: At the end of the interview, Marina Oswald came into the room, when he observed that she seemed, quote, quite alarmed, end quote, about the visit.
Prosody: At the end of the interview, [PW] Marina Oswald came into the room, [PPH] when he observed that she seemed, [PW] quote, [PW] quite alarmed, [PW] end quote, [PW] about the visit. [IPH]
Audio: Multi-modal Baseline | Proposed Work | Ground Truth

Demo 2: LJ002-0227
Text: Absence or neglect of divine service, were present as in the King's Bench, but in an exaggerated form.
Prosody: Absence or neglect of divine service, [PPH] were present [PW] as in the King's Bench, [PPH] but [PPH] in an exaggerated form. [IPH]
Audio: Multi-modal Baseline | Proposed Work | Ground Truth

Demo 3: LJ006-0028
Text: Mister Crawford was thoroughly versed in the still imperfectly understood science of prison management, and fully qualified for his new duties.
Prosody: Mister Crawford was thoroughly versed [PW] in the still imperfectly understood science of prison management, [PPH] and fully qualified [PW] for his new duties. [IPH]
Audio: Multi-modal Baseline | Proposed Work | Ground Truth

Demo 4: LJ006-0140
Text: Evidence was given before the inspectors of eight or ten prisoners seen giddy drunk, not able to sit upon forms.
Prosody: Evidence was given before the inspectors of eight or ten prisoners [PW] seen [PW] giddy drunk, [PPH] not able to sit upon forms. [IPH]
Audio: Multi-modal Baseline | Proposed Work | Ground Truth

Demo 5: LJ007-0121
Text: There were no restraints, cards and backgammon were played, and the time passed in feasting and revelry.
Prosody: There were no restraints, [PPH] cards and backgammon were played, [PPH] and the time passed in feasting and revelry. [IPH]
Audio: Multi-modal Baseline | Proposed Work | Ground Truth

Demo 6: LJ013-0191
Text: All this was evidence sufficient to warrant Courvoisier's committal for trial;
Prosody: All [PW] this was evidence sufficient to warrant Courvoisier's committal for trial; [IPH]
Audio: Multi-modal Baseline | Proposed Work | Ground Truth
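The Prosody lines above use inline [PW], [PPH], and [IPH] tags after the text they close. As a convenience for anyone processing such annotations, the following sketch splits a tagged line into (chunk, boundary) pairs and recovers the plain text. The function names are hypothetical, and the tag inventory is limited to the three tags that appear in these samples.

```python
import re

# Matches only the three boundary tags seen in the samples above.
TAG = re.compile(r"\[(PW|PPH|IPH)\]")

def parse_prosody(line):
    """Return a list of (text_chunk, boundary_label) pairs, in order."""
    pairs = []
    pos = 0
    for match in TAG.finditer(line):
        chunk = line[pos:match.start()].strip()
        pairs.append((chunk, match.group(1)))
        pos = match.end()
    return pairs

def strip_tags(line):
    """Remove boundary tags and collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", TAG.sub("", line)).strip()

demo5 = ("There were no restraints, [PPH] cards and backgammon were played, "
         "[PPH] and the time passed in feasting and revelry. [IPH]")
pairs = parse_prosody(demo5)
plain = strip_tags(demo5)
```

Applied to Demo 5, `strip_tags` reproduces the Text line exactly, which is a quick sanity check that the annotation only adds tags and does not alter the words.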