Audio samples from "Disentangling Style and Speaker Attributes for TTS Style Transfer"

Authors: Xiaochun An, Frank K. Soong, Lei Xie
Abstract: End-to-end neural TTS has shown improved performance in speech style transfer. However, the improvement is still limited by the available training data in both target styles and speakers. Additionally, degraded performance is observed when the trained TTS tries to transfer speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to seen and unseen style transfer training on disjoint, multi-style datasets, i.e., datasets of different styles are recorded, one individual style by one speaker in multiple utterances. An inverse autoregressive flow (IAF) technique is first introduced to improve the variational inference for learning an expressive style representation. A speaker encoder network is then developed for learning a discriminative speaker embedding, which is jointly trained with the rest of the neural TTS modules. The proposed approach to seen and unseen style transfer is effectively trained with six specifically designed objectives: reconstruction loss, adversarial loss, style distortion loss, cycle consistency loss, style classification loss, and speaker classification loss. Experiments demonstrate, both objectively and subjectively, the effectiveness of the proposed approach for seen and unseen style transfer tasks. The performance of our approach is superior to and more robust than those of four other reference systems of prior art.
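As a rough illustration of how the six objectives listed above might be combined into a single training loss, the following is a minimal PyTorch-style sketch. The loss keys, the weighting scheme, and the dummy values are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def total_loss(losses: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of six training objectives.

    `losses` holds precomputed scalar tensors; the key names and the
    weighting scheme are illustrative assumptions only.
    """
    keys = ["reconstruction", "adversarial", "style_distortion",
            "cycle_consistency", "style_cls", "speaker_cls"]
    return sum(weights[k] * losses[k] for k in keys)

# Dummy scalar losses and unit weights, just to show the call pattern.
dummy = {k: torch.tensor(1.0, requires_grad=True)
         for k in ["reconstruction", "adversarial", "style_distortion",
                   "cycle_consistency", "style_cls", "speaker_cls"]}
unit_weights = {k: 1.0 for k in dummy}
print(total_loss(dummy, unit_weights))  # tensor(6., grad_fn=...)
```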

Evaluation: The performance of seen and unseen speech style transfer is evaluated in terms of speech naturalness, style similarity, and speaker similarity. For each system, we randomly select the Reading (standard reading speech) style as the seen style and conduct three experiments: Reading style to Customer-service (spontaneous speech with a fast speaking rate) style (R2C), Reading style to Poetry (classical Chinese poetry speech with rich prosody variations) style (R2P), and Reading style to Game (exaggerated speech for role dubbing in games) style (R2G). We randomly choose a unique Taiwanese-reading (Taiwanese-accented reading speech) style from a new female speaker as the unseen style and conduct three tests, Taiwanese-reading style to Customer-service style (TR2C), Taiwanese-reading style to Poetry style (TR2P), and Taiwanese-reading style to Game style (TR2G), to assess the performance of unseen style transfer.

1. Comparing the speech naturalness of seen and unseen style transfer with the GST, VAE, MRF-IT, MRF-ACC and proposed models:

We compare the speech naturalness of all models with MOS and ABX listening tests, where listeners are asked to mark a sentence as unintelligible when any part of it cannot be understood.
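For context on how MOS results of this kind are typically aggregated, the sketch below computes a mean opinion score with a normal-approximation 95% confidence interval from raw 1-5 listener ratings. The ratings shown are placeholder values, not results from this study.

```python
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% CI."""
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, (mean - z * sem, mean + z * sem)

# Placeholder ratings on the usual 1-5 naturalness scale.
scores = [4, 5, 4, 3, 4, 5, 4, 4]
mean, ci = mos_with_ci(scores)
print(f"MOS = {mean:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```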

Speech naturalness: bad --> good

Transfer types: Seen style transfer (R2C, R2P, R2G) | Unseen style transfer (TR2C, TR2P, TR2G)

[Audio samples for each model (GST, VAE, MRF-IT, MRF-ACC, Proposed) for each of the six transfer types.]

Short summary: The proposed model outperforms the reference models on both seen and unseen style transfer tasks. On unseen style transfer, the performance of the proposed model is much better than that of the other models. The results indicate that the proposed model generalizes better to unseen style transfer.

2. Comparing the style similarity of seen and unseen style transfer with the GST, VAE, MRF-IT, MRF-ACC and proposed models:

We conduct an ABX test of style similarity to assess the style conversion performance, where listeners are asked to choose which speech sample sounds closer to the target style in terms of style expression.
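As background on how ABX preferences are commonly tallied and checked for significance, here is a small sketch that is not tied to the paper's exact protocol; the responses are placeholder values, and the binomial sign test is one standard choice for a binary-preference comparison.

```python
from math import comb

def abx_preference(choices):
    """Tally ABX choices ('A', 'B', or 'N' for no preference)
    and return the preference rate for each option."""
    n = len(choices)
    return {opt: choices.count(opt) / n for opt in ("A", "B", "N")}

def sign_test_p(wins_a, wins_b):
    """Two-sided binomial sign test p-value, ignoring ties."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Placeholder responses: A = system under test, B = reference, N = no preference.
responses = ["A", "A", "B", "A", "N", "A", "A", "B", "A", "A"]
print(abx_preference(responses))
print(sign_test_p(responses.count("A"), responses.count("B")))
```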

Style similarity: low --> high

Source styles: R, TR | Target styles: C, P, G
Transfer types: Seen style transfer (R2C, R2P, R2G) | Unseen style transfer (TR2C, TR2P, TR2G)

[Audio samples for each model (GST, VAE, MRF-IT, MRF-ACC, Proposed) for each of the six transfer types.]

Short summary: Listeners prefer the proposed system, showing that the proposed method improves style transfer performance. For the unseen style, we find that the GST, VAE, and MRF-IT models in most cases fail to transfer the unseen Taiwanese-reading style to the target Customer-service, Poetry, or Game style. The best reference system, MRF-ACC, is still significantly inferior to the proposed model in the style similarity test.

3. Comparing the speaker similarity of seen and unseen style transfer with the GST, VAE, MRF-IT, MRF-ACC and proposed models:

We conduct CMOS tests between the proposed model and each reference system to evaluate how well the transferred speech preserves the source speaker's timbre, where listeners are asked to select the audio whose speaker sounds closer to that of the source audio.
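For reference, CMOS scores of this kind are typically collected on a -3 to +3 comparative scale and then averaged, with positive values favoring the system under test. The sketch below illustrates that aggregation with placeholder ratings, not data from this study.

```python
import statistics

def cmos(scores):
    """Comparative MOS: mean of per-pair ratings on a -3..+3 scale,
    where positive values favor the system under test."""
    return statistics.fmean(scores)

# Placeholder listener ratings for one proposed-vs-reference pairing.
ratings = [1, 0, 2, 1, -1, 1, 0, 2]
print(f"CMOS = {cmos(ratings):+.2f}")
```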

Speaker similarity: low --> high

Source speakers: R, TR
Transfer types: Seen style transfer (R2C, R2P, R2G) | Unseen style transfer (TR2C, TR2P, TR2G)

[Audio samples for each model (GST, VAE, MRF-IT, MRF-ACC, Proposed) for each of the six transfer types.]

Short summary: The proposed model delivers higher speaker similarity than the other models. On the unseen style transfers TR2C, TR2P, and TR2G, the GST, VAE, and MRF-IT models fail to preserve the speaker's timbre, resulting in lower similarity scores. The MRF-ACC system, the best of the reference systems, still performs significantly worse than the proposed model.