Evaluation: We evaluate seen and unseen speech style transfer in terms of speech naturalness, style similarity, and speaker similarity. For each system, we randomly select the Reading style (standard reading speech) as the seen source style and run three experiments: Reading to Customer-service style (spontaneous, fast speech; R2C), Reading to Poetry style (classical Chinese poetry speech with rich prosody variations; R2P), and Reading to Game style (exaggerated speech for role dubbing in games; R2G). To assess unseen style transfer, we randomly choose a Taiwanese-reading style (reading speech with a Taiwanese accent) from a new female speaker as the unseen source style and run three tests: Taiwanese-reading to Customer-service style (TR2C), Taiwanese-reading to Poetry style (TR2P), and Taiwanese-reading to Game style (TR2G).
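The six test conditions described above can be enumerated programmatically, for example when organizing listening-test stimuli. The following is a minimal sketch, not the authors' tooling: the abbreviations (R, TR, C, P, G) come from the text, while the dictionary layout and the function name `transfer_conditions` are illustrative assumptions.

```python
from itertools import product

# Abbreviations are taken from the text; this data layout is an assumption.
SOURCE_STYLES = {"R": "seen", "TR": "unseen"}  # Reading, Taiwanese-reading
TARGET_STYLES = ["C", "P", "G"]                # Customer-service, Poetry, Game


def transfer_conditions():
    """Map each condition name (e.g. 'TR2P') to (source, target, seen/unseen)."""
    return {
        f"{src}2{tgt}": (src, tgt, kind)
        for (src, kind), tgt in product(SOURCE_STYLES.items(), TARGET_STYLES)
    }


if __name__ == "__main__":
    for name, (src, tgt, kind) in transfer_conditions().items():
        print(f"{name}: {src} -> {tgt} ({kind} style transfer)")
```

Each of the five systems (GST, VAE, MRF-IT, MRF-ACC, and the proposed model) would then be scored on every condition for each of the three criteria.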
Transfer types: Seen style transfer (R2C, R2P, R2G); Unseen style transfer (TR2C, TR2P, TR2G)
Systems: GST, VAE, MRF-IT, MRF-ACC, Proposed
Short summary: The proposed model outperforms the reference models on both seen and unseen style transfer. On unseen style transfer it performs much better than the other models, indicating better generalization to unseen styles.
Source styles: R, TR. Target styles: C, P, G.
Transfer types: Seen style transfer (R2C, R2P, R2G); Unseen style transfer (TR2C, TR2P, TR2G)
Systems: GST, VAE, MRF-IT, MRF-ACC, Proposed
Short summary: Listeners prefer the proposed system, showing that the proposed method improves style transfer performance. For the unseen style, the GST, VAE, and MRF-IT models in most cases fail to transfer the Taiwanese-reading style to the Customer-service, Poetry, or Game target style. The best reference system, MRF-ACC, is still significantly inferior to the proposed model in the style similarity test.
Source speakers: R, TR
Transfer types: Seen style transfer (R2C, R2P, R2G); Unseen style transfer (TR2C, TR2P, TR2G)
Systems: GST, VAE, MRF-IT, MRF-ACC, Proposed
Short summary: The proposed model achieves higher speaker similarity than the other models. On the unseen style transfer tasks (TR2C, TR2P, and TR2G), the GST, VAE, and MRF-IT models fail to preserve the speaker's timbre, resulting in lower similarity scores. MRF-ACC, the best of the reference systems, still performs significantly worse than the proposed model.