Rendering Of Voice By Using Convolutional Neural Network And With The Help Of Text-To-Speech Module


Dhiren Patel, Aashay Ingle, Saurabh Joshi, Mandar Hegiste

1. B.Sc Computer Science Graduate, Mumbai University, Mithibai College, Maharashtra - 400056
2. Mechanical Engineering Graduate, Mumbai University, Viva College, Maharashtra - 401305
3. Software Engineer Graduate, Mumbai University, Xavier Institute of Technology, Maharashtra
4. Mechanical Engineering Graduate, Mumbai University, Viva College, Maharashtra - 401305

Abstract - This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNNs), without any recurrent units. The recurrent neural network (RNN) has recently been a standard technique for modelling sequential data, and it has been used in several cutting-edge neural TTS systems. However, training the RNN components often requires a very powerful computer or a very long time, typically several days or weeks. Other recent studies have shown that CNN-based sequence synthesis can be much faster than RNN-based techniques because of its high parallelizability. The objective of this paper is to present an alternative neural TTS system, based only on CNNs, that alleviates these economic costs of training. In our experiment, the proposed Deep Convolutional TTS could be trained sufficiently in a single night (15 hours) on an ordinary gaming PC equipped with two GPUs, while the quality of the synthesized speech remained almost acceptable.

Key Words: Text-to-speech, deep learning, convolutional neural network, attention, sequence-to-sequence learning.

1. INTRODUCTION

Text-to-speech (TTS) is becoming more and more common and is turning into a basic user interface for many systems. To encourage the further use of TTS in various systems, it is important to develop a handy, maintainable, and extensible TTS component that is accessible to speech non-specialists, enterprising individuals, and small teams who do not have massive computers.

Traditional TTS systems, however, are not necessarily friendly to such users, as these systems are typically composed of many domain-specific modules. For example, a typical parametric TTS system is an elaborate integration of modules such as a text analyzer, an F0 generator, a spectrum generator, a pause estimator, and a vocoder that synthesizes a waveform from these intermediate data. Deep learning can sometimes unite these internal building blocks into a single model that directly connects the input and the output; this type of technique is sometimes called 'end-to-end' learning.

Although such a technique is sometimes criticized as 'a black box,' an end-to-end TTS system named Tacotron, which directly estimates a spectrogram from an input text, has recently achieved promising performance without intensively engineered parametric models based on domain-specific knowledge. Tacotron, however, has a drawback: it relies on many recurrent units, which are quite costly to train, making it almost infeasible for ordinary labs without powerful machines to study and extend it further. Indeed, several open clones of Tacotron have been attempted, but they struggle to reproduce speech of a quality as clear as the original work.
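To illustrate why replacing recurrence with convolution reduces this training cost, the following is a minimal sketch of a gated, dilated 1-D convolution block. It is written in PyTorch (which the paper does not specify) and is not the paper's exact architecture: the point is only that the convolution processes every timestep of a batch in one parallel call, whereas an RNN must advance through the sequence step by step.

```python
# A minimal sketch, not the paper's exact architecture: a gated,
# dilated 1-D convolution block in the spirit of fully convolutional
# TTS encoders/decoders. The convolution visits all timesteps of the
# batch in one parallel call, unlike an RNN's step-by-step recurrence.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        # 2x channels so the output can be split into value/gate halves.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=3,
                              padding=dilation, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        value, gate = self.conv(x).chunk(2, dim=1)
        return x + torch.tanh(value) * torch.sigmoid(gate)  # residual

x = torch.randn(8, 64, 200)                    # (batch, channels, time)
y = ConvBlock(64, dilation=2)(x)               # all 200 steps in parallel
print(y.shape)                                 # torch.Size([8, 64, 200])
```

Stacking such blocks with increasing dilation widens the receptive field over the text or spectrogram sequence without introducing any sequential dependency across timesteps.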

The purpose of this paper is to present Deep Convolutional TTS (DCTTS), a novel, handy neural TTS system that is fully convolutional. The architecture is largely similar to Tacotron's, but it is based on a fully convolutional sequence-to-sequence learning model similar to those in the literature. We show that this handy TTS actually works in a reasonable setting. The contribution of this article is twofold: (1) we propose a fully CNN-based TTS system that can be trained much faster than an RNN-based state-of-the-art neural TTS system, while the sound quality remains acceptable; and (2) we present an idea for rapidly training the attention module, which we call 'guided attention'.
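This excerpt does not spell out the guided attention loss itself; the sketch below follows the formulation commonly associated with DCTTS, in which attention weights are penalized in proportion to their distance from the roughly diagonal alignment expected between text position and output frame. Treat it as an illustrative assumption rather than the authors' verbatim definition.

```python
# A sketch of a guided-attention-style loss (an assumption based on
# the DCTTS literature, not quoted from this paper): attention mass
# far from the diagonal text/time alignment is penalized, nudging the
# attention matrix toward monotonicity early in training.
import torch

def guided_attention_loss(attn: torch.Tensor, g: float = 0.2) -> torch.Tensor:
    """attn: (N, T) attention matrix over N text positions, T frames."""
    N, T = attn.shape
    n = torch.arange(N, dtype=torch.float32).unsqueeze(1) / N  # (N, 1)
    t = torch.arange(T, dtype=torch.float32).unsqueeze(0) / T  # (1, T)
    # Weight is ~0 near the diagonal n/N == t/T and grows away from it.
    w = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g * g))
    return (attn * w).mean()

attn = torch.softmax(torch.randn(60, 120), dim=0)  # toy attention matrix
penalty = guided_attention_loss(attn)              # add to the main loss
print(float(penalty))
```

Because the penalty is differentiable and cheap to compute, it can simply be added to the spectrogram reconstruction loss during training.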

1.1 Related Work

Neural speech synthesis: Recently, there has been a surge of interest in speech synthesis with neural networks, including Deep Voice 1 [Arik et al., 2017a], Deep Voice 2 [Arik et al., 2017b], Deep Voice 3 [Ping et al., 2018], WaveNet [Oord et al., 2016a], SampleRNN [Mehri et al., 2016], Char2Wav [Sotelo et al., 2017], Tacotron [Wang et al., 2017], and VoiceLoop [Taigman et al., 2018]. Among these methods, sequence-to-sequence models with an attention mechanism [Ping et al., 2018; Wang et al., 2017; Sotelo et al., 2017] have a much simpler pipeline and can produce more natural speech [e.g., Shen et al., 2017]. In this work, we use Deep Voice 3 as the baseline multi-speaker model because of its simple convolutional architecture and its high efficiency in training and fast model adaptation. It should be noted that our techniques can be seamlessly applied to other neural speech synthesis models.

