
International Research Journal of Engineering and Technology (IRJET) Volume: 09 Issue: 11 | Nov 2022

www.irjet.net

e-ISSN: 2395-0056 p-ISSN: 2395-0072

Autotuned voice cloning enabling multilingualism

Prof. Priyadarshani Doke1, Piyush Jaiswal2, Neha Karmal3, Vivek Patil4, Samnan Shaikh5

ALARD COLLEGE OF ENGINEERING & MANAGEMENT (ALARD Knowledge Park, Survey No. 50, Marunji, Near Rajiv Gandhi IT Park, Hinjewadi, Pune-411057). Approved by AICTE. Recognized by DTE. NAAC Accredited. Affiliated to SPPU (Pune University).

Abstract - This article describes a neural network-based text-to-speech (TTS) synthesis system that can generate spoken audio in a variety of speaker voices, including those not seen during training. We show that the proposed model can translate natural-language text into a target language and synthesize it as natural-sounding speech. We quantify the importance of trained voice modules for obtaining the best generalization performance. Finally, using randomly selected speaker embeddings, we show that speech can be synthesized in new speaker voices not used in training, and that the model has learned high-quality speaker representations. We have also introduced a multilingual system and an auto-tuner; the multilingual system translates regular text into another language, which makes multilingual output possible.

Key Words: Text to speech, Speech Synthesizer, Voice Cloning, Auto-tuner, Multilingualism …

1. Introduction

Voice cloning is the process in which one uses a computer to generate the speech of a real individual, creating a clone of their specific, unique voice using neural networks. A text-to-speech (TTS) system simply converts text to speech. In this project we use TTS systems trained on datasets of paired text and audio, so the system learns the sound (e.g., the waveform) of letters, words, and sentences. However, the resulting voice is the same as the one present in the training dataset, which means that to produce a specific voice the TTS system needs to be trained on the target voice. The input is ordinary language text. Synthetic speech can be generated by concatenating recorded speech segments. Alternatively, synthesizers can combine voice models with other characteristics of the human voice to produce fully "synthesized" speech output.

1.1 Voice Cloning

As noted above, voice cloning uses neural networks to generate the speech of a real individual, creating a clone of their specific, unique voice. The model is composed of an encoder and a decoder, and converts text into audio using a vocoder. After receiving the text data, the model detects the endpoint and evaluates whether the voice is detected clearly. We also use an auto-tuner to alter tone and pitch and to smooth the voice. At present, it covers 60+ languages. According to the paper, the latest multilingual text-to-speech systems either require a large amount of training data or handle only two to three languages; in this model, by contrast, training on a small amount of data with deep learning techniques yielded high-quality synthetic speech and stable voice cloning across multiple languages (English, French, Chinese, and Russian).

1.2 TTS

The goal of this paper is to build a TTS system that can generate natural speech for a variety of speakers in a data-efficient manner. Speech synthesis is a technology that allows a computer to convert written text into speech, delivered through a loudspeaker or telephone. Because it is an emerging technology, not all developers are familiar with it. We specifically address a zero-shot learning setting, where several seconds of untranscribed reference audio from a target speaker are used to synthesize new speech in that speaker's voice, without updating any model parameters. However, it is also important to note the potential for misuse of this technology, for example impersonating someone's voice without their consent. To address such safety concerns, we verify that voices generated by the proposed model can easily be distinguished from real voices.

1.3 Text-To-Speech Synthesis

A speech synthesis system is, by definition, a system that produces synthetic speech. It is implicitly clear that this involves some kind of input. What is not clear is the type of this input. If the input is plain text that does not contain additional phonetic and/or phonological information, the system may be called a Text-To-Speech (TTS) system. As shown, the synthesis starts from text input. Nowadays this may be plain text or marked-up text, e.g., HTML or something similar such as JSML (Java Speech Markup Language).
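The zero-shot setting described in Section 1.2 can be illustrated with a minimal sketch. It assumes (the paper does not state this here) that a speaker encoder maps each reference utterance to a fixed-length vector. The helper names `l2_normalize`, `speaker_embedding`, and `cosine_similarity`, and the toy 3-dimensional vectors, are hypothetical stand-ins; a real system would obtain the per-utterance vectors from a trained encoder network. The sketch only shows how a few reference vectors could be pooled into one speaker embedding and compared, with no model parameters being updated.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def speaker_embedding(utterance_vectors: Sequence[Sequence[float]]) -> Vector:
    """Pool per-utterance vectors into one unit-length speaker embedding.

    Each utterance vector is normalized, the results are averaged,
    and the mean is normalized again.
    """
    if not utterance_vectors:
        raise ValueError("need at least one reference utterance")
    normalized = [l2_normalize(v) for v in utterance_vectors]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy data (assumed, not from the paper): two reference clips of the
# target speaker and one clip of a different speaker.
target_refs = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]
other_refs = [[0.0, 1.0, 0.2]]

target = speaker_embedding(target_refs)
other = speaker_embedding(other_refs)

# A new clip is scored against both speaker embeddings.
new_clip = [0.95, 0.05, 0.05]
print("similarity to target speaker:", cosine_similarity(new_clip, target))
print("similarity to other speaker: ", cosine_similarity(new_clip, other))
```

In a full pipeline, a pooled embedding of this kind would condition the synthesizer so that the generated speech takes on the target speaker's characteristics; the same similarity score is also one simple way to check how close a synthesized clip is to the reference voice.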

Impact Factor value: 7.529

1.4 Auto Tuner

Auto-Tune uses a proprietary device to measure and alter the pitch of vocal and instrumental music recordings and performances. The training data consists of performance

ISO 9001:2008 Certified Journal
