Distinguishing AI-Generated Voices from Human Voices Using Spectral Analysis

International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 12 Issue: 07 | Jul 2025 | www.irjet.net

Agasthya Bhatia¹
¹Dhirubhai Ambani International School, Mumbai, Maharashtra, India

Abstract - This study investigates the effectiveness of spectral analysis techniques in distinguishing between AI-generated and human voices. Using frequency spectrum data from three human voice samples and five AI voice generation systems (Apple Translate, Google Translate, ElevenLabs, Murf Labs, and Natural Readers), we conducted a comprehensive spectral feature analysis including spectral centroid, bandwidth, rolloff, skewness, kurtosis, Shannon entropy, and high-frequency content ratios. Our findings reveal significant distinguishable patterns between human and AI voices, with particular differences in spectral centroid distribution, entropy levels, and high-frequency content. The analysis demonstrates that spectral analysis alone can provide moderate to strong distinguishing capability, with an overall classification potential of approximately 65-75%. Results show that human voices exhibit broader frequency utilization (4206 Hz average rolloff vs. 2903 Hz for AI), higher spectral complexity (6.387 vs. 6.017 bits of entropy), and more natural high-frequency content (29.78% vs. 21.96%). The study validates the hypothesis that human voices demonstrate more "free-flowing" frequency patterns than AI systems.

The motivation for this research stems from the increasing sophistication of AI voice synthesis systems and their potential misuse in various applications. Recent advances in neural voice synthesis, particularly with models like WaveNet, Tacotron, and more recent transformer-based approaches, have made it increasingly difficult to distinguish synthetic speech from human speech using traditional methods [4][5].
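As a toy illustration of how the reported group means could support the threshold-based classification the study describes, the sketch below votes across three features. The midpoint thresholds are derived from the figures quoted above purely for illustration; they are not the study's fitted classifier, and the function and field names are assumptions.

```python
# Toy threshold-based vote over three spectral features. The thresholds are
# the midpoints of the human vs. AI group means reported in the abstract
# (rolloff: 4206 vs 2903 Hz; entropy: 6.387 vs 6.017 bits; high-frequency
# content: 29.78% vs 21.96%) -- chosen here for illustration only.
THRESHOLDS = {
    "rolloff_hz": (4206 + 2903) / 2,      # 3554.5 Hz
    "entropy_bits": (6.387 + 6.017) / 2,  # 6.202 bits
    "hf_ratio": (0.2978 + 0.2196) / 2,    # 0.2587
}

def classify(features):
    """Majority vote: values above threshold point toward 'human'."""
    votes = sum(features[k] > t for k, t in THRESHOLDS.items())
    return "human" if votes >= 2 else "ai"

# Hypothetical feature vectors for two samples.
print(classify({"rolloff_hz": 4100, "entropy_bits": 6.35, "hf_ratio": 0.30}))  # human
print(classify({"rolloff_hz": 2950, "entropy_bits": 6.05, "hf_ratio": 0.21}))  # ai
```

A majority vote is used rather than a single cutoff so that no one feature dominates; the paper's own decision rule may differ.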

Key Words: Voice synthesis detection, spectral analysis, AI voice generation, deepfake detection, digital forensics, frequency domain analysis

1. INTRODUCTION

The rapid advancement of AI voice synthesis technology has created an urgent need for reliable detection methods to distinguish artificial voices from human speech. Modern text-to-speech (TTS) systems can produce increasingly realistic synthetic voices, raising concerns about potential misuse in deepfakes, fraud, and misinformation campaigns [1]. This study addresses the fundamental research question: can spectral analysis alone effectively distinguish between AI-generated and human voices?

1.1 Research Objectives

The primary objectives of this study are:

- To identify unique spectral characteristics that differentiate AI-generated and human voices
- To evaluate the effectiveness of statistical spectral analysis for voice authentication
- To develop threshold-based classification methods using spectral features
- To validate the hypothesis that human voices show more natural frequency distribution patterns

1.2 Research Scope

This research focuses on analyzing the frequency spectrum characteristics of a controlled dataset in which the same speech content (an excerpt of Martin Luther King Jr.'s "I Have a Dream" speech) is used across all samples, ensuring content consistency and eliminating content-based variation from the analysis.

2. LITERATURE REVIEW

Spectral analysis has been widely used in audio signal processing for voice characterization. The detection of synthetic speech has become increasingly important with the advancement of neural voice synthesis technologies [6].

Previous research has explored various approaches to synthetic voice detection, including machine learning techniques with complex neural networks and multi-modal analysis [2][3]. However, the computational complexity of these methods often limits real-time application. This study therefore focuses on statistical spectral analysis methods that could provide efficient, interpretable, and easily implementable solutions for voice authenticity verification.

2.1 Spectral Features in Voice Analysis

Key spectral features commonly employed in voice analysis include [7][8]:

Spectral Centroid: Represents the "center of mass" of the spectrum and is perceptually related to the brightness of a sound. It

© 2025, IRJET | Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal | Page 161
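The features named in Section 2.1 and the abstract can all be computed from a magnitude spectrum. The following is a minimal, illustrative NumPy sketch rather than the authors' implementation; the 85% rolloff percentage and the 4 kHz high-frequency cutoff are assumptions, since the paper's exact parameters are not given in this excerpt.

```python
import numpy as np

def spectral_features(signal, sr, hf_cutoff=4000.0, rolloff_pct=0.85):
    """Illustrative spectral features from a single magnitude spectrum."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    power = mag ** 2
    total = power.sum()

    # Spectral centroid: power-weighted mean frequency ("center of mass").
    centroid = (freqs * power).sum() / total

    # Spectral bandwidth: power-weighted standard deviation around the centroid.
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * power).sum() / total)

    # Spectral rolloff: frequency below which rolloff_pct of the power lies.
    cumulative = np.cumsum(power)
    rolloff = freqs[np.searchsorted(cumulative, rolloff_pct * total)]

    # Shannon entropy (bits) of the normalized power spectrum.
    p = power / total
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()

    # High-frequency content ratio: share of power at or above hf_cutoff.
    hf_ratio = power[freqs >= hf_cutoff].sum() / total

    return {"centroid": centroid, "bandwidth": bandwidth,
            "rolloff": rolloff, "entropy": entropy, "hf_ratio": hf_ratio}

# Sanity check: a pure 1 kHz tone at a 16 kHz sampling rate should place
# both the centroid and the rolloff at 1 kHz, with negligible HF content.
sr = 16000
t = np.arange(sr) / sr
feats = spectral_features(np.sin(2 * np.pi * 1000.0 * t), sr)
```

In practice these statistics would be averaged over short analysis frames rather than taken from one whole-file FFT, but the per-spectrum definitions are the same.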

