

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 12 Issue: 06 | Jun 2025 www.irjet.net p-ISSN:2395-0072

Predicting COVID-19 Hotspots Using Google Trends: A Correlation Analysis of Search Terms and Case Data in India

Garg 1

1Student, Grade 11, The Shri Ram School Moulsari, Gurugram, Haryana, India ***

Abstract - This study investigates the potential of Google Trends data as a predictive tool for identifying COVID-19 hotspots by analyzing the correlation between search term frequencies and confirmed case data across five Indian states during 2020-2021. Using Python-based correlation analysis, we examined 10 COVID-19-related keywords against daily and weekly case data from Andhra Pradesh, Delhi, Maharashtra, Uttar Pradesh, and Kerala. Our findings reveal distinct patterns between the pandemic's initial year and its second year: in 2020, terms like "Antibody" and "Loss of smell" showed the highest correlations, while 2021 demonstrated stronger positive correlations across most search terms. These results suggest that Google Trends data could serve as an early warning system for epidemic surveillance, though initial outbreak periods may introduce noise that affects predictive accuracy. This research contributes to the growing field of digital epidemiology and offers insights for public health preparedness strategies.

Key Words: Google Trends, COVID-19, epidemiological surveillance, correlation analysis, India, digital epidemiology

1. INTRODUCTION

1.1 Background

The COVID-19 pandemic has highlighted the critical importance of early detection and prediction systems for managing public health crises. As traditional epidemiological surveillance methods often lag behind real-time disease progression, researchers have increasingly turned to digital data sources for more timely insights. Among these, Google Trends has emerged as a particularly valuable tool for understanding public health concerns and potentially predicting disease outbreaks.

Previous research has established significant relationships between Google Trends data and COVID-19 case patterns across various geographical contexts. Ayyoubzadeh et al. (2020) analyzed correlations between Google Trends data for the term "COVID-19" and confirmed cases across eight countries, including the United States, Spain, Italy, France, the United Kingdom, China, Iran, and India, finding positive correlations across all regions. Similarly, Prasanth et al. (2021) employed multiple modeling approaches, including Linear, Negative Binomial, and Deep Neural Network models, to analyze relationships between Google Trends data for 13 keywords and confirmed COVID cases, with deep neural networks yielding the most accurate predictions.

Further studies have explored the nuances of these correlations. Ciufolini and Paolozzi (2020) conducted a comprehensive analysis across multiple countries from different continents, discovering that queries related to symptoms (particularly "fever" and "loss of smell") and terms explicitly mentioning COVID showed the highest correlations with case data. However, Cervellin et al. (2021) highlighted important methodological considerations, demonstrating that Google Trends data reliability varies significantly by geographical scale, proving most reliable for country-level analyses rather than regional comparisons.

Walker et al. (2020) examined lag and lead correlations between 10 keywords and COVID-19 cases across U.S. states, finding that terms like "face mask," "Lysol," and "COVID stimulus check" showed the strongest predictive correlations when analyzed with temporal offsets. Jokic et al. (2021) extended this work to Croatia, exploring both epidemiological predictions and socio-psychological consequences of the pandemic through search behavior. Notably, Sousa-Pinto et al. (2020) found that COVID-19-related Google Trends data often correlated more strongly with media coverage than with actual epidemic trends, particularly for terms like "anosmia" and "ageusia," highlighting the complex interplay between public awareness and disease prevalence.

1.2 Knowledge Gap

While existing research has demonstrated correlations between Google Trends data and COVID-19 cases, most studies have focused on either single time periods or limited geographical regions. A critical gap remains in understanding whether these correlations remain consistent across different phases of the pandemic and whether Google Trends data can reliably predict future epidemic hotspots, particularly in developing countries with diverse regional characteristics. This study addresses this gap by examining correlations across multiple Indian states over two distinct pandemic years.


1.3 Research Significance

The ability to predict epidemic hotspots using readily available digital data could revolutionize public health preparedness. Early identification of potential outbreak areas would enable healthcare systems to allocate resources more effectively, allowing hospitals to prepare equipment and personnel in advance. Additionally, governments could implement targeted interventions and preventive measures, potentially reducing disease transmission and saving lives. This research is particularly relevant for resource-constrained settings where traditional surveillance infrastructure may be limited.

1.4 Study Overview

This study employed Python-based correlation analysis to examine relationships between Google Trends search data and COVID-19 case numbers across five Indian states during 2020 and 2021. We analyzed both daily and weekly intervals to assess temporal granularity effects on correlation strength. Our methodology involved creating correlation heatmaps to visualize relationships between 10 carefully selected COVID-19-related keywords and active case data.

Our results reveal distinct patterns: in 2020, "Antibody" and "Loss of smell" showed the highest correlations with case data, while 2021 demonstrated broadly positive correlations across most search terms. These findings suggest that Google Trends data may serve as a valuable tool for epidemic prediction, though careful consideration must be given to the unique characteristics of initial outbreak periods.

2. METHODOLOGY

2.1 Research Aim

This study aimed to answer a fundamental question: Can Google Trends data be used to predict future epidemic hotspots? Specifically, we sought to determine whether search behavior patterns correlate with COVID-19 case data consistently enough to serve as an early warning system for disease outbreaks.

2.2 Data Collection

We collected and analyzed correlation data between Google Trends search frequencies and COVID-19 active case numbers across five Indian states: Andhra Pradesh, Delhi, Maharashtra, Uttar Pradesh, and Kerala. Data was collected for both 2020 and 2021 to capture different phases of the pandemic. To assess the impact of temporal granularity on correlation strength, we analyzed data at both daily and weekly intervals, creating correlation heatmaps to visualize relationships between variables.

2.3 Data Sources

2.3.1 Google Trends Data

We utilized Google Trends (https://trends.google.com/trends/) to collect search frequency data for the following 10 keywords:

1. Antibody - Selected to capture public interest in immunity testing

2. Coronavirus - General term for the virus

3. Coronavirus Symptoms - To track symptom-related searches

4. Coronavirus Vaccine - Monitoring vaccine-related interest

5. Covid-19 - Official disease designation

6. Covid Symptoms - Alternative symptom search term

7. Face Mask - Preventive measure searches

8. Fever - Primary COVID-19 symptom

9. Loss of Smell - Distinctive COVID-19 symptom

10. Combined symptoms (Sore Throat + Shortness of Breath + Fatigue + Cough) - Composite symptom searches

These keywords were selected based on their relevance to the COVID-19 pandemic and their popularity in search queries, ensuring comprehensive coverage of both disease awareness and symptom-related searches.

2.3.2 COVID-19 Case Data

COVID-19 case data was obtained from PRS India (https://prsindia.org/covid-19/cases), which provides comprehensive, state-wise daily updates on confirmed cases, active cases, recoveries, and deaths. This source was chosen for its reliability and consistent data reporting methodology across all Indian states.

2.4 Data Analysis

2.4.1 Data Processing

Data alignment was crucial for accurate correlation analysis. We synchronized dates between Google Trends and COVID-19 datasets, ensuring temporal consistency. For weekly interval analysis, we aggregated both COVID-19 case numbers and Google Trends search volumes using Python, calculating weekly sums to reduce daily fluctuations and reveal underlying trends.
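The alignment and weekly aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual script: the column names (`date`, `fever`, `active_cases`), the synthetic values, and the Monday-to-Sunday week definition are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for one state's data (hypothetical values and column names).
dates = pd.date_range("2021-01-04", periods=28, freq="D")  # 4 January 2021 is a Monday
trends = pd.DataFrame({"date": dates, "fever": np.arange(28) % 7 + 10})
# The case series deliberately starts three days later than the search series.
cases = pd.DataFrame({"date": dates[3:], "active_cases": np.arange(25) * 100})

# Date synchronization: keep only the dates present in both datasets.
merged = trends.merge(cases, on="date", how="inner").set_index("date")

# Weekly aggregation: sum every column over weeks ending on Sunday.
weekly = merged.resample("W-SUN").sum()

print(weekly)
```

An inner join drops any day that is missing from either source, so the first weekly bin in this example contains only four days. Whether partial weeks should be kept or discarded is a design choice the sketch leaves open.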


2.4.2 Statistical Analysis

We employed Python for all data processing and analysis tasks, utilizing the following libraries:

● Pandas: For data manipulation and alignment

● NumPy: For numerical computations

● Matplotlib and Seaborn: For creating correlation heatmaps

● SciPy: For calculating correlation coefficients

Pearson correlation coefficients were calculated between each Google Trends keyword and active COVID-19 cases for each state, generating separate heatmaps for 2020 and 2021, as well as for daily and weekly intervals.
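A per-keyword Pearson calculation of this kind might look like the sketch below. The data are synthetic, and the helper name `keyword_correlations` and the column names are hypothetical; only `scipy.stats.pearsonr` and the pandas calls are real library functions.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic data for one hypothetical state: one keyword tracks cases, one does not.
rng = np.random.default_rng(0)
n = 60
cases = np.linspace(100, 5000, n)
df = pd.DataFrame({
    "active_cases": cases,
    "fever": 0.01 * cases + rng.normal(0, 5, n),  # rises with cases, plus noise
    "face_mask": rng.normal(50, 10, n),           # unrelated to cases
})

def keyword_correlations(frame, target="active_cases"):
    """Return Pearson r (and p-value) of every keyword column against the target column."""
    rows = {}
    for col in frame.columns.drop(target):
        r, p = pearsonr(frame[col], frame[target])
        rows[col] = {"r": r, "p_value": p}
    return pd.DataFrame.from_dict(rows, orient="index")

result = keyword_correlations(df)
print(result)
```

Running the same function once per state, year, and interval would produce the inputs for each heatmap. `pearsonr` also returns a p-value, which is kept here for completeness.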

2.4.3 Rationale for Analytical Methods

Python was selected as our primary analytical tool due to its robust data science libraries and visualization capabilities. Correlation heatmaps were chosen for their ability to simultaneously display multiple variable relationships, making pattern identification intuitive and visually accessible. This approach facilitates quick identification of strong correlations and enables comparative analysis across different states and time periods.
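As an illustration of the heatmap approach, the sketch below builds a correlation matrix from synthetic data and draws it as an annotated grid. It uses Matplotlib alone to keep dependencies light; Seaborn's `heatmap` function would accept the same matrix. The column names, values, and title are hypothetical and do not reproduce any figure in this paper.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic weekly data (hypothetical): one positive, one negative, one unrelated keyword.
rng = np.random.default_rng(1)
n = 52
cases = np.linspace(0, 1000, n)
df = pd.DataFrame({
    "Active cases": cases,
    "Fever": cases + rng.normal(0, 100, n),
    "Antibody": -cases + rng.normal(0, 300, n),
    "Face Mask": rng.normal(0, 1, n),
})

# Pairwise Pearson correlation matrix across all columns.
corr = df.corr(method="pearson")

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the color scale to [-1, 1] so colors are comparable across states and years.
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
# Write the coefficient inside each cell.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im, ax=ax, label="Pearson r")
ax.set_title("Hypothetical state, weekly (synthetic data)")
fig.tight_layout()
```

Fixing the color limits at -1 and 1 is a deliberate choice in this sketch: it keeps the color of a given coefficient identical across all heatmaps, which matters when comparing states and years side by side.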

3. RESULTS

3.1 Overall Patterns

3.1.1 Year 2020 Results

During the pandemic's first year, correlation patterns showed considerable variability:

● "Antibody" and "Loss of smell" consistently demonstrated the highest positive correlations across most states

● Initial pandemic months (March-April 2020) showed a spike in search volumes that did not align with case numbers, likely due to heightened public awareness preceding actual case surges

● Daily and weekly interval analyses showed similar patterns, with weekly data providing slightly smoother correlation values

3.1.2 Year 2021 Results

The second year showed markedly different patterns:

● Most search terms exhibited strong positive correlations with COVID-19 cases

● Correlation values were generally higher and more consistent than in 2020

● The relationship between search behavior and actual cases appeared more synchronized

3.2 State-Specific Findings

3.2.1 Andhra Pradesh

● 2020: Moderate correlations, with "Antibody" (r ≈ 0.65) and "Loss of smell" (r ≈ 0.58) showing strongest relationships

● 2021: Strong positive correlations across all terms (r > 0.7 for most keywords)

3.2.2 Delhi

● 2020: Variable correlations, with symptom-related searches showing stronger relationships

● 2021: Consistently high correlations, particularly for "Covid-19" and "Coronavirus" terms

3.2.3 Maharashtra

● 2020: Similar to overall pattern, with "Antibody" searches showing highest correlation

● 2021: Very strong correlations across all search terms, suggesting synchronized public awareness and disease spread

3.2.4 Uttar Pradesh

● 2020: Moderate correlations with significant variability

● 2021: Improved correlation strength across all keywords

3.2.5 Kerala

● Notable exception to general patterns

● 2020: Weak or negative correlation for "Antibody" searches

● 2021: Relatively weaker positive correlations compared to other states

● Geographic location in southern India may contribute to different search behavior patterns


3.3 Temporal Interval Analysis

Comparison between daily and weekly intervals revealed:

● Weekly aggregations provided marginally more stable correlation values

● Overall patterns remained consistent regardless of temporal granularity

● Weekly analysis effectively reduced noise from day-to-day fluctuations

3.4 Correlation Heatmaps

Figure 1: Correlation Heatmap of Andhra Pradesh Daily (2020)

Figure 2: Correlation Heatmap of Andhra Pradesh Weekly (2020)

Figure 3: Correlation Heatmap of Andhra Pradesh Daily (2021)

Figure 4: Correlation Heatmap of Andhra Pradesh Weekly (2021)

Figure 5: Correlation Heatmap of Delhi Daily (2020)


Figure 6: Correlation Heatmap of Delhi Weekly (2020)

Figure 7: Correlation Heatmap of Delhi Daily (2021)

Figure 8: Correlation Heatmap of Delhi Weekly (2021)

Figure 9: Correlation Heatmap of Maharashtra Daily (2020)

Figure 10: Correlation Heatmap of Maharashtra Weekly (2020)

© 2025, IRJET | Impact Factor value: 8.315 | ISO 9001:2008

Figure 11: Correlation Heatmap of Maharashtra Daily (2021)


Figure 12: Correlation Heatmap of Maharashtra Weekly (2021)

Figure 13: Correlation Heatmap of Uttar Pradesh Daily (2020)

Figure 14: Correlation Heatmap of Uttar Pradesh Weekly (2020)

Figure 15: Correlation Heatmap of Uttar Pradesh Daily (2021)

Figure 16: Correlation Heatmap of Uttar Pradesh Weekly (2021)

Figure 17: Correlation Heatmap of Kerala Daily (2020)


Figure 18: Correlation Heatmap of Kerala Weekly (2020)

Figure 19: Correlation Heatmap of Kerala Daily (2021)

Figure 20: Correlation Heatmap of Kerala Weekly (2021)

4. DISCUSSION

4.1 Interpretation of Correlation Patterns

The observed correlation patterns suggest a complex relationship between public search behavior and COVID-19 case dynamics. The stronger correlations in 2021 compared to 2020 likely reflect several factors:

1. Learning Effect: By 2021, the public had developed a better understanding of COVID-19 symptoms and when to seek information, leading to more synchronized search patterns with actual case occurrences.

2. Media Influence: The initial 2020 surge in searches, particularly in March when COVID-19 was declared a pandemic, created a disconnect between search volumes and actual case numbers. This aligns with Sousa-Pinto et al.'s (2020) findings that Google Trends data often correlates more with media coverage than epidemic trends.

3. Symptom Awareness: The high correlations for "Antibody" and "Loss of smell" in 2020 suggest that as the pandemic progressed, people became more concerned with specific symptoms and testing, rather than general information about the virus.

4.2 Geographic Variation

Kerala's divergent pattern deserves special attention. Several factors may explain this anomaly:

● Higher Health Literacy: Kerala has India's highest literacy rate and well-developed healthcare infrastructure, potentially leading to different information-seeking behaviors

● Early Response: Kerala's proactive pandemic response may have altered the typical relationship between searches and cases

● Geographic Isolation: Being in southern India, Kerala may have experienced different disease transmission patterns

4.3 Limitations

Several limitations must be acknowledged in interpreting these results:

1. Initial Surge Bias: The March 2020 search surge, driven by global pandemic declaration rather than local cases, introduces noise particularly problematic for 2020 analyses. This highlights the importance of considering media influence when using Google Trends for epidemiological predictions.

2. Keyword Selection: While we selected popular and relevant terms, the choice of keywords inherently limits the scope of analysis. Other relevant terms or local language searches were not included.

3. Data Granularity: Google Trends provides relative search volumes rather than absolute numbers, which may mask important variations in actual search behavior.

4. Confounding Variables: We did not account for factors such as internet penetration rates, demographic differences, or varying healthcare access across states, which could influence both search behavior and case reporting.

5. Outlier Management: We did not systematically remove outliers from our datasets, which might have improved correlation accuracy but could also risk losing important epidemic signals.

4.4 Implications for Public Health

Despite limitations, our findings have significant implications for public health practice:

1. Early Warning Systems: The strong correlations observed, particularly in 2021, suggest that Google Trends could complement traditional surveillance systems, providing early signals of increasing disease activity.

2. Resource Allocation: Hospitals and healthcare systems could monitor search trends to anticipate surges in patient loads, allowing proactive preparation of equipment and staffing.

3. Targeted Interventions: Governments could use search data to identify areas of increasing concern and implement targeted public health measures, including testing campaigns or movement restrictions.

4. Public Communication: Understanding search patterns can help health authorities tailor their communication strategies to address public concerns and information needs.

4.5 Recommendations for Implementation

To effectively utilize Google Trends for epidemic prediction:

1. Develop Composite Indices: Combine multiple search terms to create more robust predictive indicators

2. Account for Media Effects: Implement methods to filter out media-driven search spikes

3. Integrate Multiple Data Sources: Combine Google Trends with other digital data sources (social media, mobility data) for comprehensive surveillance

4. Establish Baselines: Create location-specific baseline search patterns to better identify anomalies

5. Real-time Monitoring: Develop automated systems to track and alert on significant changes in search patterns

4.6 Future Research Directions

This study opens several avenues for future investigation:

1. Expanded Geographic Scope: Analyze all Indian states to identify national patterns and regional variations

2. Advanced Modeling: Apply machine learning techniques, including neural networks and ensemble methods, to improve predictive accuracy

3. Multi-language Analysis: Include searches in regional languages to capture more comprehensive search behavior

4. Cross-pandemic Validation: Apply similar methodology to other epidemics (dengue, influenza) to validate the approach

5. Temporal Dynamics: Investigate optimal lag times between search queries and case occurrence for maximum predictive value

6. Demographic Segmentation: If available, analyze age and gender-specific search patterns to understand population-specific risks

5. CONCLUSIONS

This study demonstrates that Google Trends data holds promise as a supplementary tool for predicting COVID-19 hotspots, though its effectiveness varies by time period and geographic location. Our analysis of five Indian states over 2020-2021 revealed that search behavior patterns, particularly for terms like "Antibody" and "Loss of smell" in 2020, and most COVID-related terms in 2021, correlate significantly with active case numbers.

The stronger correlations observed in 2021 suggest that as public understanding of the pandemic matured, search behavior became more predictive of actual disease patterns. This finding underscores the potential for Google Trends to serve as an early warning system for epidemic surveillance, particularly after initial outbreak periods when media-driven searches stabilize.

However, the geographic variations observed, particularly Kerala's divergent patterns, remind us that local factors significantly influence the relationship between search behavior and disease prevalence. Future epidemic preparedness strategies should consider integrating Google Trends data with traditional surveillance methods while accounting for local contexts and potential confounding factors.

As we face the possibility of future pandemics, developing robust digital epidemiology tools becomes increasingly critical. This research contributes to that goal by demonstrating both the potential and limitations of using search data for disease prediction. With further refinement and integration with other data sources, Google Trends could become a valuable component of comprehensive epidemic early warning systems, ultimately helping save lives through improved preparedness and response.

ACKNOWLEDGEMENT

I extend my sincere gratitude to Google Trends for providing open access to search data that made this analysis possible. I also thank PRS India for maintaining comprehensive COVID-19 case data throughout the pandemic, enabling researchers to conduct vital epidemiological studies. This research would not have been possible without these publicly available datasets and the organizations' commitment to data transparency.

REFERENCES

[1] Ayyoubzadeh, S. M., Ayyoubzadeh, S. M., Zahedi, H., Ahmadi, M., & Kalhori, S. R. N. (2020). Predicting COVID-19 incidence through analysis of Google Trends data in Iran: Data mining and deep learning pilot study. JMIR Public Health and Surveillance, 6(2), e18828. https://link.springer.com/article/10.1007/s10916-020-01588-5

[2] Cervellin, G., Comelli, I., & Lippi, G. (2021). Is Google Trends a reliable tool for digital epidemiology? Insights from different clinical settings. Frontiers in Research Metrics and Analytics, 6, 670226. https://www.frontiersin.org/journals/research-metrics-and-analytics/articles/10.3389/frma.2021.670226/full

[3] Ciufolini, I., & Paolozzi, A. (2020). An improved mathematical prediction of the time evolution of the Covid-19 pandemic in Italy, with a Monte Carlo simulation and error analyses. International Journal of Environmental Research and Public Health, 19(19), 12394. https://www.mdpi.com/1660-4601/19/19/12394

[4] Google Trends. (2021). Search trends data. https://trends.google.com/trends/

[5] Jokic, D., Turk, T., & Petrovic, M. (2021). Google Trends as a method to predict new COVID-19 cases and socio-psychological consequences of the pandemic. EconStor. https://www.econstor.eu/handle/10419/235602

[6] Prasanth, S., Singh, U., Kumar, A., Tikkiwal, V. A., & Chong, P. H. (2021). Forecasting spread of COVID-19 using Google Trends: A hybrid approach. IEEE Access, 9, 37785-37795. https://ieeexplore.ieee.org/abstract/document/9377852

[7] PRS India. (2021). COVID-19 cases in India. https://prsindia.org/covid-19/cases

[8] Sousa-Pinto, B., Anto, A., Czarlewski, W., Anto, J. M., Fonseca, J. A., & Bousquet, J. (2020). Assessment of the impact of media coverage on COVID-19-related Google Trends data: Infodemiology study. Journal of Medical Internet Research, 22(8), e19611. https://www.jmir.org/2020/8/e19611/

[9] Walker, A., Hopkins, C., & Surda, P. (2020). Use of Google Trends to investigate loss-of-smell-related searches during the COVID-19 outbreak. Mayo Clinic Proceedings, 95(9), 1904-1912. https://www.mayoclinicproceedings.org/article/S0025-6196(20)30934-4/fulltext
