SynthoMed AI: Generating Synthetic Medical Record via an AI Chat bot

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 13 Issue: 01 | Jan 2026 www.irjet.net p-ISSN: 2395-0072

SURYA G 1 , MOHAMMED RASOOL R 2 , SHRINANDA HK 3, MOHAMMED FLAH LAHORI 4

1,2,3,4 Dept. of Artificial Intelligence & Machine Learning Engineering, The Oxford College of Engineering, Bommanahalli, Bengaluru-68, Karnataka, India

5 Asst. Prof. Mary Anitha T, Dept. of AIML Engineering, The Oxford College of Engineering, Bommanahalli, Bengaluru-68, Karnataka, India

Abstract - Access to authentic medical records is an essential part of healthcare education, yet it is often restricted due to strict patient privacy regulations and ethical concerns. This paper presents a novel approach that addresses these challenges: an AI-driven chatbot system designed to generate high-fidelity synthetic medical records. The project methodology encompasses a complete development lifecycle, including data pre-processing, AI model training, chatbot integration, and final deployment. By leveraging Natural Language Processing (NLP) and probabilistic sampling, the system allows students and healthcare learners to query and generate realistic patient data scenarios without risking the exposure of sensitive personal information. The results demonstrate a functional, interactive tool that democratizes access to medical data for training purposes. Furthermore, the development process highlights the efficacy of combining deep learning with conversational interfaces to solve practical challenges in health informatics and technical education.

Key Words: Synthetic, AI Chatbot, Synthetic Medical Data (SMD), Healthcare Privacy, Generative Adversarial Networks (GANs), Natural Language Toolkit (NLTK), Synthetic Data Generation (SDG), Medical Informatics, Health Insurance Portability and Accountability Act (HIPAA), Personally Identifiable Information (PII), Electronic Health Records (EHR), Natural Language Processing (NLP).

1. INTRODUCTION

1.1 Overview

The healthcare industry is undergoing rapid transformation, driven significantly by advances in data analysis and artificial intelligence (AI). High-quality medical data forms the essential foundation for numerous critical activities, such as training future healthcare professionals, supporting clinical research, and developing robust AI diagnostic tools. Electronic Health Records (EHRs) have become the digital standard, encapsulating patient histories, laboratory results, and treatment plans in a structured format. However, a persistent tension exists between the immense value of this data and the paramount need to protect patient privacy. This paper introduces a novel approach designed to navigate this tension: an AI-powered chatbot system named SynthoMed AI, which generates realistic synthetic medical records. By generating synthetic patient profiles that replicate the statistical and linguistic patterns of real data without containing any actual personal information, this system aims to make high-quality medical data accessible to all for research, education, and innovation, while upholding strict ethical standards of patient confidentiality.

1.2 Problem Statement

Access to real-world medical records for educational and research purposes is restricted, which creates a well-documented barrier to advancement in healthcare. This limitation arises from rigorous ethical and legal frameworks, especially those surrounding the strict protection of Personally Identifiable Information (PII) through regulations such as the Health Insurance Portability and Accountability Act (HIPAA). These regulations protect patient rights, but they have the unintended effect of creating formidable barriers in the path of students, researchers, and developers. These individuals require rich, realistic datasets in order to hone analytical skills and test new software. The result is a scarcity of practical, legally compliant learning resources. This gap not only limits practical experience but also stifles innovation, highlighting the urgent need for a solution that can reconcile the dual imperatives of privacy preservation and data accessibility.

1.3 Objectives

The main goal of this project is to develop a reliable and secure system that provides equal access to medical data for learning and experimentation. We aim to construct a functional pipeline that processes user requests, generates medically plausible data, and delivers it in a user-friendly format. The system is designed to be robust, scalable, and suitable for deployment in test and educational environments.

To realize this vision, we have established three concrete objectives:

• To design and implement an intuitive conversational interface capable of understanding a wide range of medical queries posed in natural language.

• To create a robust data generation engine that generates comprehensive, statistically valid synthetic patient records based on user input and medical rules.

• To demonstrate a practical alternative to traditional methods by deploying a fully functional system that replaces the need for real or crudely anonymized data in educational and prototyping contexts.

2. RELATED WORK

Health informatics researchers have been trying to strike a balance between patient privacy and data analysis for over a decade. They have attempted to utilize de-identification frameworks, such as k-anonymity and differential privacy [1], [2]. However, attackers have successfully employed re-identification techniques, matching anonymous records with external sources and thereby disclosing patient information [3], [4].

(Search keywords: HIPAA privacy, k-anonymity, data re-identification attacks.)

The research community turned to Synthetic Data Generation (SDG) as a safer alternative. SDG techniques create artificial datasets that mimic real-world data distributions without incorporating any genuine patient records [5]. Generative Adversarial Networks (GANs) have emerged as a leading architecture in this field. Models like MedGAN and CorGAN have generated realistic electronic health records and high-quality medical images, providing valuable resources for training and experimentation [6], [7].

(Search keywords: GANs in healthcare, MedGAN, synthetic medical data.)

Parallel progress in Natural Language Processing (NLP) has also transformed how unstructured medical text, including clinical narratives, examination notes, and discharge summaries, is interpreted and produced. Transformer-based architectures, including BERT, BioBERT, and BioGPT, have shown strong performance in understanding clinical vocabulary and contextual relationships, allowing for advanced text generation and information extraction in healthcare settings [8], [9], [10].

(Search keywords: NLP in medicine, BERT for clinical notes, BioGPT.)

Conversational agents in healthcare have gained significant traction. Chatbots now assist in triaging patient symptoms, supporting mental health, providing medical instructions, and scheduling appointments [11], [12]. Educational chatbots are also used to enhance learner engagement and offer quick access to medical knowledge [13]. However, most data generation tools are technical and require users to write scripts or code [14]. This presents a problem for students, educators, and practitioners: they need realistic data but lack the ability to program. Recent work proposes combining generative AI models with intuitive chatbot interfaces, enabling users to synthesize medical data [15].

(Search keywords: Healthcare chatbots, AI in medical education.)

3. PROPOSED SYSTEM

3.1 System Overview

The proposed system is an interactive, AI-powered chatbot designed to generate synthetic medical records on demand. Unlike static datasets or de-identified records that pose potential privacy risks, this system utilizes a generative AI model to create entirely fictitious yet statistically realistic patient profiles. The system serves as a bridge between complex data generation algorithms and end-users (students, researchers), allowing them to obtain specific medical data points (such as demographics, vital signs, and diagnosis histories) simply by conversing with the bot in natural language.

3.2 System Architecture

The system is built upon a modular architecture comprising three main layers:

1. User Interface (UI) Layer: A web-based chatbot interface (developed using frameworks like Streamlit or Flask) that captures user prompts. It focuses on accessibility, ensuring non-technical users can easily request data.

2. Application Logic Layer: This is the brain of the system. It contains the Natural Language Processing (NLP) engine that parses user queries to understand the intent, and it communicates with the generative model to fetch or create the required data attributes.

3. Data & Model Layer: This layer houses the trained AI models responsible for synthetic data generation and a NoSQL database for storing interaction logs and generated records for session continuity.

[The diagram shows an arrow flowing from User -> Chatbot Interface -> NLP Engine -> Generative Model -> Database -> and back to User.]

3.3 Workflow (IPO Model)

The core IPO (Input → Processing → Output) functionality follows a linear data flow:

Input: The user provides a natural language prompt, such as "Create a record for a 45-year-old male with Type 2 Diabetes symptoms."

Processing:

1. Tokenization & Intent Recognition: The NLP engine preprocesses the text and extracts the key entities (Age: 45, Gender: Male, Condition: Diabetes).

2. Data Synthesis: The AI model uses these entities as seeds to generate complementary synthetic data (e.g., elevated blood sugar levels, specific medication names like Metformin) based on medical rules.

3. Validation: The system runs logic checks to ensure the generated data is medically coherent (e.g., ensuring a male patient is not assigned pregnancy-related conditions).

Output: The system presents the synthetic record in a structured JSON/Table format within the chat window, ready for the user to view or download.
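The linear Input → Processing → Output flow can be sketched as a single function that chains the three processing stages. This is a minimal illustration, not the authors' implementation: `run_ipo` and the three `demo_*` stand-in stages are hypothetical names introduced here only to show the wiring.

```python
import json
from typing import Callable


def run_ipo(prompt: str,
            extract: Callable[[str], dict],
            synthesize: Callable[[dict], dict],
            validate: Callable[[dict], list]) -> str:
    """Input -> Processing (extract, synthesize, validate) -> Output (JSON)."""
    entities = extract(prompt)        # 1. tokenization and entity extraction
    record = synthesize(entities)     # 2. data synthesis seeded by the entities
    problems = validate(record)       # 3. medical coherence checks
    if problems:
        raise ValueError("incoherent record: " + "; ".join(problems))
    return json.dumps(record, indent=2)


# Hypothetical stand-in stages, kept trivial so the wiring is easy to follow.
def demo_extract(prompt: str) -> dict:
    words = prompt.lower().replace("-", " ").split()
    entities = {}
    if "male" in words:
        entities["gender"] = "Male"
    if "diabetes" in words:
        entities["condition"] = "Type 2 Diabetes"
    return entities


def demo_synthesize(entities: dict) -> dict:
    record = dict(entities)
    is_diabetic = entities.get("condition") == "Type 2 Diabetes"
    record["medications"] = ["Metformin"] if is_diabetic else []
    return record


def demo_validate(record: dict) -> list:
    if record.get("gender") == "Male" and record.get("pregnant"):
        return ["male patient with pregnancy-related condition"]
    return []


print(run_ipo("Create a record for a 45-year-old male with Type 2 Diabetes symptoms.",
              demo_extract, demo_synthesize, demo_validate))
```

Passing the stages in as arguments keeps each one replaceable, which matches the modular architecture described in Section 3.2.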

3.4 Workflow Steps

The operational workflow of the system is as follows:

1. User Initialization: The user accesses the web portal and initiates the chat session.

2. Query Submission: The user types a request for specific medical data scenarios.

Figure 1: System Architecture

3. Intent Analysis: The NLP module decodes the request parameters.

4. Generation Phase: The generative algorithm constructs the patient profile.

5. Response Delivery: The chatbot displays the generated record.

6. Iterative Refinement: The user can ask for modifications (e.g., "Add a high blood pressure reading"), and the system updates the record in real time.
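The iterative refinement step can be pictured as applying a small set of field updates to the record already held in the session. The sketch below assumes a phrase-to-fields lookup table; the table `REFINEMENTS`, the function `refine`, and the field names are hypothetical, and the temperature value is purely illustrative.

```python
import copy

# Hypothetical refinement rules: phrase found in the request -> fields to overwrite.
REFINEMENTS = {
    "high blood pressure": {"blood_pressure": "160/95"},
    "fever": {"temperature_c": 38.9},  # illustrative value only
}


def refine(record: dict, request: str) -> dict:
    """Return an updated copy of the record; the session's original is untouched."""
    updated = copy.deepcopy(record)
    text = request.lower()
    for phrase, fields in REFINEMENTS.items():
        if phrase in text:
            updated.update(fields)
    return updated


current = {"age": 60, "gender": "Male", "blood_pressure": "120/80"}
print(refine(current, "Add a high blood pressure reading"))
```

Returning a copy rather than mutating in place makes it straightforward to keep earlier versions of the record for session continuity.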

3.5 Comparison and Improvements

This system introduces significant improvements over existing data access methods:

Table 1: Comparison Table

| Feature | Traditional Anonymization | Existing Static Datasets | Proposed AI Chatbot System |
|---|---|---|---|
| Privacy Risk | Medium (Re-identification possible) | Low | Zero (Data is 100% synthetic) |
| Accessibility | Restricted / Requires Approval | Public but limited variety | Instant / On-Demand |
| User Interface | Raw CSV / Database Files | Manual Search | Conversational (Natural Language) |
| Customization | None (What you see is what you get) | Low | High (Generate exactly what you need) |

By shifting from static file retrieval to dynamic generation, the proposed system solves the "data scarcity" problem while ensuring strict adherence to ethical standards.

4. METHODOLOGY and IMPLEMENTATION

This section outlines the technical implementation of the system, detailing the software stack, algorithmic approach, and the step-by-step logic used to transform user queries into structured medical records.

4.1 Technology Stack

The system was developed using a Python-based ecosystem due to its support for data science and natural language processing libraries.

Programming Language: Python 3.9+

Frontend: Streamlit (Web Interface).

Backend: Flask (for API routing) / Python Native Scripts.

Natural Language Processing (NLP): NLTK (Natural Language Toolkit) and spaCy (used for text preprocessing, tokenization, and entity extraction).

Data Generation: Faker Library (customized medical providers), NumPy and Pandas (probabilistic distribution).

Database: MongoDB (NoSQL).

4.2 Implementation Logic

The core logic follows a pipeline approach:

Step 1: Preprocessing & Intent Recognition. The user's text input is first cleaned (removal of stop words, lowercasing). We employed a keyword-spotting algorithm enhanced with NLTK to identify the core intent. Algorithm: Tokenization -> Stop_Word_Removal -> Keyword_Matching.

Example: The input "Give me a cardiac case" is mapped to the intent: generate_cardiac_profile.
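The Tokenization -> Stop_Word_Removal -> Keyword_Matching chain can be sketched in a few lines. To stay self-contained, this sketch uses a regular expression and a small hand-written stop-word set in place of NLTK's tokenizer and stop-word corpus. The keyword table is an assumption: only the intent name generate_cardiac_profile appears in the paper, and the other entry is hypothetical.

```python
import re

# Small hand-written stop-word set (an assumption standing in for NLTK's corpus).
STOP_WORDS = {"a", "an", "the", "me", "i", "for", "of", "with", "to", "please"}

# Hypothetical keyword table mapping each intent to its trigger words.
INTENT_KEYWORDS = {
    "generate_cardiac_profile": {"cardiac", "heart"},
    "generate_diabetes_profile": {"diabetes", "diabetic"},
}


def recognize_intent(text: str) -> str:
    tokens = re.findall(r"[a-z0-9]+", text.lower())        # tokenization + lowercasing
    keywords = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    best_intent, best_hits = "unknown", 0
    for intent, vocab in INTENT_KEYWORDS.items():          # keyword matching
        hits = sum(1 for t in keywords if t in vocab)
        if hits > best_hits:
            best_intent, best_hits = intent, hits
    return best_intent


print(recognize_intent("Give me a cardiac case"))  # generate_cardiac_profile
```

Inputs with no matching keyword fall through to "unknown", which gives the chatbot a natural point to ask the user to rephrase.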

Step 2: Entity Extraction. Once the intent is known, the system scans for constraints (entities) such as age, gender, or specific conditions.

Code Logic: If the user specifies "Male, 45 years," these values override the default random generators. If no values are provided, the system defaults to random sampling within realistic bounds (e.g., Age: 18–90).
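The override-or-default behavior described above might look like the following. This is a sketch under assumptions: the regular expressions and the function name `extract_entities` are illustrative, and only the 18–90 age bounds come from the text.

```python
import random
import re

AGE_PATTERN = re.compile(r"\b(\d{1,3})\s*-?\s*(?:years?|yrs?)\b", re.IGNORECASE)
GENDER_PATTERN = re.compile(r"\b(male|female)\b", re.IGNORECASE)


def extract_entities(text: str, rng: random.Random) -> dict:
    """User-supplied values win; anything missing is sampled within bounds."""
    age_match = AGE_PATTERN.search(text)
    gender_match = GENDER_PATTERN.search(text)
    return {
        "age": int(age_match.group(1)) if age_match else rng.randint(18, 90),
        "gender": (gender_match.group(1).capitalize() if gender_match
                   else rng.choice(["Male", "Female"])),
    }


rng = random.Random(42)  # seeded only so the demo is repeatable
print(extract_entities("Male, 45 years", rng))  # {'age': 45, 'gender': 'Male'}
print(extract_entities("make a case", rng))     # both fields sampled
```

The word boundary in `GENDER_PATTERN` prevents "female" from being read as "male".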

Step 3: Statistical Data Synthesis (The Generator). This is the most critical module. Instead of generating purely random data, we enforced Constrained Probabilistic Sampling.

Medical Logic: We defined rule-based dependencies to ensure medical coherence.

Rule A: If Condition = "Hypertension", Blood Pressure is sampled from a condition-specific distribution rather than the normal range.

Rule B: If Gender = "Male", pregnancy-related fields are excluded.
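Rules A and B can be illustrated with a short sampler. This sketch uses Python's standard `random` module rather than NumPy to stay dependency-free. The distribution parameters, the clamping bounds, and the 5% pregnancy probability are placeholders chosen for illustration; they are not values from the paper and not clinical guidance.

```python
import random

# Illustrative (mean, std dev, low, high) in mmHg; placeholder numbers only.
BP_RULES = {
    "default": {"systolic": (115, 8, 90, 129), "diastolic": (75, 6, 60, 84)},
    "Hypertension": {"systolic": (155, 10, 140, 200), "diastolic": (98, 6, 90, 120)},
}


def sample_bounded(rng: random.Random, mean: float, sd: float,
                   low: int, high: int) -> int:
    """Draw from a normal distribution, then clamp the result into [low, high]."""
    return min(high, max(low, round(rng.gauss(mean, sd))))


def synthesize(entities: dict, rng: random.Random) -> dict:
    record = dict(entities)
    # Rule A: the condition selects which distribution blood pressure comes from.
    rules = BP_RULES.get(record.get("condition"), BP_RULES["default"])
    record["systolic"] = sample_bounded(rng, *rules["systolic"])
    record["diastolic"] = sample_bounded(rng, *rules["diastolic"])
    # Rule B: pregnancy-related fields are never added for male patients.
    if record.get("gender") != "Male":
        is_pregnant = rng.random() < 0.05  # placeholder probability
        record["pregnancy_status"] = "pregnant" if is_pregnant else "not pregnant"
    return record


rng = random.Random(7)  # seeded only so the demo is repeatable
print(synthesize({"age": 45, "gender": "Male", "condition": "Hypertension"}, rng))
```

Clamping keeps every sample inside the range its rule allows, so the constraint holds for all records, not just on average.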

Step 4: Output Formatting. The generated data dictionary is converted into a structured JSON format or a Pandas DataFrame for display in the frontend.
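A minimal sketch of the formatting step follows. It covers the JSON path with the standard library and substitutes a plain-text table for the Pandas DataFrame so the example has no third-party dependencies. The helper names `to_json` and `to_table` are hypothetical, and the sample values echo the example fields mentioned in Section 5.1.

```python
import json


def to_json(record: dict) -> str:
    """Serialize the generated dictionary for download or API responses."""
    return json.dumps(record, indent=2)


def to_table(record: dict) -> str:
    """Render the same dictionary as a two-column text table for the chat window."""
    width = max(len("Field"), max(len(str(key)) for key in record))
    lines = [f"{'Field'.ljust(width)} | Value", f"{'-' * width}-+-------"]
    lines += [f"{str(key).ljust(width)} | {value}" for key, value in record.items()]
    return "\n".join(lines)


record = {"Age": 60, "Gender": "Male", "BP": "160/95", "Heart Rate": 88}
print(to_json(record))
print(to_table(record))
```

With Pandas available, the table path would typically build a one-row DataFrame from the same dictionary and pass it to the frontend for display.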

(Note: User Input -> NLP Processor -> Logic Engine (Rules + Faker) -> Database -> User UI)

5. RESULTS and DISCUSSIONS

5.1 Functional Output (UI)

The system was successfully deployed on a local server. Users can interact with the chatbot to generate records instantly.

Figure 2: Implementation Logic
Figure 3: Chat bot Welcome Screen

[Showing the opening screen of our chatbot, with a message such as "Hello! I can generate synthetic medical records for you. Try asking: 'Create a diabetes patient profile'."]

[User: "Generate a record for a 60 year old male with heart issues." The bot replies with a table/JSON containing fields like Name, Age: 60, BP: 160/95, Heart Rate: 88, etc.]

5.2 Performance Analysis

We evaluated the system based on three crucial criteria: Response Time, Intent Accuracy, and Data Realism.

Response Latency: The average time to generate a complete patient profile (approx. 20 attributes) was recorded at 0.85 seconds, making the tool efficient enough for real-time educational use.

Intent Recognition Accuracy: In a test set of 50 varied queries (e.g., "make a case," "I need data for fever," "female 20 years old"), the system correctly identified the user's intent 94% of the time.

Data Validity Check: A logic validation script was run on 1,000 generated records to check for impossible combinations (e.g., a 5-year-old with a driver's license or mismatched vital signs). The system achieved a 98% logical consistency rate, with minor errors only in edge cases.
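A logic validation script of the kind described above might be structured as follows. This is a sketch, not the script used in the evaluation: the field names, the three rules, and the assumed minimum driving age of 16 are all illustrative.

```python
MIN_DRIVING_AGE = 16  # assumed threshold for illustration; varies by jurisdiction


def find_violations(record: dict) -> list:
    """Return a description of every impossible combination found in one record."""
    violations = []
    if record.get("has_drivers_license") and record.get("age", 0) < MIN_DRIVING_AGE:
        violations.append("driver's license below minimum driving age")
    if record.get("gender") == "Male" and record.get("pregnancy_status") == "pregnant":
        violations.append("male patient marked as pregnant")
    systolic, diastolic = record.get("systolic"), record.get("diastolic")
    if systolic is not None and diastolic is not None and systolic <= diastolic:
        violations.append("systolic pressure not above diastolic")
    return violations


def consistency_rate(records: list) -> float:
    """Fraction of records that pass every rule."""
    if not records:
        return 1.0
    passed = sum(1 for record in records if not find_violations(record))
    return passed / len(records)


sample = [
    {"age": 45, "gender": "Male", "systolic": 150, "diastolic": 95},
    {"age": 5, "has_drivers_license": True, "systolic": 100, "diastolic": 65},
    {"age": 30, "gender": "Female", "systolic": 118, "diastolic": 76},
    {"age": 62, "gender": "Male", "systolic": 80, "diastolic": 90},
]
print(consistency_rate(sample))  # 0.5
```

Collecting violation messages per record, rather than returning a bare pass/fail flag, makes it easier to see which rules account for the edge-case failures.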

Figure 4: Generated Record Table

Table 2: Generation Latency Metrics

5.3 Comparison with Existing Tools

Compared to existing tools like Mockaroo (Graphical User Interface (GUI)-based) or Faker (code-based), our system provides a Zero-Code experience with Context Awareness, adjusting symptoms dynamically based on the requested disease.

6. CONCLUSION

Access to high-quality medical data is essential for effective healthcare education and research. However, strict ethical and legal rules about patient privacy limit its accessibility. In this work, we proposed SynthoMed AI, a chatbot that generates synthetic medical records using NLP and data generation algorithms to create patient profiles on demand. Testing showed the system worked well, achieving an intent recognition accuracy of 94%. This project shows that generating synthetic data does not have to be a complex process reserved for data scientists. By using a conversational interface, we have made medical data more easily accessible. This resource is valuable for students, developers, and researchers, allowing them to test their skills and systems without compromising patient privacy.

6.1 Future Developments

Future research will involve expanding the range of medical conditions and specialties covered in the guidelines, as well as adding support for multiple languages. Although the system currently performs well, there are many opportunities for improvement. We need a validation framework to ensure clinical realism and a secure API for external use. Customizable templates and data export options will further enhance its usability.

7. FUTURE SCOPE AND IMPROVEMENTS

Although the current model is able to accomplish its primary tasks and achieve its objectives, we believe that further improvements can significantly enhance its realism and usefulness in the following ways:

Integration of Medical Imaging (Multimodal Generation): The chatbot currently generates synthetic patient records. Future versions may incorporate diffusion models or Generative Adversarial Networks (GANs) to produce synthetic medical images (such as X-ray, MRI, or CT images) that are consistent with the text-based diagnosis, thereby creating a fully comprehensive patient dataset.

Adoption of Large Language Models (LLMs): The rule-based NLP libraries should be upgraded to advanced LLMs (like GPT-4 or open-source LLaMA models), which would significantly enhance the chatbot's ability to understand advanced, complex medical queries and generate unstructured clinical notes (e.g., doctor's discharge summaries) with higher fluency.

Standardization (HL7/FHIR Support): To make the data more useful for testing real-world hospital software, future work will focus on exporting synthetic records in industry-standard formats, such as Health Level Seven (HL7) or Fast Healthcare Interoperability Resources (FHIR).

Voice-Enabled Interface: Adding Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities would make the system more accessible, allowing users to interact with the bot verbally, simulating a real-world dictation scenario.

Federated Learning Integration: Federated learning could improve the statistical realism of the synthetic records without accessing raw data directly. This would allow the model to learn patterns from real hospital data locally without the data ever leaving the hospital's secure servers.

ACKNOWLEDGEMENT

We extend our sincere gratitude to Assistant Professor Mary Anita T for her continuous support, invaluable guidance, and helpful feedback throughout this project. We also thank the Department of Artificial Intelligence and Machine Learning at The Oxford College of Engineering, affiliated with Visvesvaraya Technological University (VTU), for providing the necessary resources and infrastructure and for creating the academic framework for this research. A special thank you to our colleagues and test users, whose feedback significantly improved the usability and effectiveness of the system.

REFERENCES

[1] L. Sweeney, "k-anonymity: A model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.

[2] C. Dwork, "Differential Privacy," in 33rd International Colloquium on Automata, Languages and Programming (ICALP), Venice, Italy, 2006.

[3] K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, "A systematic review of re-identification attacks on health data," PLoS One, 2011.

[4] L. Rocher, J. M. Hendrickx, and Y. A. de Montjoye, "Estimating the success of re-identifications in incomplete datasets using generative models," Nature Communications, p. 3069, 2019.

[5] A. Tucker, Z. Wang, U. Rayson, and G. H. Collins, "Generating synthetic record for healthcare applications," Artificial Intelligence in Medicine, p. 101744, 2020.

[6] E. Choi, S. Biswal, J. Duke, W. F. Stewart, and J. Sun, "Generating Multi-label Discrete Patient Records Using GANs," in Proceedings of the 2nd ML for Healthcare Conference, 2017, pp. 286-305.

[7] A. Torfi and E. A. Fox, "CorGAN: Correlation-Capturing Convolutional GANs for Generating Synthetic Healthcare Records," in Proceedings of the 33rd International FLAIRS Conference, 2020.

[8] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, and T. Liu, "BioGPT: Generative pre-trained transformer for biomedical text generation and mining," Briefings in Bioinformatics, 2022.

[9] K. Huang, J. Altosaar, and R. Ranganath, "ClinicalBERT: Modeling Clinical Notes & Predicting Hospital Readmission," arXiv preprint arXiv:1904.05342, 2019.

[10] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proceedings of NAACL-HLT, 2019, pp. 4171-4186.

[11] L. Laranjo et al., "Conversational Agents in Healthcare: A Systematic Review," Journal of the American Medical Informatics Association, pp. 1248-1258, 2018.

[12] L. Tudor Car et al., "Conversational agents in health care: scoping review and evidence map," Journal of Medical Internet Research, p. e17158, 2020.

[13] A. N. A. Tlili, F. Essalmi, and M. Jemni, "A Smart Chatbot for Educational Context," in IEEE 18th International Conference on Advanced Learning Technologies (ICALT), 2018.

[14] O. Seneviratne, D. McGuinness, and J. Goncalves, P. Ray, "Evaluation and generating synthetic medical data," 2020.

[15] N. Patwa et al., "AI Chatbots in Healthcare: A Review," Journal of Healthcare Engineering, vol. 2023, Article ID 9946821, 2023.