
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 13 Issue: 01 | Jan 2026 www.irjet.net p-ISSN: 2395-0072
SURYA G 1, MOHAMMED RASOOL R 2, SHRINANDA HK 3, MOHAMMED FLAH LAHORI 4, Asst. Prof. MARY ANITHA T 5
1,2,3,4 Dept. of Artificial Intelligence & Machine Learning Engineering, The Oxford College of Engineering, Bommanahalli, Bengaluru-68, Karnataka, India
5 Dept. of AIML Engineering, The Oxford College of Engineering, Bommanahalli, Bengaluru-68, Karnataka, India
***
Abstract - Access to authentic medical records is an essential part of healthcare education, yet it is often restricted due to strict patient privacy regulations and ethical concerns. This paper addresses that challenge by presenting a novel AI-driven chatbot system designed to generate high-fidelity synthetic medical records. The project methodology encompasses a complete development lifecycle, including data pre-processing, AI model training, chatbot integration, and final deployment. By leveraging Natural Language Processing (NLP) and probabilistic sampling, the system allows students and healthcare learners to query and generate realistic patient data scenarios without risking the exposure of sensitive personal information. The results demonstrate a functional, interactive tool that democratizes access to medical data for training purposes. Furthermore, the development process highlights the efficacy of combining deep learning with conversational interfaces to solve practical challenges in health informatics and technical education.
Key Words: Synthetic, AI Chatbot, Synthetic Medical Data (SMD), Healthcare Privacy, Generative Adversarial Networks (GANs), Natural Language Toolkit (NLTK), Synthetic Data Generation (SDG), Medical Informatics, Health Insurance Portability and Accountability Act (HIPAA), Personally Identifiable Information (PII), Electronic Health Records (EHR), Natural Language Processing (NLP).
1.1 Overview
The healthcare industry is undergoing rapid transformation, driven significantly by advances in data analysis and artificial intelligence (AI). High-quality medical data forms the essential foundation for numerous critical activities, such as training future healthcare professionals, supporting clinical research, and developing robust AI diagnostic tools. Electronic Health Records (EHRs) have become the digital standard, encapsulating patient histories, laboratory results, and treatment plans in a structured format. However, a persistent tension exists between the immense value of this data and the paramount need to protect patient privacy. This paper introduces a novel approach designed to navigate this tension: an AI-powered chatbot system named Syntho Med AI, which generates realistic synthetic medical records. By generating synthetic patient profiles that replicate the statistical and linguistic patterns of real data, without containing any actual personal information, this system aims to make high-quality medical data accessible to all for research, education and innovation, while upholding strict ethical standards of patient confidentiality.
1.2
Access to real-world medical records for educational and research purposes is restricted, creating a well-documented barrier to advancements in healthcare. This limitation arises from rigorous ethical and legal frameworks, most notably those surrounding the strict protection of Personally Identifiable Information (PII) through regulations such as the Health Insurance Portability and Accountability Act (HIPAA). These frameworks protect patient rights, but have the unintended effect of creating formidable barriers in the path of students, researchers, and developers. These individuals require rich, realistic datasets in order to hone analytical skills and test new software. The result is a scarcity of practical, legally compliant learning resources. This gap not only limits practical experience but also stifles innovation, highlighting the urgent need for a solution that can reconcile the dual imperatives of privacy preservation and data accessibility.
1.3 Objectives
The main goal of this project is to develop a reliable and secure system that provides equal access to medical data for learning and experimentation. We aim to construct a functional pipeline that processes user requests, generates medically plausible data, and delivers it in a user-friendly format. The system is designed to be robust, scalable, and suitable for deployment in test and educational environments.

To realize this vision, we have established three concrete objectives:
• To design and implement an intuitive conversational interface capable of understanding a wide range of medical queries posed in natural language.
• To create a robust data generation engine that generates comprehensive, statistically valid synthetic patient records based on user input and medical rules.
• To demonstrate a practical alternative to traditional methods by deploying a fully functional system that replaces the need for real or crudely anonymized data in educational and prototyping contexts.
Health informatics researchers have been trying to strike a balance between patient privacy and data analysis for over a decade. They have attempted to utilize de-identification frameworks, such as k-anonymity and differential privacy [1], [2]. However, attackers have successfully employed re-identification techniques, matching anonymous records with external sources and thereby disclosing patient information [3], [4].
The research community has turned to Synthetic Data Generation (SDG) as a safer alternative. SDG techniques create artificial datasets that mimic real-world data distributions without incorporating any genuine patient records [5]. Generative Adversarial Networks (GANs) have emerged as a leading architecture in this field. Models like MedGAN and CorGAN have generated realistic electronic health records and high-quality medical images, providing valuable resources for training and experimentation [6], [7].
Parallel progress in Natural Language Processing (NLP) has also transformed how unstructured medical text, including clinical narratives, examination notes, and discharge summaries, is interpreted and produced. Transformer-based architectures, including BERT, BioBERT, and BioGPT, have shown strong performance in understanding clinical vocabulary and contextual relationships, allowing for advanced text generation and information extraction in healthcare settings [8], [9], [10].
Conversational agents in healthcare have gained significant traction. Chatbots now assist in triaging patient symptoms, supporting mental health, providing medical instructions, and scheduling appointments [11], [12]. Educational chatbots are also used to enhance learner engagement and offer quick access to medical knowledge [13]. However, most data generation tools are technical and require users to employ scripts or coding [14]. This presents a problem for students, educators, and practitioners: they need realistic data but lack the ability to program. Recent work proposes combining generative AI models with intuitive chatbot interfaces, enabling users to synthesize medical data [15].
3.1 System Overview
The proposed system is an interactive, AI-powered chatbot designed to generate synthetic medical records on demand. Unlike static datasets or de-identified records that pose potential privacy risks, this system utilizes a generative AI model to create entirely fictitious yet statistically realistic patient profiles. The system serves as a bridge between complex data generation algorithms and end-users (students, researchers), allowing them to obtain specific medical data points, such as demographics, vital signs, and diagnosis histories, simply by conversing with the bot in natural language.

The system is built upon a modular architecture comprising three main layers:
1. User Interface (UI) Layer: A web-based chatbot interface (developed using frameworks like Streamlit or Flask) that captures user prompts. It focuses on accessibility, ensuring non-technical users can easily request data.
2. Application Logic Layer: This is the brain of the system. It contains the Natural Language Processing (NLP) engine that parses user queries to understand the intent, and it communicates with the generative model to fetch or create the required data attributes.
3. Data & Model Layer: This layer houses the trained AI models responsible for synthetic data generation and a NoSQL database for storing interaction logs and generated records for session continuity.

[The diagram shows an arrow flowing from User -> Chatbot Interface -> NLP Engine -> Generative Model -> Database -> back to User.]
The core IPO (Input → Processing → Output) functionality follows a linear data flow:
Input: The user provides a natural language prompt, such as "Create a record for a 45-year-old male with Type 2 Diabetes symptoms."
Processing:
1. Tokenization & Intent Recognition: The chatbot preprocesses the text, and the NLP engine extracts the key entities (Age: 45, Gender: Male, Condition: Diabetes).
2. Data Synthesis: The AI model uses these entities as seeds to generate complementary synthetic data (e.g., elevated blood sugar levels, specific medication names like Metformin) based on medical rules.
3. Validation: The system runs logic checks to ensure the generated data is medically coherent (e.g., ensuring a male patient is not assigned pregnancy-related conditions).
Output: The system presents the synthetic record in a structured JSON/Table format within the chat window, ready for the user to view or download.
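The validation stage of this flow can be illustrated with a small rule checker. This is a minimal sketch, not the system's actual code: it assumes a flat dictionary record, and the field names (`gender`, `age`, `conditions`), the `PREGNANCY_RELATED` set and the age bounds are all hypothetical.

```python
# Hypothetical condition labels; the paper does not list the actual vocabulary.
PREGNANCY_RELATED = {"gestational diabetes", "preeclampsia"}


def validate_record(record):
    """Return a list of human-readable coherence violations (empty if none)."""
    problems = []
    conditions = {c.lower() for c in record.get("conditions", [])}

    # The paper's example: a male patient must not carry pregnancy-related conditions.
    if record.get("gender", "").lower() == "male":
        for cond in sorted(conditions & PREGNANCY_RELATED):
            problems.append(f"male patient assigned pregnancy-related condition: {cond}")

    # Assumed extra check: age must be present and inside plausible bounds.
    age = record.get("age")
    if age is None or not 0 <= age <= 120:
        problems.append(f"implausible age: {age}")

    return problems
```

Returning a list of violations rather than a single boolean makes it possible to log which rule failed, which may help when tuning the generator.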
The operational workflow of the system is as follows:
1. User Initialization: The user accesses the web portal and initiates the chat session.
2. Query Submission: The user types a request for specific medical data scenarios.
3. Intent Analysis: The NLP module decodes the request parameters.
4. Generation Phase: The generative algorithm constructs the patient profile.
5. Response Delivery: The chatbot displays the generated record.
6. Iterative Refinement: The user can ask for modifications (e.g., "Add a high blood pressure reading"), and the system updates the record in real time.
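The iterative refinement step can be sketched as a function that applies a follow-up request to an existing record. This is an illustration under stated assumptions: `apply_refinement`, the `blood_pressure` field and the elevated ranges are hypothetical, and only the single example request from the workflow is handled.

```python
import copy
import random


def apply_refinement(record, request, rng=None):
    """Return an updated copy of `record` reflecting a follow-up request."""
    rng = rng or random.Random()
    updated = copy.deepcopy(record)  # keep the previous version intact
    text = request.lower()

    if "high blood pressure" in text:
        # Illustrative elevated ranges (mmHg); not clinical reference values.
        updated["blood_pressure"] = {
            "systolic": rng.randint(140, 180),
            "diastolic": rng.randint(90, 110),
        }
    return updated
```

Working on a copy means the earlier version of the record stays available, for example for an undo action or for the interaction log.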
3.5 Comparison and Improvements
This system introduces significant improvements over existing data access methods:
Table 1: Comparison Table
| Feature | Traditional Anonymization | Existing Static Datasets | Proposed AI Chatbot System |
| --- | --- | --- | --- |
| Privacy Risk | Medium (re-identification possible) | Low | Zero (data is 100% synthetic) |
| Accessibility | Restricted / Requires Approval | Public but limited variety | Instant / On-Demand |
| User Interface | Raw CSV / Database Files | Manual Search | Conversational (Natural Language) |
| Customization | None (what you see is what you get) | Low | High (generate exactly what you need) |
By shifting from static file retrieval to dynamic generation, the proposed system solves the "data scarcity" problem while ensuring strict adherence to ethical standards.
4. METHODOLOGY and IMPLEMENTATION
This section outlines the technical implementation of the system, detailing the software stack, algorithmic approach, and the step-by-step logic used to transform user queries into structured medical records.
4.1 Technology Stack
The system was developed using a Python-based ecosystem due to its support for data science and natural language processing libraries.
Programming Language: Python 3.9+
Frontend: Streamlit/ (Web Interface).
Backend: Flask (for API routing) / Python Native Scripts.
Natural Language Processing (NLP): NLTK (Natural Language Toolkit) and spaCy (used for text preprocessing, tokenization, and entity extraction).
Data Generation: Faker library (customized medical providers), NumPy and Pandas (probabilistic distribution).
Database: MongoDB (NoSQL).
4.2 Implementation Logic
The core logic follows a pipeline approach:
Step 1: Preprocessing & Intent Recognition
The user's text input is first cleaned (removal of stop words, lowercasing). We employed a keyword-spotting algorithm enhanced with NLTK to identify the core intent.
Algorithm: Tokenization -> Stop_Word_Removal -> Keyword_Matching.
Example: The input "Give me a cardiac case" is mapped to the intent generate_cardiac_profile.
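The Tokenization -> Stop_Word_Removal -> Keyword_Matching chain can be sketched as follows. To stay self-contained, this sketch uses a regular expression and a small hand-written stop-word set instead of NLTK. Only the `generate_cardiac_profile` label comes from the text; the other intent labels, the keyword table and the function name `recognize_intent` are assumptions.

```python
import re

# Small illustrative stop-word list; the system itself relies on NLTK's resources.
STOP_WORDS = {"a", "an", "the", "me", "i", "give", "need", "for",
              "please", "make", "with", "of"}

# Intent -> keywords. Only generate_cardiac_profile appears in the paper;
# the other labels are hypothetical.
INTENT_KEYWORDS = {
    "generate_cardiac_profile": {"cardiac", "heart"},
    "generate_diabetes_profile": {"diabetes", "diabetic"},
    "generate_generic_profile": {"case", "record", "patient", "profile"},
}


def recognize_intent(text):
    """Map a free-text request to an intent label, or 'unknown'."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())      # tokenization + lowercasing
    content = {t for t in tokens if t not in STOP_WORDS}  # stop-word removal
    # Keyword matching; dict order puts specific intents before the generic one.
    for intent, keywords in INTENT_KEYWORDS.items():
        if keywords & content:
            return intent
    return "unknown"
```

With this table, `recognize_intent("Give me a cardiac case")` returns `generate_cardiac_profile`, because the specific cardiac entry is checked before the generic one.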
Step 2: Entity Extraction
Once the intent is known, the system scans for constraints (entities) such as age, gender, or specific conditions.
Code Logic: If the user specifies "Male, 45 years," these values override the default random generators. If no values are provided, the system defaults to random sampling within realistic bounds (e.g., Age: 18–90).
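The override-or-default behaviour described for Step 2 can be sketched with two regular expressions. This is a hedged illustration, not the system's extractor (the paper names NLTK and spaCy for that role). The function name `extract_entities` and the patterns are assumptions, and only age and gender are handled.

```python
import random
import re


def extract_entities(text, rng=None):
    """Pull age and gender from the query; sample defaults for anything missing."""
    rng = rng or random.Random()
    lowered = text.lower()

    # Matches "45 years", "45-year-old", "45 year old", "45 yrs".
    age_match = re.search(r"\b(\d{1,3})[\s-]*(?:years?|yrs?)\b", lowered)
    # Word boundaries keep "male" from matching inside "female".
    gender_match = re.search(r"\b(female|male)\b", lowered)

    # User-supplied values override the random defaults (Age: 18-90 as in the text).
    age = int(age_match.group(1)) if age_match else rng.randint(18, 90)
    if gender_match:
        gender = gender_match.group(1).capitalize()
    else:
        gender = rng.choice(["Male", "Female"])
    return {"age": age, "gender": gender}
```

Passing a seeded `random.Random` instance makes the sampled defaults reproducible, which is convenient in tests.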
Step 3: Statistical Data Synthesis (The Generator)
This is the most critical module. Instead of purely random data, we enforced Constrained Probabilistic Sampling.
Medical Logic: We defined rule-based dependencies to ensure medical coherence.
Rule A: If Condition = "Hypertension", BloodPressure is sampled from a distribution rather than the normal range.
Rule B: If Gender = "Male", pregnancy-related fields are excluded.
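Rules A and B can be illustrated with a small sampler. The paper does not report its distribution parameters, so the means and standard deviations below are placeholders chosen only to make the example run. The sketch uses the standard-library `random.gauss` in place of NumPy, and `sample_vitals`, `BP_PARAMS` and `pregnancy_status` are hypothetical names.

```python
import random

# Assumed (mean, standard deviation) for blood pressure in mmHg.
# Illustrative values only; the paper does not report its parameters.
BP_PARAMS = {
    "normal": {"systolic": (115, 8), "diastolic": (75, 6)},
    "hypertension": {"systolic": (155, 10), "diastolic": (98, 6)},
}


def sample_vitals(condition, gender, rng=None):
    """Sample a partial record whose vitals depend on the requested condition."""
    rng = rng or random.Random()
    # Rule A: hypertension switches blood pressure to the elevated distribution.
    key = "hypertension" if condition.lower() == "hypertension" else "normal"
    params = BP_PARAMS[key]
    record = {
        "condition": condition,
        "gender": gender,
        "systolic": round(rng.gauss(*params["systolic"])),
        "diastolic": round(rng.gauss(*params["diastolic"])),
    }
    # Rule B: pregnancy-related fields are only created for non-male patients.
    if gender.lower() != "male":
        record["pregnancy_status"] = "unknown"  # placeholder value
    return record
```

Encoding the dependency in the choice of distribution, rather than rejecting bad samples afterwards, keeps generation a single pass.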
Step 4: Output Formatting
The generated data dictionary is converted into a structured JSON format or a Pandas DataFrame for display in the frontend.
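The output formatting step can be sketched with the standard library alone. The snippet is illustrative: it uses `json` for the JSON view and a plain-text table in place of the Pandas DataFrame named above, so that it runs without third-party packages. `to_json` and `to_table` are hypothetical helper names.

```python
import json


def to_json(record):
    """Serialize the record dictionary as pretty-printed JSON."""
    return json.dumps(record, indent=2, ensure_ascii=False)


def to_table(record):
    """Render the record as a two-column plain-text table."""
    width = max(len(str(key)) for key in record)
    return "\n".join(f"{str(key).ljust(width)} | {value}"
                     for key, value in record.items())
```

In a Streamlit frontend, the same dictionary could instead be passed to the framework's own JSON or table widgets.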

(Note: User Input -> NLP Processor -> Logic Engine (Rules + Faker) -> Database -> User UI)
5. RESULTS and DISCUSSIONS
5.1 Functional Output (UI)
The system was successfully deployed on a local server. Users can interact with the chatbot to generate records instantly.


[Showing the opening screen of our chatbot, with a message such as "Hello! I can generate synthetic medical records for you. Try asking: 'Create a diabetes patient profile'."]

[User: "Generate a record for a 60-year-old male with heart issues." The bot replies with a table/JSON containing fields like Name, Age: 60, BP: 160/95, Heart Rate: 88, etc.]

5.2 Performance Analysis
We evaluated the system based on three crucial criteria: Response Time, Intent Accuracy, and Data Realism.
Response Latency: The average time to generate a complete patient profile (approx. 20 attributes) was recorded at 0.85 seconds, making the tool highly efficient for real-time educational use.
Intent Recognition Accuracy: In a test set of 50 varied queries (e.g., "make a case," "I need data for fever," "female 20 years old"), the system correctly identified the user's intent 94% of the time.
Data Validity Check: A logic validation script was run on 1,000 generated records to check for impossible combinations (e.g., a 5-year-old with a driver's license or mismatched vital signs). The system achieved a 98% logical consistency rate, with minor errors only in edge cases.
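A logical consistency rate of this kind can be computed as the fraction of records that pass every check. The sketch below is not the authors' validation script: the field names, the licensing age of 16 and the blood pressure check are assumptions used to show the calculation.

```python
def is_consistent(record):
    """Apply a few illustrative coherence checks to one generated record."""
    # The paper's example: a young child must not hold a driver's license.
    # The cut-off of 16 is an assumption; licensing ages vary by jurisdiction.
    if record.get("has_drivers_license") and record.get("age", 0) < 16:
        return False
    # Mismatched vital signs: systolic pressure must exceed diastolic.
    if record.get("systolic", 1) <= record.get("diastolic", 0):
        return False
    return True


def consistency_rate(records):
    """Fraction of records that pass every check (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(is_consistent(r) for r in records) / len(records)
```

Under this definition, a 98% rate on 1,000 records corresponds to 980 records passing all checks.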

Table 2: Generation Latency Metrics
Compared to existing tools like Mockaroo (GUI-based) or Faker (code-based), our system provides a zero-code experience with context awareness, adjusting symptoms dynamically based on the requested disease.
Access to high-quality medical data is essential for effective healthcare education and research. However, strict ethical and legal rules about patient privacy limit its accessibility. In this work, we proposed SynthoMed AI, a chatbot that generates synthetic medical records using NLP and data generation algorithms to create patient profiles on demand. Testing showed the system worked well, achieving an intent recognition accuracy of 94%. This project shows that generating synthetic data does not have to be a complex process reserved for data scientists. By using a conversational interface, we have made medical data more easily accessible. This resource is valuable for students, developers, and researchers, allowing them to test their skills and systems without compromising patient safety.
Future research will involve expanding the range of medical conditions and specialties covered in the guidelines, as well as adding support for multiple languages. Although the system currently performs well, there are many opportunities for improvement. We need a validation framework to ensure clinical realism and a secure API for external use. Customizable templates and data export options will further enhance its usability.
Although the current model is able to accomplish its primary tasks and achieve its objectives, we believe that further improvements can significantly enhance its realism and usefulness in the following ways:
Integration of Medical Imaging (Multimodal Generation): The chatbot currently generates text-based synthetic patient records. Future versions may include diffusion models or Generative Adversarial Networks (GANs) to produce synthetic medical images (such as X-ray, MRI, or CT images) that are consistent with the text-based diagnosis, thereby creating a fully comprehensive patient dataset.
Adoption of Large Language Models (LLMs): The rule-based NLP libraries could be upgraded to advanced LLMs (like GPT-4 or open-source LLaMA models), which would significantly enhance the chatbot's ability to understand complex medical queries and generate unstructured clinical notes (e.g., a doctor's discharge summaries) with higher fluency.
Standardization (HL7/FHIR Support): To make the data more useful for testing real-world hospital software, future work will focus on exporting synthetic records in industry-standard formats, such as Health Level Seven (HL7) or Fast Healthcare Interoperability Resources (FHIR).
Voice-Enabled Interface: Adding Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities would make the system more accessible, allowing users to interact with the bot verbally, simulating a real-world dictation scenario.
Federated Learning Integration: Federated learning could improve the statistical realism of the synthetic records without accessing raw data directly. This would allow the model to learn patterns from real hospital data locally without the data ever leaving the hospital's secure servers.
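As an illustration of the HL7/FHIR export listed among these future directions, a flat synthetic record could be mapped to a minimal FHIR Patient resource. This is a sketch of one possible mapping, not a planned design: the input keys (`name`, `age`, `gender`) are hypothetical. Because the record stores an age while FHIR stores a birth date, only a year-only `birthDate` is derived here, which is itself an assumption.

```python
from datetime import date


def to_fhir_patient(record, today=None):
    """Map a flat synthetic record to a minimal FHIR Patient resource (as a dict)."""
    today = today or date.today()
    # FHIR stores a birth date, not an age; only the birth year is derived here.
    birth_year = today.year - record["age"]
    # Treat the last word of the name as the family name (a simplification).
    given, _, family = record["name"].rpartition(" ")
    return {
        "resourceType": "Patient",
        "name": [{"family": family, "given": given.split()}],
        "gender": record["gender"].lower(),  # FHIR codes: male | female | other | unknown
        "birthDate": str(birth_year),        # FHIR dates may be year-only (YYYY)
    }
```

The resulting dictionary can be serialized with `json.dumps` for exchange. A fuller export would also need Observation resources for vital signs, which this sketch does not cover.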

We extend our sincere gratitude to Assistant Professor Mary Anita T for her continuous support, invaluable guidance, and helpful feedback throughout this project. We also thank the Department of Artificial Intelligence and Machine Learning at The Oxford College of Engineering, affiliated with Visvesvaraya Technological University (VTU), for providing the necessary resources, infrastructure, and academic framework for this research. A special thank you to our colleagues and test users, whose feedback significantly improved the usability and effectiveness of the system.
[1] L. Sweeney, "k-anonymity: A model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.
[2] C. Dwork, "Differential Privacy," in 33rd International Colloquium on Automata, Languages and Programming (ICALP), Venice, Italy, 2006.
[3] K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, "A systematic review of re-identification attacks on health data," PLoS One, 2011.
[4] L. Rocher, J. M. Hendrickx, and Y. A. de Montjoye, "Estimating the success of re-identifications in incomplete datasets using generative models," Nature Communications, p. 3069, 2019.
[5] A. Tucker, Z. Wang, U. Rayson, and G. H. Collins, "Generating synthetic record for healthcare applications," Artificial Intelligence in Medicine, p. 101744, 2020.
[6] E. Choi, S. Biswal, J. Duke, W. F. Stewart, and J. Sun, "Generating Multi-label Discrete Patient Records Using GANs," in Proceedings of the 2nd ML for Healthcare Conference, 2017, pp. 286-305.
[7] A. Torfi and E. A. Fox, "CorGAN: Correlation-Capturing Convolutional GANs for Generating Synthetic Healthcare Records," in Proceedings of the 33rd International FLAIRS Conference, 2020.
[8] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, and T. Liu, "BioGPT: Generative pre-trained transformer for biomedical text generation and mining," Briefings in Bioinformatics, 2022.
[9] K. Huang, J. Altosaar, and R. Ranganath, "ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission," arXiv preprint arXiv:1904.05342, 2019.
[10] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proceedings of NAACL-HLT, 2019, pp. 4171-4186.
[11] L. Laranjo et al., "Conversational Agents in Healthcare: A Systematic Review," Journal of the American Medical Informatics Association, pp. 1248-1258, 2018.
[12] L. Tudor Car et al., "Conversational agents in health care: scoping review and evidence map," Journal of Medical Internet Research, p. e17158, 2020.
[13] A. N. A. Tlili, F. Essalmi, and M. Jemni, "A Smart Chatbot for Educational Context," in IEEE 18th International Conference on Advanced Learning Technologies (ICALT), 2018.
[14] O. Seneviratne, D. McGuinness, and J. Goncalves, P. Ray, "Evaluation and generating synthetic medial data," 2020.
[15] N. Patwa et al., "AI Chatbots in Healthcare: A Review," Journal of Healthcare Engineering, vol. 2023, Article ID 9946821, 2023.