
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 12 Issue: 06 | Jun 2025 www.irjet.net p-ISSN: 2395-0072

Real-Time Code Plagiarism Detection Using NLP and Machine Learning for Academic and Industry Applications

1Master of Technology, Computer Science and Engineering, Lucknow Institute of Technology, Lucknow, India

2Assistant Professor, Department of Computer Science and Engineering, Lucknow Institute of Technology, Lucknow, India

Abstract - Code plagiarism has become a complex problem for research and development in today's academic and industrial environments. Older detection systems such as MOSS and JPlag focus on syntactic similarity, which limits their effectiveness against advanced plagiarism techniques such as paraphrasing, variable renaming, and AI-generated code. This research implements a language-agnostic, real-time code plagiarism detection framework that combines Natural Language Processing (NLP) and Machine Learning (ML), specifically CodeBERT, Graph Neural Networks (GNNs), and Abstract Syntax Trees (ASTs). The system adopts a hybrid structure that analyzes both the syntactic and the semantic characteristics of source code in multiple languages, including Python, Java, and C++. Through integration with technologies such as Apache Kafka, FastAPI, Redis, and Kubernetes, the proposed solution supports real-time code submissions with a response time of under 150 milliseconds and high throughput.

More than 15,000 code samples were used to train and validate the model, comprising industrial and academic code, synthetically obfuscated code, and AI-generated code. Evaluation performance improved considerably over traditional tools, with an F1-score of 91 percent in academic settings and 88 percent in industrial settings. Case studies indicate that academic dishonesty fell by 30 percent and that license violations were detected in enterprise codebases. The contribution of this work lies not only in advancing the state of the art in code similarity detection but also in enabling ethical, scalable enforcement of originality in education and in professional software development.

Key Words: Code Plagiarism Detection, Natural Language Processing (NLP), Machine Learning (ML), CodeBERT, Abstract Syntax Tree (AST), Graph Neural Networks (GNN), Real-Time Detection, Cross-Language Analysis.

1. INTRODUCTION

In recent years, reuse, adaptation, and reproduction of source code have increased significantly with the exponential growth of software development in both academia and industry. Although code sharing and collaboration are core elements of contemporary development practice, they introduce the multifaceted problem of preserving intellectual property and academic integrity. Intentional or unintended plagiarism of source code jeopardizes the validity of academic assessment and the intellectual property of commercial software systems. As institutions and organizations increasingly depend on coding assignments, shared repositories, and continuous development processes, there is an urgent need for high-fidelity, fully automated plagiarism detection systems that can handle the evolving landscape of code reuse.

1.1 Motivation

This research is motivated by the rising incidence and sophistication of code plagiarism in academia and industry. In the educational system, students reproduce programming projects taken from fellow students, from forums, or from AI-assisted tools such as GitHub Copilot without attribution. This undermines learning outcomes and invalidates assessment. On the industrial side, unlicensed reuse of proprietary code, reverse engineering, and non-compliance with open-source permissions can create severe risks, including lawsuits and reputational damage. The open-source ecosystem, although a model of innovation, is frequently abused by developers who do not follow licensing guidelines. These challenges create the need for highly capable plagiarism detection mechanisms that are not restricted to mere syntactic matching.

Figure-1: Evaluation of Plagiarism Detection

1.2 Problem Statement

Although plagiarism-checking tools such as MOSS and JPlag exist, the available detection systems have limited scope. They are mostly syntax-driven tools that perform string checks or token comparisons and are ineffective at finding plagiarism in which the underlying logic is preserved but obfuscated by renaming variables, altering control structures, or rephrasing code. In addition, the majority of these systems are language-specific; they do not scale and are not applicable in environments where several programming languages are present. Real-time feedback, an important feature for processes such as classroom assessment and CI/CD pipelines, is also almost nonexistent in current solutions. The lack of semantic recognition and language-independent detection renders such tools unsuitable for large-scale, fast-paced academic or industrial environments.

1.3 Research Objectives

Given these limitations, the main goal of the present study is to develop a real-time, semantic-aware, and language-independent plagiarism detection framework. The proposed system investigates not only the syntactic but also the semantic features of source code, allowing it to recognize advanced plagiarism techniques that cannot be identified by conventional tools. The framework is designed to work across programming languages such as Python, Java, and C++ so that it can be adopted widely in varied academic programs and enterprise code repositories. Another fundamental aim is to provide this functionality in real time, so that educators and software reviewers receive immediate feedback through integration with tools such as learning management systems and version control workflows.

1.4 Contributions

A novel contribution of this study is the combination of rule-based and machine learning approaches to the code similarity recognition problem. The system combines NLP models such as CodeBERT with graph models such as Graph Neural Networks (GNNs) to capture the structural and functional properties of code. Code structure is normalized through Abstract Syntax Trees (ASTs), and semantic embeddings are applied to determine logical equivalence even in complex or obfuscated code and in code produced by artificial intelligence. FastAPI provides the asynchronous web layer, Apache Kafka provides stream-based ingestion, and Kubernetes provides dynamic resource management, together ensuring that the system operates in real time and scales in deployment. The infrastructure is tested extensively against scholarly corpora and real-world codebases, showing its capacity to identify sophisticated instances of plagiarism with high precision, low response time, and excellent scalability. These contributions make the proposed solution a viable, creative, and technically sound development in the field of code plagiarism detection.

2. RELATED WORK

With growing concern for academic integrity and the protection of intellectual property in the software market, detecting plagiarism in source code has become an active field of research. Over the years, researchers have developed different ways to detect copied or reorganized code, ranging from syntactic methods to newer approaches based on machine learning and natural language processing. These approaches, however, vary widely in how well they capture semantic meaning, handle obfuscation, scale effectively, and operate across many programming languages.

2.1 Traditional Plagiarism Detection Tools

Early plagiarism detection systems focused on syntactic similarity through token-based analysis. A well-known tool of this kind is MOSS (Measure of Software Similarity), which transforms source code into sequences of tokens and then finds similarities between submissions through winnowing algorithms. MOSS works well for exact or near-exact copies, especially when the code structure is modified only in minor ways. Similarly, JPlag is another classical method that improves detection by building Abstract Syntax Trees (ASTs) to match the structural representation of code instead of raw tokens. This gives a clearer picture of logical structure and represents similarity at a level deeper than formatting. Other methods rely on software metrics such as cyclomatic complexity, token counts, and execution paths to identify submissions that deviate from expected programming behavior. Although these techniques have strengths, especially their speed and simplicity, they fail most of the time at locating semantic similarity, particularly when code is obfuscated or when the same functionality is rewritten using different logic.


2.2 NLP and Machine Learning in Code Understanding

With the availability of very large-scale programming data and the relative maturity of NLP and machine learning, more complex models have been created to capture the features of source code. The goal of these methods is to rise above surface syntax and instead obtain meaningful descriptions of code semantics. code2vec uses neural networks to translate AST paths into dense embeddings so that functions can be compared based on their semantics. This was a serious break with rule-based approaches, introducing a model capable of overcoming differences in coding style and implementation. CodeBERT, a BERT adaptation for programming languages, enabled contextual understanding across programming languages by training on very large corpora of code from GitHub. It facilitates semantic comparison through its ability to capture relationships between identifiers, operations, and functions. Graph Neural Networks (GNNs) further advanced the area through network representations of control flow and data dependency in code. GNNs can recognize code similarity even when program logic undergoes significant restructuring, because they model the execution paths of functions rather than token sequences. These achievements also laid the groundwork for cross-language plagiarism detection. For example, models such as CodeT5 and CLCDSA allow equivalence checking across languages (for instance Python and Java) by aligning their semantic representations, so that plagiarism can be detected even when code has been translated between languages.

2.3 Limitations of Existing Approaches

Although both rule-based and learning-based approaches have come a long way, current plagiarism detection systems still suffer from critical shortcomings. The biggest is that they do not reliably identify paraphrased or AI-generated code. The most commonly used tools, such as MOSS and JPlag, rely heavily on syntactic similarity and are ineffective when plagiarized code has been reworded to disguise its origin. Even machine learning models that have proven successful at semantic comparison have difficulty detecting code generated by tools like GitHub Copilot, which is novel in syntax but misleading in logic. A further weakness is that most detection frameworks lack real-time processing capability. Many sophisticated systems are based on batch processing and are not appropriate for dynamic settings such as live classroom evaluation or CI pipelines. Language dependency is another bottleneck: tools trained on single-language corpora do not generalize across programming languages, and the lack of universal parsers and representations makes it hard to construct scalable, cross-language detectors. Scalability itself is also difficult, particularly for the large codebases common in enterprise software; many current systems cannot handle millions of lines of code without becoming slow or memory-hungry. In short, current methods provide meaningful means of basic detection, but they fail to satisfy the requirements of real-time, multi-language, semantically rich code plagiarism analysis in contemporary academic and industrial environments.

3. METHODOLOGY

To build a solid, scalable system capable of detecting code plagiarism in real time, a hybrid approach has been adopted that combines older syntactic plagiarism detection schemes with newer machine learning based semantic analysis of code. The structure is modular, so that the preprocessing, feature extraction, semantic embedding, and similarity reporting components can each be extended independently. Designed for scalability and real-time responsiveness, the system is applicable not only in academic settings but also in industrial workflows.

3.1 System Architecture

The proposed architecture is based on a modular pipeline that separates each processing step into an independently managed service. The pipeline begins with data ingestion through REST APIs built on FastAPI, which accept code snippets asynchronously and in different programming languages. A Kafka message broker receives the submitted data and allows the stream processing stages to scale by distributing the workload across multiple consumers. Preprocessing services tokenize and normalize the code, and feature extraction modules produce syntactic and semantic representations using tools such as CodeBERT, Code2Vec, and Graph Neural Networks (GNNs). These representations are then compared for similarity, and the results are stored and served from Redis to minimize latency. The outputs of the reporting modules can be shown in a dashboard or via API endpoints. To guarantee performance and deployment flexibility, the entire architecture is containerized with Docker and horizontally scaled with Kubernetes in response to system load.


Table-1: Technology Used.

Component | Technology Used | Purpose
API Gateway | FastAPI | Asynchronous code submission
Stream Processing | Apache Kafka | Parallel code distribution
Data Storage | Redis | Caching results and embeddings
Containerization | Docker | Isolated microservices
Orchestration | Kubernetes | Auto-scaling and deployment management

3.2 Data Collection and Annotation

To support training and assessment across the broadest possible range of cases, the dataset used in the present study was assembled from three major sources: academic submissions, industrial repositories, and synthetically generated plagiarism cases. More than 10,000 anonymized student submissions in Python, Java, and C++ were gathered in cooperation with academic institutions. These included assignment files, project code, and exam solutions, covering a variety of coding styles and levels of difficulty.

To capture industrial relevance, open-source repositories on GitHub and GitLab were mined, and proprietary codebases were made available under non-disclosure agreements with industry collaborators. This material was incorporated to cover practical legal issues such as license violations and unauthorized reuse of code in real software processes.

To further increase the diversity of the dataset, synthetic plagiarism cases were created with automated scripts simulating real-world obfuscation strategies. These included renaming variables, rearranging loops, and injecting AI-generated code, for example from GitHub Copilot. Around 20 reviewers applied a consensus approach to tag each sample with the type and severity of plagiarism (verbatim, paraphrased, AI-generated, etc.).
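One of the obfuscation strategies described above, variable renaming, can be sketched with Python's standard ast module. This is an illustrative stand-in, not the authors' actual generation scripts; the class and function names are hypothetical.

```python
import ast
import builtins

# Builtin names (print, len, ...) must keep their identity or the code breaks.
SKIP = set(dir(builtins))

class RenameVariables(ast.NodeTransformer):
    """Rename every user-defined variable and argument to an opaque name (v0, v1, ...)."""
    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        if node.id not in SKIP:
            node.id = self._fresh(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

def obfuscate(source: str) -> str:
    """Produce a behavior-preserving, identifier-renamed copy of the source."""
    tree = RenameVariables().visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+
```

A renamed sample remains functionally identical, which is exactly what makes it useful as a labeled "paraphrased" case for training.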

3.3 Preprocessing and Normalization

The preprocessing step is one of the most important, as raw code is converted into data that machines and machine learning models can work with. It starts with tokenization, whereby source code is split into atomic units including identifiers, keywords, operators, and delimiters. This simplified syntactic representation makes comparisons easy, as irrelevant information such as whitespace and comments is eliminated.

Next, Abstract Syntax Tree (AST) parsing is used to infer the hierarchical structure of the code, which represents its logical flow. ASTs are ideal for identifying similar operations even where different names and syntax are used. In addition, Control Flow Graphs (CFGs) are created to represent the flow between code blocks, which reveals hidden algorithmic similarities.
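The claim that ASTs expose similar operations despite different names can be demonstrated with a small sketch: blank out identifiers and constants, then compare the dumped trees. This is a simplified illustration under the assumption of Python input; production systems would use richer tree matching.

```python
import ast

def ast_fingerprint(source: str) -> str:
    """Structural fingerprint: the AST dump with identifiers and constants blanked."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Blank every string-valued naming field (variable ids, args, names, attrs).
        for field in ("id", "arg", "name", "attr"):
            if isinstance(getattr(node, field, None), str):
                setattr(node, field, "_")
        if isinstance(node, ast.Constant):
            node.value = 0  # literal values are also irrelevant to structure
    return ast.dump(tree)
```

Two functions with the same logic but different names produce equal fingerprints, while a structurally different function does not.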

To achieve language independence, source code is also compiled into LLVM Intermediate Representation (IR). Through this platform-agnostic layer, the model can compare code written in Python, Java, and C++ in a uniform format, so that language-specific features do not complicate feature engineering.

3.4 Feature Extraction

Feature extraction occurs at both the syntactic and the semantic level so as to capture all the important characteristics of the input code. At the syntactic level, N-gram hashing operates on token sequences: by splitting code into overlapping chunks of tokens (e.g., trigrams), the model finds similar or identical chunks of code quickly. Syntactic analysis is further improved using AST subtree comparisons, which match code structures so that even changes in formatting or cosmetic structure are detected.
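The token n-gram idea can be made concrete with a few lines: build the set of overlapping trigrams and score overlap with the Jaccard coefficient. This is a generic sketch of the technique (in practice the n-grams would be hashed for speed, as the text notes); the function names are illustrative.

```python
def ngrams(tokens, n=3):
    """Set of overlapping n-grams (trigrams by default) of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_similarity(tokens_a, tokens_b, n=3):
    """Syntactic similarity as Jaccard overlap of the two n-gram sets."""
    a, b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Identical token streams score 1.0; superficially similar but differently-written code scores much lower, which is why this signal alone misses paraphrased plagiarism.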

For semantic feature extraction, transformer-type models (e.g., CodeBERT) produce contextual embeddings based on the intent and meaning of code snippets. These embeddings are dense vector representations on which high-level similarity comparisons can be made. Code2Vec takes this a step further by encoding AST paths into vectors that capture the logical dependencies between code entities.

To capture the dynamic behavior expressed by code functionality and control flow, Graph Neural Networks (GNNs) are trained on Control Flow Graphs. In addition, contrastive learning methods are used to align the embeddings of different programming languages. For example, two equivalent loops (one in Python and one in Java) can be positioned close together in the semantic space, which makes it possible to detect plagiarism across languages.

3.5 Model Development

The plagiarism detection system is built on two key architectures: Siamese Neural Networks and a Transformer-GNN hybrid. The Siamese architecture consists of two identical subnetworks that turn pairs of code samples into fixed-length vectors. These vectors are compared using cosine similarity, from which plagiarism is estimated. This arrangement is particularly effective for pairwise comparisons and is very scalable.
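The comparison step at the end of the Siamese pipeline is plain cosine similarity over the fixed-length vectors. A minimal sketch follows; the 0.85 decision threshold is an illustrative assumption, since the paper does not state the value it uses.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two fixed-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_plagiarized(u, v, threshold=0.85):
    """Flag a pair whose embeddings are closer than a tuned threshold (assumed value)."""
    return cosine_similarity(u, v) >= threshold
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so the threshold trades precision against recall.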

To gain additional semantic and structural insight, a hybrid approach is created by combining CodeBERT, which encodes semantics, with GNNs, which identify structural patterns. This combination enables the system to understand both meaning and logic, giving it the capacity to identify even advanced forms of plagiarism, such as obfuscation of code via logic distortion.

The model is trained on an 80/10/10 split of the dataset into training, validation, and testing sets. To support real-time inference at full scale, quantization is used, which shrinks the model and reduces memory consumption. The large transformer weights are converted to an 8-bit format, reducing latency while maintaining model accuracy.
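The 8-bit conversion mentioned above can be illustrated with a minimal sketch of symmetric per-tensor quantization. This is a simplified assumption about the scheme; real deployments would typically use a framework's quantization API with per-channel scales rather than hand-rolled code.

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to the int8 range with one scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0                 # largest magnitude maps to +/-127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the quantized representation."""
    return [qi * scale for qi in q]
```

Each weight is stored in one byte instead of four, and the round-trip error is bounded by half the scale factor, which is why accuracy is largely preserved.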

4. IMPLEMENTATION

Development of the proposed code plagiarism detection framework centers on creating a highly responsive, scalable, and modular system capable of real-time inference. The system uses robust back-end technologies that, coupled with modern orchestration tools, guarantee resilience both in academia and in industry. This chapter describes the real-time inference pipeline, the scalability architecture, and the integration pathways to learning management systems (LMS) and continuous integration/continuous deployment (CI/CD) environments.

4.1 Real-Time Inference Pipeline

The main component of the system is a real-time inference pipeline designed to handle large volumes of code submissions with low latency. The pipeline begins with a set of RESTful API endpoints, with /submit for code ingestion and /detect for similarity checks, among others. FastAPI, a Python-based asynchronous framework with non-blocking I/O, enables thousands of concurrent users to submit code simultaneously without performance degradation.

The submitted code is then fed through a Kafka stream, which distributes the data among the system's consumers. This allows parallel processing, in which preprocessing, feature extraction, and semantic comparison are performed concurrently on different worker nodes. The modularization of the pipeline keeps the average response latency below 150 milliseconds even at peak working capacity (e.g., during examinations or large-scale audits). Embedding tensors, tokenized representations, and intermediate results are cached in memory, allowing the framework to avoid duplicate calculations when the same requests are presented repeatedly.
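The producer/consumer shape of this pipeline (endpoint publishes, worker pool consumes in parallel) can be sketched with stdlib asyncio standing in for Kafka; all names here are illustrative, and the detection stages are stubbed out.

```python
import asyncio

async def worker(name, queue, results):
    """Consume submissions from the queue and run the (stubbed) detection stages."""
    while True:
        submission_id, code = await queue.get()
        # preprocessing, feature extraction, and comparison would run here
        results[submission_id] = f"processed by {name}"
        queue.task_done()

async def ingest(submissions, n_workers=4):
    """Fan submissions out to a pool of concurrent workers, like Kafka consumers."""
    queue = asyncio.Queue()
    results = {}
    workers = [asyncio.create_task(worker(f"w{i}", queue, results))
               for i in range(n_workers)]
    for item in submissions:
        queue.put_nowait(item)
    await queue.join()          # wait until every submission is handled
    for w in workers:
        w.cancel()
    return results
```

Because the workers await the queue rather than blocking, one process can keep many submissions in flight, which is the property the real FastAPI/Kafka stack provides at scale.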

4.2 Scalability Architecture

A versatile, cloud-native architecture is used to guarantee that the system remains performant under varying loads. The whole application is deployed in Docker containers on a Kubernetes cluster with horizontal pod auto-scaling. This auto-provisioning environment adds new resources once CPU or memory utilization exceeds a threshold. Load tests with up to 5,000 simultaneous code uploads demonstrated the system's capacity to scale to 50 inference pods (serving more than 220 requests per second) without compromising performance.

Significant optimization comes from Redis caching, which stores frequently accessed code embeddings and comparison results. This greatly reduces inference time on repeated or similar requests and speeds up responses in real-time applications where the same code may be checked repeatedly, e.g., in classroom usage or a CI/CD pipeline. Fast nearest-neighbor searches over code embeddings are achieved with Faiss, a high-performance similarity search engine built by Facebook AI. The Faiss library reduces comparison time substantially, since it indexes semantic vectors and retrieves their best matches in milliseconds.
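What Redis and Faiss provide here is, conceptually, an embedding store plus nearest-neighbor lookup. A toy stand-in makes the roles explicit; the brute-force scan below is exactly the operation Faiss accelerates with approximate indexes, and the class name is hypothetical.

```python
import math

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class EmbeddingIndex:
    """Toy stand-in for Redis (embedding store) plus Faiss (similarity index)."""
    def __init__(self):
        self.index = {}          # doc_id -> embedding vector

    def add(self, doc_id, vec):
        self.index[doc_id] = vec

    def nearest(self, query):
        """Exhaustive scan for the best cosine match; Faiss does this sublinearly."""
        return max(self.index.items(),
                   key=lambda kv: _cosine(query, kv[1]))
```

A query embedding returns the stored document whose vector points in the most similar direction, which is the candidate then reported for detailed comparison.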

Table-2: Scalability Architecture.

Component | Technology Used | Function
REST API | FastAPI | Code ingestion and detection interface
Streaming | Apache Kafka | Parallel processing of code submissions
Similarity Search | Faiss | Fast nearest neighbor retrieval
Caching | Redis | Store embeddings and results
Orchestration | Kubernetes | Auto-scaling and resource management


4.3 Integration with Academic and CI/CD Systems

A critical element of the proposed framework is its compatibility with both academic systems and industrial software pipelines. In educational scenarios, the system can be integrated directly into Learning Management Systems (LMS) such as Moodle, Canvas, or Blackboard. Through RESTful APIs, an LMS can automatically submit student work for plagiarism evaluation as part of the assignment grading process. The real-time feedback capability allows instructors to receive similarity results and semantic analysis reports immediately, discouraging academic dishonesty and supporting fair evaluation.

On the industrial side, the system can work with CI/CD tools such as Jenkins, GitLab CI, and GitHub Actions. Through webhooks or plug-in integrations, the detection system is activated automatically when developers push code to repositories, enabling organizations to audit code for reused or improperly licensed parts before it is deployed. This guarantees adherence to intellectual property rules and minimizes the risk of license infringement, particularly in environments where open-source and proprietary code are frequently mixed.

The framework supports several deployment models: it can be deployed on-premises, in cloud systems such as AWS and Azure, or in a hybrid configuration. It provides secure authentication, encryption of API messages, and access control policies, making it appropriate for handling confidential academic data and proprietary enterprise code.

5. RESULTS AND EVALUATION

The proposed real-time code plagiarism detection system was tested experimentally and intensively on a combination of academic and industrial data. The analysis covered four main areas: the experimental configuration, accuracy, real-time performance, and comparisons with other state-of-the-art tools. The findings corroborate earlier observations that the system is an excellent match for the requirements of educational and enterprise settings, achieving high accuracy, low latency, and wide language coverage across Python, Java, and C++.

5.1 Experimental Setup

The experimental setting was configured to reflect a true deployment environment and to support large volumes of data with high efficiency. The models were trained and served on an NVIDIA A100-based machine, which provides high memory bandwidth and deep learning acceleration. Live deployment tests were executed on an AWS EC2 c5.4xlarge instance to find a good compromise between computation and cost. To assess the framework's generality, it was tested with three popular programming languages: Python, Java, and C++. The test corpus consisted of more than 10,000 academic submissions and multiple gigabytes of industrial codebases collected from GitHub, GitLab, and proprietary software systems.

5.2 Accuracy Metrics

Accuracy was assessed with standard classification metrics: Precision, Recall, and F1-Score. Precision measures how often positive predictions (plagiarized code) are correct, and recall measures how well the system identifies all actual cases of plagiarism. The F1-score balances the two into a single value.
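These definitions can be written out directly, and the reported academic figures are internally consistent with them: the harmonic mean of 94% precision and 89% recall is about 91.4%, matching the stated F1-score.

```python
def precision(tp, fp):
    """Fraction of flagged submissions that are truly plagiarized."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of all plagiarized submissions that were flagged."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)
```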

When evaluated in the academic context, where plagiarism is mainly carried out through paraphrasing, renaming, and partial reuse, the system achieved a Precision of 94%, a Recall of 89%, and an average F1-Score of 91%. These measurements indicate that it performs very well in detecting different types of dishonest submissions without causing many false alarms. In industrial environments, where codebases are usually larger and more complicated, with obfuscated or minified pieces present, the system retained a good level of performance (accuracy between 85 and 91 percent, depending on the language and the structure of the target repository).

Table-3: Accuracy Metrics.

5.3 Real-Time Performance

Real-time responsiveness is one of the system's most important goals, because it allows immediate feedback to instructors and development teams. The latency and throughput of the framework were benchmarked under different loads. Mean latency between code submission and result retrieval was measured at 142 to 162 milliseconds, including preprocessing, feature extraction, semantic embedding, and similarity comparison. This was accomplished with a fully asynchronous pipeline coupled with Redis caching and quantized model inference to reduce the processing burden.


The system also proved highly scalable. It sustained a constant 200–220 requests per second in load testing, with horizontal pod auto-scaling on Kubernetes and distributed processing via Apache Kafka. These metrics show that the system can serve high-concurrency scenarios such as online examinations or the continuous integration pipelines of software development teams.

Table-4: Real-Time Performance.

Metric | Value | Tools/Environment
Latency | 142–162 ms | FastAPI, Redis, Quantized Model
Throughput | 200–220 req/sec | Kafka, Kubernetes

5.4 Comparative Analysis

To validate the merits of the proposed system, it was compared with other widely used systems: MOSS and standalone CodeBERT. MOSS, one of the best-known representatives of traditional token-based plagiarism detection, showed strong results on direct copy-paste cases, but heavy obfuscation and generative AI defeated it because of its purely syntactic nature. On the test dataset, MOSS achieved an F1-score of only 68%, while the proposed hybrid system reached 93%, a 25-point improvement on complex plagiarism cases.

Compared to CodeBERT applied on its own, which scores well on semantic comprehension but at the cost of longer inference time, the proposed framework was three times faster at inference while retaining roughly 98 percent of CodeBERT's accuracy. This is possible thanks to quantization, GNN-based control flow analysis, and embedding caching techniques. Faiss and Redis optimized the process enough to provide near-real-time results, and the combination of CodeBERT's contextual power with the structural reasoning of GNNs enabled robust semantic detection.

Table-5: Comparative Analysis.

System                  F1-Score   Latency      Strengths                               Limitations
MOSS                    68%        –            Strong on direct copy-paste             Fails with paraphrasing or AI-generated code
CodeBERT (standalone)   –          ~3x slower   High semantic accuracy                  Compute-intensive
Proposed System         93%        ~150 ms      Fast, scalable, semantic + structural   Slight overhead from hybrid model
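The embedding-caching and similarity lookup described above can be illustrated with a self-contained sketch. The character-frequency "embedding" and the in-process dict are placeholders for CodeBERT and Redis respectively, and the brute-force nearest-neighbour search stands in for Faiss's optimized index:

```python
import hashlib
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# In-process dict standing in for the Redis embedding cache; a real
# deployment would store CodeBERT vectors keyed by a hash of the source.
_cache = {}

def embed(code: str):
    key = hashlib.sha256(code.encode()).hexdigest()
    if key not in _cache:
        # Toy character-frequency "embedding", a placeholder for CodeBERT.
        _cache[key] = [code.count(c) for c in "abcdefghijklmnop"]
    return _cache[key]

def most_similar(query: str, corpus: list[str]) -> str:
    """Brute-force nearest neighbour; Faiss replaces this with ANN search."""
    q = embed(query)
    return max(corpus, key=lambda c: cosine(q, embed(c)))
```

Caching by content hash means a resubmitted or duplicated file never pays the embedding cost twice, which is one reason the reported latency stays low.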

These comparisons reinforce the effectiveness of the proposed model in striking a balance between accuracy and real-time performance, two metrics often at odds in existing plagiarism-detection tools.

6. CASE STUDIES

To evaluate the feasibility of the proposed real-time code plagiarism detection system, two case studies were conducted: one in an academic environment based on student submissions, and one in an industrial environment based on open-source license compliance. These case studies indicate how the system performs in real-world settings and, beyond technical validation, illustrate the broader impact it may have on academic integrity and the protection of intellectual property.

6.1 Academic Use Case

This academic case study was conducted with a partner university, where the plagiarism detection system was deployed in two undergraduate programming courses, one in Python and one in Java. The data set comprised 5,000 student assignments, mini-projects, and exam submissions. The main purpose was to assess how well the system detects AI-generated and plagiarized code in an active classroom setting and to record its longer-term impact on students.

During the semester, the system flagged 120 submissions as AI-generated, mostly traceable to tools such as GitHub Copilot. Semantic abnormalities were detected in them, including unnatural identifier-naming conventions, repetitive structure, and an absence of human-like variation. The framework recorded a precision of more than 95 percent on these flags, confirmed by instructors through manual review. Moreover, because it was integrated into the Learning Management System (LMS) in real time, instructors received immediate feedback, allowing timely interventions.
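One of the flagged signals, unnatural identifier naming, could be approximated with a simple heuristic. The score below is a hypothetical illustration, not the paper's actual detector: it treats unusually uniform identifier lengths as a weak hint of template-like, machine-generated code:

```python
import ast
import statistics

def identifiers(source: str) -> list[str]:
    """Collect every variable and function name used in Python source."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.append(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            names.append(node.name)
    return names

def uniformity_score(source: str) -> float:
    """Hypothetical signal: identifier lengths with near-zero variance
    hint at template-like naming (1.0 = perfectly uniform lengths)."""
    lengths = [len(n) for n in identifiers(source)]
    if len(lengths) < 2:
        return 0.0
    return 1.0 / (1.0 + statistics.pstdev(lengths))
```

A production detector would combine many such weak signals with the learned CodeBERT features rather than rely on any single heuristic.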

Owing to the presence of the plagiarism detection framework, overall academic dishonesty fell by 30 percent compared with the previous semester. Students reported being more careful about using unauthorized tools or obtaining peers' code once they realized that the probability of similarities being discovered, in both structure and meaning, was very high.

Table-6: Academic Use Case.

This case study highlights the framework's effectiveness not only as a technical tool for identifying misconduct but also as a behavioral deterrent. By providing transparent and immediate feedback, the system helps preserve the integrity of assessments and encourages genuine learning.

6.2 Industrial Use Case

In the industrial case, the plagiarism detection framework was deployed at a Fortune 500 software firm during an audit of its in-house C++ source code, comprising over 10 million lines. The goal was to detect possible infringements of open-source licenses, especially the GNU General Public License (GPL), whose terms require derivative works of GPL-licensed programs to also be open-source.

The framework ran similarity scans against hundreds of thousands of open-source code snippets, using its cross-language semantic analysis capabilities and integrations with public repositories such as GitHub and GitLab. The combination of the Faiss-based embedding-similarity engine, CodeBERT, and GNN models identified 45 suspicious code segments. After a comprehensive legal and manual review, 15 of those segments were confirmed as violations, including unauthorized reuse of GPL-licensed components.
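Large-scale snippet matching of this kind is classically done with fingerprinting; the winnowing scheme cited in reference [6] is a compact example. The sketch below illustrates the idea with whitespace tokenization as a simplifying assumption (a real scanner would use a language-aware lexer):

```python
import hashlib

def kgram_hashes(tokens: list[str], k: int = 5) -> list[int]:
    """Hash every overlapping k-gram of the token stream."""
    return [int(hashlib.md5(" ".join(tokens[i:i + k]).encode()).hexdigest(), 16)
            for i in range(len(tokens) - k + 1)]

def winnow(hashes: list[int], w: int = 4) -> set[int]:
    """Keep the minimum hash of each window of w consecutive k-gram
    hashes, per the winnowing scheme of Schleimer et al. (2003)."""
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def fingerprint(code: str, k: int = 5, w: int = 4) -> set[int]:
    return winnow(kgram_hashes(code.split(), k), w)

def overlap(a: str, b: str) -> float:
    """Jaccard overlap of two fingerprints, used as a similarity score."""
    fa, fb = fingerprint(a), fingerprint(b)
    return len(fa & fb) / max(len(fa | fb), 1)
```

Because fingerprints are small sets of hashes, a repository of open-source snippets can be indexed once and each proprietary file checked against it in near-constant time per fingerprint.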

One of the most significant findings was a JSON parser originally created under the GPL that had been replicated in a proprietary module of the company without appropriate acknowledgment or compliance. Had it gone undetected, this could have led to lawsuits and fines. According to rough estimates by the company's lawyers, early detection of these problems spared the company penalties potentially exceeding $2 million, while preserving its reputation for compliance in subsequent product releases.

Table-7: Industrial Use Case.

The presented case study shows that the proposed system is useful for auditing code at enterprise scale. Its ability to scan very large, heterogeneous repositories for semantically similar code at high speed makes it a worthwhile tool for protecting intellectual property and mitigating legal risk in industry.

7. CONCLUSION AND FUTURE WORK

7.1 Conclusion

In this study, a real-time framework for code plagiarism detection was developed, combining Natural Language Processing (NLP), Machine Learning (ML), and scalable system design. To overcome the drawbacks of traditional plagiarism software, which mostly relies on syntactic similarity checks, the proposed model incorporated semantic analysis via CodeBERT embeddings, structural analysis with Graph Neural Networks (GNNs), and syntactic checks based on tokens and Abstract Syntax Trees (ASTs). This hybrid approach enabled the system to identify complex patterns of plagiarism such as paraphrased code, AI-generated scripts, and cross-language matches.
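The AST-based component's resistance to variable renaming can be demonstrated with Python's own `ast` module: comparing node-type sequences ignores identifier names entirely. This is an illustrative simplification of the framework's AST check, not its exact implementation:

```python
import ast

def ast_shape(source: str) -> list[str]:
    """Node-type sequence of the AST; identifier names are abstracted
    away, so renaming variables does not change the shape."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def structural_match(a: str, b: str) -> bool:
    """True when two snippets have an identical AST node-type sequence."""
    return ast_shape(a) == ast_shape(b)

# Variable renaming leaves the structure untouched:
original = "def f(x):\n    return x * 2"
renamed  = "def g(value):\n    return value * 2"
print(structural_match(original, renamed))  # -> True
```

Changing the actual operation (e.g. `*` to `+`) alters the node sequence, so genuinely different logic is not matched.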

The architecture was implemented as a modular, cloud-native solution deployed on FastAPI, Kafka, Redis, Faiss, Docker, and Kubernetes, which provides its scalability and real-time operation. According to the experimental findings, accuracy was strong in both academic and industrial applications, with F1-scores as high as 91 percent, latency under 150 milliseconds, and throughput of more than 200 requests per second. Its effectiveness was confirmed by the case studies: it cut academic cheating rates by 30%, and in industry it detected license abuse that spared a Fortune 500 company legal fines estimated at over $2 million. Such results highlight the model's capacity to balance precision, scalability, and deployment efficiency across heterogeneous environments.


7.2 Limitations

The proposed framework has several limitations despite its good performance. One key issue is its behavior on low-resource programming languages, e.g. Rust, Go, or domain-specific scripting languages, which are underrepresented in the training data sets. This limits the system's generalizability to new or less frequently used languages unless additional training data is first obtained and curated.

Also, despite applying quantization and caching tricks to optimize the model, it remains GPU-demanding, particularly during semantic-embedding generation and GNN processing. Organizations or institutions with limited access to sophisticated computing infrastructure may be unable to deploy the system at scale without sacrificing performance. In addition, the need to compile code into intermediate representations (LLVM IR or the like) for cross-language analysis adds preprocessing complexity and is sometimes simply out of reach in restricted development environments.
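The quantization referred to here can be sketched at the level of a single embedding vector: symmetric int8 quantization stores each float as one byte plus a shared scale, roughly quartering memory versus float32. A minimal illustration, not the exact scheme used in this work:

```python
def quantize_int8(vec):
    """Symmetric linear quantization of a float vector to int8 range.

    Each value is mapped to round(v / scale) with scale chosen so the
    largest magnitude lands on +/-127."""
    scale = max(abs(v) for v in vec) / 127 or 1.0
    return [round(v / scale) for v in vec], scale

def dequantize(q, scale):
    """Recover an approximation of the original floats."""
    return [v * scale for v in q]

emb = [0.12, -0.97, 0.45, 0.0]
q, s = quantize_int8(emb)
approx = dequantize(q, s)
# Reconstruction error stays within half a quantization step (scale / 2)
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(emb, approx))
```

The same trade-off applies to model weights: shrinking precision cuts GPU memory and speeds inference at the cost of a small, bounded accuracy loss.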

7.3 Future Work

Going forward, the present system can be improved and extended along several avenues. Edge deployment is one area of potential significance: a lightweight, quantized version of the model could run on decentralized hardware such as laptops or on-premises servers. This would make the system more accessible to the education sector and to smaller organizations with fewer cloud resources, and would further reduce latency by removing the dependence on centralized servers.

Multilingual training is another active direction of development. Including a wider set of programming languages and dialects in the training corpus would make the system more robust across a variety of software ecosystems. This encompasses newer languages such as Kotlin and Swift, as well as scripting languages specialized for bioinformatics or blockchain programming.

REFERENCES

1. A. Aiken, “MOSS (Measure of Software Similarity).” [Online]. Available: https://theory.stanford.edu/~aiken/moss/. [Accessed: Jun. 10, 2025].

2. L. Prechelt, G. Malpohl, and M. Philippsen, “JPlag: Finding plagiarisms among a set of programs,” Universität Karlsruhe, Fak. für Informatik, Tech. Rep. 2002-1, 2002.

3. T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A multilinguistic token-based code clone detection system for large scale source code,” IEEE Trans. Softw. Eng., vol. 28, no. 7, pp. 654–670, Jul. 2002.

4. C. Roy and J. Cordy, “A survey on software clone detection research,” Queen’s University, Tech. Rep. 541, 2007.

5. M. Ducasse, L. F. Pouzet, and P. Pons, “Using software metrics to classify source code,” J. Syst. Softw., vol. 79, no. 7, pp. 952–964, Jul. 2006.

6. S. Schleimer, D. S. Wilkerson, and A. Aiken, “Winnowing: Local algorithms for document fingerprinting,” in Proc. 2003 ACM SIGMOD Int. Conf. Management of Data, San Diego, CA, USA, 2003, pp. 76–85.

7. M. Joy and M. Luck, “Plagiarism in programming assignments,” IEEE Trans. Educ., vol. 42, no. 2, pp. 129–133, May 1999.

8. U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “Code2Vec: Learning distributed representations of code,” in Proc. ACM POPL, Jan. 2019, pp. 40:1–40:29.

9. S. Wang et al., “CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Proc. EMNLP, Nov. 2021.

10. F. Zuo et al., “Neural machine translation inspired binary code similarity comparison beyond function pairs,” arXiv preprint arXiv:1808.04706, 2018.

11. M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to represent programs with graphs,” in Proc. ICLR, 2018.

12. Y. Lu et al., “CodeXGLUE: A benchmark dataset and open challenge for code intelligence,” arXiv preprint arXiv:2102.04664, 2021.

13. M. Zakeri-Nasrabadi et al., “A systematic literature review on source code similarity measurement and clone detection,” arXiv preprint arXiv:2306.16171, 2023.

14. H. Patil et al., “Code plagiarism and originality detection using machine learning,” Int. J. Intell. Syst. Appl. Eng., vol. 12, no. 3, pp. 209–215, Mar. 2024.

15. F. Ebrahim and M. Joy, “Source code plagiarism detection with pre-trained model embeddings,” in Proc. RANLP, Sep. 2023, pp. 301–309.


16. S. Surendran, “Plagiarism detection in source code using machine learning,” M.S. thesis, Univ. Windsor, Canada, 2024.

17. G. Mostaeen et al., “A machine learning-based framework for code clone validation,” arXiv preprint arXiv:2005.00967, 2020.

18. O. Karnalim, “TF-IDF inspired detection for cross-language source code plagiarism,” Comput. Sci., vol. 21, no. 1, Jan. 2020.

19. R. Franclinton, O. Karnalim, and M. Ayub, “A scalable code similarity detection with online architecture,” Int. J. Online Biomed. Eng., vol. 16, no. 10, pp. 40–52, 2020.

20. E. Hosam et al., “Classification feature sets for source code plagiarism detection in Java,” J. Eng. Appl. Sci., vol. 69, no. 1, Nov. 2022.

21. M. M. Zahid et al., “An efficient ML approach for plagiarism detection in text documents,” J. Comput. Biomed. Informatics, vol. 4, no. 2, pp. 241–248, Mar. 2023.

22. O. Kamat et al., “Plagiarism detection using machine learning,” arXiv preprint arXiv:2412.06241, Dec. 2024.

23. B. Gupta and A. Arora, “Real-time plagiarism detection system for programming assignments using AST,” in Proc. ICIIP, 2017.

24. A. Sharma and R. Kumar, “Token-based hybrid plagiarism detection system using machine learning,” Int. J. Comput. Appl., vol. 182, no. 40, pp. 32–37, Apr. 2019.

25. S. Lee and D. Kim, “Automatic identification of code reuse and intellectual property violations in large repositories,” in Proc. ICSE, May 2021, pp. 1082–1093.

26. V. Saini, S. Ghosh, and Y. Zhao, “Detecting license violations in open source software,” Empirical Softw. Eng., vol. 25, pp. 621–647, 2020.

27. A. Bakar, S. Ghani, and N. M. Noor, “Software plagiarism detection: A review of techniques and tools,” J. Theor. Appl. Inf. Technol., vol. 95, no. 22, pp. 6157–6166, 2017.

28. R. R. Ramalingam and P. V. V. Kishore, “Automated code plagiarism detection using neural networks,” in Proc. ICCCNT, 2022.

29. L. Sulistiani and O. Karnalim, “ES-Plag: Efficient and sensitive plagiarism detection tool for academic environments,” Comput. Appl. Eng. Educ., vol. 27, no. 1, pp. 166–182, Sep. 2018.

30. X. Xu et al., “Neural network-based graph embedding for binary code similarity detection,” arXiv preprint arXiv:1708.06525, 2017.

31. P. Sheard et al., “Upholding academic integrity in programming courses with automated code review,” in Proc. FIE, Oct. 2018.

32. J. Wang et al., “An end-to-end system for scalable source code similarity detection,” in Proc. MSR, May 2022, pp. 105–116.

33. N. Islam et al., “AI-generated code and academic dishonesty: Detection and implications,” J. Ethics Educ. Technol., vol. 4, no. 2, pp. 12–26, 2024.

34. A. Mehta and K. Shukla, “Cross-language plagiarism detection using IR-based models,” in Proc. ICSC, Jan. 2025.
