
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 12 Issue: 12 | Dec 2025

p-ISSN: 2395-0072

www.irjet.net

Robust K-Means Clustering: A Unified Framework for Outlier Removal and Adaptive Distance Metrics

Md. Mayn Uddin

Dept. of Electrical and Electronic Engineering, Jatiya Kabi Kazi Nazrul Islam University, Trishal, Mymensingh-2224, Bangladesh

Abstract - Outliers and inappropriate distance metrics remain major challenges in improving the accuracy of K-Means clustering. Outliers distort centroid estimation, while conventional distance measures often fail to capture true similarity among data points. This paper proposes a unified modification to the K-Means algorithm that combines systematic outlier detection and elimination before centroid calculation with a novel adaptive distance metric for cluster assignment. By reducing the influence of anomalous data and refining similarity measurement, the modified algorithm achieves faster convergence and significantly higher clustering accuracy. An exploratory evaluation on nine benchmark multivariate datasets demonstrates up to 81% improvement in clustering performance compared to traditional K-Means. The study emphasizes the significance of robust preprocessing and metric design in unsupervised learning, and its practical implementation is illustrated using Python.

Key Words: K-Means Clustering, Outliers, Outlier Removal, Adaptive Distance Function.

1. INTRODUCTION

1.1 Background

Most data sets contain outliers. Their prevalence is even more noticeable in large datasets, since these are often assembled by automated systems, which means that probably no one is manually checking for anomalies. Modern sensing systems also typically favor ease of data collection over accuracy. Verification is usually the most expensive part, and people assume that the outliers will be handled later. In general, outliers are data patterns that deviate from the normal or expected behavior of the rest of the data [1]. Given this definition, outliers are neither bad nor good; they are just outliers. They arise in settings such as detection problems in finance, illegal hunting or deforestation in the natural sciences, and changes in society's behavior in the social sciences, among others. These cases nevertheless have something in common: they are all interesting. The interestingness, or real-life relevance, of outliers is a key characteristic of an anomaly [1].

K-Means is an unsupervised clustering algorithm designed to partition unlabeled data into a chosen number (that is, "K") of distinct groups. In other words, K-Means finds observations that share important characteristics and groups them into clusters. A good clustering solution finds clusters such that the observations within each cluster are more similar to one another than to observations in other clusters. In K-Means, each cluster is represented by its center of gravity (called the "centroid"), and every data point is assigned to one cluster. Centroids are themselves points that represent the center (mean) of the cluster and do not necessarily have to be members of the dataset. The algorithm runs an iterative process until each data point is closer to the centroid of its own cluster than to the centroid of any other cluster, reducing the distance between the data points and their cluster centroids at each step. For example, setting "K" to 2 groups the records into two clusters, and setting "K" to 4 groups them into four clusters. K-Means starts the process with randomly selected data points as the proposed centroids of the groups, iteratively recalculates the new centroids, and converges on the final clustering of the data points.

Specifically, the procedure is as follows:
1. The algorithm randomly selects a centroid for each cluster. For example, if "K" is set to 3, the algorithm randomly picks 3 centroids.
2. K-Means assigns every data point in the dataset to the nearest centroid. That is, a data point is considered to belong to a given cluster if it is closer to that cluster's centroid than to any other centroid.
3. For each cluster, the algorithm recalculates the centroid by taking the mean of all points in the cluster, which reduces the total within-cluster variance relative to the assignment of the previous step. Because the centroids change, the algorithm reassigns points to the closest centroid.
4. The algorithm repeats the centroid calculation and point assignment until the total distance between the data points and their corresponding centroids is minimized, the maximum number of iterations is reached, or the centroid values no longer change.
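The iterative procedure described above (random initial centroids, assignment to the nearest centroid, centroid recomputation, repetition until nothing changes) can be sketched in a few lines of Python. This is only a minimal illustration of the traditional K-Means baseline, not the modified algorithm proposed in this paper. The function name `kmeans`, the helper `assign`, the parameters `max_iter` and `seed`, the use of Euclidean distance, and the rule for empty clusters are all assumptions made for the example.

```python
import math
import random


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, max_iter=100, seed=0):
    """Baseline K-Means on a list of equal-length numeric tuples.

    Assumptions for this sketch: Euclidean distance, initial centroids
    drawn at random from the data, no outlier handling.
    """
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]

    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        labels = assign(points, centroids)

        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Assumed rule: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[j])

        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    # Final labels are computed against the returned centroids.
    return centroids, assign(points, centroids)


# Two well-separated groups of points (made-up illustrative data).
data = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
        (100.0, 0.0), (101.0, 0.0), (102.0, 0.0)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Because each centroid is a plain mean, a single extreme point added to `data` would pull its cluster's centroid toward it, which is the sensitivity to outliers that motivates this work.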

Impact Factor value: 8.315

ISO 9001:2008 Certified Journal

Page 1232