
STRATEGIES FOR CURATING HIGH-QUALITY DATASETS TO TRAIN EFFECTIVE ML MODELS


International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 11 Issue: 05 | May 2024

www.irjet.net

p-ISSN: 2395-0072

Senthilbharanidhar BoganaVijaykumar
Bharathiar University, India

ABSTRACT: Data quality and relevance significantly impact the performance of machine learning (ML) models. This article discusses the importance of data collection, cleaning, preprocessing, and model evaluation metrics in ML workflows. We explore various sampling techniques and their applications, and address the challenges of imbalanced or insufficient datasets through resampling methods such as the synthetic minority over-sampling technique (SMOTE) and bootstrapping. The article emphasizes the role of model evaluation metrics such as accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC) in assessing how well trained ML models perform and how well they generalize [5], [8]. Effective data collection involves gathering relevant information from diverse sources while ensuring representativeness and variety [6]. Data cleaning and preparation, including handling missing values, treating outliers, and scaling features, are crucial steps before ML model training [7].

Keywords: Data Quality, Sampling Techniques, Imbalanced Datasets, Data Preprocessing, Model Evaluation Metrics

INTRODUCTION: Model evaluation metrics such as accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC) are widely used to assess how well trained ML models perform and how well they generalize [8]. However, imbalanced datasets and insufficient data make it difficult to build accurate and reliable models.
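To make the metrics named above concrete, the following is a minimal sketch in plain Python, not code from the article. The function names `binary_metrics` and `roc_auc` and the toy labels are assumptions chosen for illustration. The sketch computes accuracy, precision, recall, and F1-score from confusion-matrix counts, and AUC-ROC as the fraction of positive/negative pairs that the scores rank correctly.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Convention assumed here: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a degenerate model that never predicts the positive class
metrics = binary_metrics(y_true, always_negative)

# Small hypothetical scored example for AUC-ROC.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

On this toy data the always-negative model reaches 95% accuracy while its recall and F1-score are both zero, which illustrates why accuracy alone can be misleading on imbalanced datasets. The pairwise AUC computation is quadratic in the number of examples, so it is suitable for illustration only; a production workflow would typically rely on an established library implementation.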

© 2024, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 2111

