An Optimal Data Preparation and Feature Extraction Methodology for Classification Algorithms


Rakshitha kiran P, Dr. Naveen N C


Data preprocessing is one of the most important stages in data mining, yet it is often neglected. This stage involves transforming raw data into a usable format. Real-world data tends to be noisy, incomplete, and inconsistent, and may lack certain behavioral patterns. It is therefore important to preprocess such data before using it for analysis. This paper summarizes various data preprocessing methods for PCOS (Polycystic Ovary Syndrome) datasets. PCOS is a common hormonal disorder among women aged 19 to 35. Initially, the PCOS dataset is preprocessed using Multiple Imputation, a discretization method that converts the data into discrete values, the Standard Scaler and Min-Max Scaler methods for feature scaling, and RobustScaler() to handle outliers. After the preprocessing stage, a feature extraction procedure is carried out in which the most relevant features are extracted. The datasets are then classified using K Nearest Neighbors (KNN), Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and Artificial Neural Network (ANN) classifiers. The classification models are evaluated on the accuracy, precision, recall, and F1-score performance metrics. This paper compares model performance with and without the data preprocessing stage, and also with feature extraction. The results show that preprocessing combined with feature extraction significantly improves model performance.
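To illustrate how the three scaling methods named in the abstract differ when a feature contains an outlier, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names and the toy feature values are hypothetical, and the formulas follow the usual definitions (z-score with population standard deviation, min-max to the range [0, 1], and median/interquartile-range scaling, which is the idea behind RobustScaler()).

```python
import statistics


def standard_scale(values):
    """Z-score scaling: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    # Assumption: fall back to 1.0 for a constant feature to avoid dividing by zero.
    std = statistics.pstdev(values) or 1.0
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]


def robust_scale(values):
    """Robust scaling: subtract the median, divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0
    return [(v - median) / iqr for v in values]


# Hypothetical toy feature with one extreme value (not taken from the PCOS data).
feature = [1.0, 2.0, 3.0, 4.0, 100.0]

for name, scaler in [
    ("standard", standard_scale),
    ("min-max", min_max_scale),
    ("robust", robust_scale),
]:
    print(name, [round(v, 3) for v in scaler(feature)])
```

In this sketch the extreme value compresses the remaining points under min-max and standard scaling, whereas the median and interquartile range used by robust scaling are not affected by it. Note that robust scaling only reduces the influence of the outlier on the scale; the outlying point itself remains in the data.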
