A Comparative Study on Classification Algorithms Using Different Feature Extraction And Vectorization Techniques For Text

Nadia Anjum, Dr Srinivasu Badugu

Abstract

We live in a world where information has great value, and the volume of information available in text documents has grown to the point that identifying what is relevant to us has become a problem. When this data is divided into categories, users can navigate to the information they want. Most of this data is text, which is where text classification comes in. The aim of this paper is to classify documents automatically into their classes while comparing different feature extraction and vectorization techniques. Document classification requires machine learning (ML) techniques. The ML techniques we employ are Support Vector Machine (SVM) and Naïve Bayes (NB). The feature extraction techniques we implement are stemming and lemmatization, and we note how the algorithms differ in performance under each feature extraction technique and vectorization approach. We use two vectorization techniques: count vectorization and term frequency-inverse document frequency (TF-IDF) vectorization. The results show that the relative performance of the feature extraction and vectorization methods depends on the type of content and the metric: in some cases one method outperforms the others, and in other cases the reverse holds.
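
To make the difference between the two vectorization techniques concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes whitespace tokenization and the textbook weighting tf × log(N / df); the abstract does not state which TF-IDF variant or library was used, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real
    # preprocessing (stemming or lemmatization would be applied here).
    return text.lower().split()


def build_vocabulary(docs):
    # Sorted list of every distinct token seen in the corpus.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def count_vectors(docs, vocab):
    # Count vectorization: one raw term count per vocabulary entry.
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[term] for term in vocab])
    return vectors


def tfidf_vectors(docs, vocab):
    # TF-IDF: scale each raw count by log(N / df), where df is the
    # number of documents that contain the term (textbook variant).
    n_docs = len(docs)
    counts = count_vectors(docs, vocab)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(len(vocab))] for row in counts]


# Hypothetical three-document corpus, for illustration only.
docs = [
    "the match was won by the home team",
    "the election was won by the opposition",
    "the team lost the match",
]
vocab = build_vocabulary(docs)
count_matrix = count_vectors(docs, vocab)
tfidf_matrix = tfidf_vectors(docs, vocab)
```

In this sketch the word "the" has a count of 2 in every document, but its TF-IDF weight is 0 because it occurs in all three documents, while a term such as "election" that occurs in only one document keeps a positive weight. Either matrix could then be passed to a classifier such as SVM or NB.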
