A New Supervised Term Weight Measure for Text Classification
Text classification assigns textual documents to known categories; its main aim is to identify the category of an unseen document. Researchers have proposed several approaches to the text classification problem. In these approaches, document representation is an important step for improving performance. The Vector Space Model is widely used for document representation. Selecting the features that represent the document vector is a crucial step, and researchers have experimented with different weight measures to specify the importance of each feature in that vector. In this work, we propose a new term weight measure for calculating the feature weights. Machine learning algorithms build classification models from the document vectors. In our experiments, the Random Forest and Naive Bayes Multinomial algorithms are used to build the classification model, which is then used to predict the category of a new document. Two popular benchmark datasets, 20-Newsgroup and Reuters-21578, are used for experimentation. The new term weight measure attained good accuracy for text classification when compared with popular existing techniques.
Keywords: Text Classification, 20-Newsgroup, Reuters-21578, Term Weight Measures, Machine Learning Algorithms.
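
The pipeline summarized in the abstract (weight the terms, represent each document as a vector, train a classifier, predict the category of a new document) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the abstract does not define the proposed weight measure, so a smoothed inverse document frequency is used as a stand-in, and the helper names (`idf_weights`, `vectorize`, `train_multinomial_nb`, `predict`, `classify`) and the four-sentence toy corpus are hypothetical and are not taken from 20-Newsgroup or Reuters-21578.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; no preprocessing is specified)."""
    return text.lower().split()


def idf_weights(docs):
    """Stand-in term weight: smoothed IDF. NOT the measure proposed in the paper."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(tokens, weights):
    """Sparse Vector Space Model representation: term frequency times term weight."""
    tf = Counter(tokens)
    return {t: f * weights[t] for t, f in tf.items() if t in weights}


def train_multinomial_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes over weighted counts with Laplace smoothing."""
    vocab = list(vocab)
    class_docs = Counter(labels)
    totals = defaultdict(lambda: defaultdict(float))
    for vec, y in zip(vectors, labels):
        for t, w in vec.items():
            totals[y][t] += w
    model = {}
    for y, n_docs in class_docs.items():
        denom = sum(totals[y].values()) + alpha * len(vocab)
        log_like = {t: math.log((totals[y][t] + alpha) / denom) for t in vocab}
        model[y] = (math.log(n_docs / len(labels)), log_like)
    return model


def predict(model, vec):
    """Return the category with the highest posterior log-score."""
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(w * log_like[t] for t, w in vec.items())
    return max(model, key=score)


# Toy training corpus, invented purely for illustration.
train = [
    ("the team won the hockey game", "sport"),
    ("the pitcher threw a fast ball in the game", "sport"),
    ("the stock market fell as oil prices rose", "finance"),
    ("the bank raised interest rates on loans", "finance"),
]
docs = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]

weights = idf_weights(docs)
vectors = [vectorize(tokens, weights) for tokens in docs]
model = train_multinomial_nb(vectors, labels, weights.keys())


def classify(text):
    """Predict the category of a new document; unseen terms are ignored."""
    return predict(model, vectorize(tokenize(text), weights))


print(classify("hockey game tonight"))
print(classify("oil prices and interest rates"))
```

To try a different weight measure, only `idf_weights` needs to be replaced, provided the replacement still returns a mapping from term to weight. A supervised measure would additionally take the training labels as input.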