News Categorization of NYT Articles using POS Tagging and TF-IDF Approach
Abstract
As the demand for data grows in the search for insight into today's developments, so does the need to structure that data so that it can serve as an intuitive and reliable reference. Different domains, such as the automotive industry, transport, and academia, organize data in different ways. This study focuses on organizing data in the media domain, specifically news. New York Times (NYT) news articles are classified into three categories: Sports, National, and Business. News headlines are first preprocessed, then filtered by part-of-speech (POS) combinations and weighted using TF-IDF. Support vector machine (SVM), Naïve Bayes, and logistic regression classifiers are used for categorization. The results indicate that SVM outperforms the other classifiers. In addition, the combination of nouns and adjectives proves to carry the most information for correctly classifying news headlines.
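As a rough illustration of the feature-extraction steps summarized above (POS filtering followed by TF-IDF weighting), the following minimal Python sketch may help. It is not the study's implementation. The toy POS lexicon `TOY_POS`, the helper names `filter_by_pos` and `tfidf`, and the example headlines are all hypothetical. A real pipeline would use a trained POS tagger, and the exact TF-IDF variant used in the study is not stated here, so the plain tf × log(N/df) form is assumed.

```python
import math
from collections import Counter

# Hypothetical toy POS lexicon (an assumption for illustration only);
# a real pipeline would obtain tags from a trained POS tagger.
TOY_POS = {
    "team": "NOUN", "wins": "VERB", "final": "ADJ", "match": "NOUN",
    "stocks": "NOUN", "fall": "VERB", "weak": "ADJ", "earnings": "NOUN",
    "senate": "NOUN", "passes": "VERB", "new": "ADJ", "bill": "NOUN",
}


def filter_by_pos(headline, keep=("NOUN", "ADJ")):
    """Lowercase and split a headline, keeping only tokens whose tag is in `keep`.

    Tokens missing from the toy lexicon are dropped.
    """
    tokens = headline.lower().split()
    return [t for t in tokens if TOY_POS.get(t) in keep]


def tfidf(docs):
    """Compute tf * log(N / df) for every term of every tokenized document.

    tf is the term count divided by the document length (assumed variant).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up example headlines, one per category named in the abstract.
headlines = [
    "Team wins final match",          # Sports
    "Stocks fall on weak earnings",   # Business
    "Senate passes new bill",         # National
]

filtered = [filter_by_pos(h) for h in headlines]
weights = tfidf(filtered)

for headline, w in zip(headlines, weights):
    print(headline, "->", w)
```

The resulting weight dictionaries would then be turned into feature vectors and passed to a classifier such as SVM, Naïve Bayes, or logistic regression. That step is omitted here to keep the sketch self-contained.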