A Brief Survey on Text Document Classification

Vishnu Panickar, Priyanka Kashyap, Ashish Kawale, Sujit Pradhan, Nihar M. Ranjan

Abstract

Data is considered the backbone of the IT sector. A large volume of data is generated across the internet every day, and most of it is unstructured, such as data collected from social media sites like Facebook and Twitter. Data from these sources can be vital to organizations for their business. It is typically stored as large document files, and maintaining, retrieving, and organizing these files is difficult because their contents are not known in advance. Manually reading every file and assigning it a label is not practical. A document classifier can make this process easier and more efficient. A text classifier analyses the data and tags it with a label that matches its contents. Document classification is a branch of text classification in which the classifier assigns each document a class from a list of predefined classes, which makes organizing and maintaining the files easier. Document classification can be used in library science, where large amounts of data from various fields are stored for analysis and decision making. Traditional machine learning techniques for text classification rely on the bag-of-words representation of documents to generate features, and this representation ignores context. Moreover, these models require lexical features such as unigrams, bigrams, or n-grams, whose presence or absence is marked in the labelled documents. Such feature representations raise serious issues, notably data sparsity and the curse of dimensionality. Therefore, to overcome these limitations, we will use neural networks to build the classification model. Neural networks have an edge over traditional classification models because of their effective feature extraction and their ability to maintain correlations between words.
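The limitation of bag-of-words features noted above can be illustrated with a minimal sketch. It assumes whitespace tokenization and three made-up example sentences; the helper names (`ngrams`, `bag_of_ngrams`, `vectorize`, `sparsity`) are hypothetical and are not taken from any particular library or from the surveyed work. It shows that two sentences with opposite meanings receive identical unigram vectors, and that most entries in the resulting count vectors are zero even on a tiny corpus.

```python
from collections import Counter
from itertools import chain


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, n=1):
    """Count n-grams in a lowercased, whitespace-tokenized text (an assumption)."""
    return Counter(ngrams(text.lower().split(), n))


def vectorize(docs, n=1):
    """Map each document to a count vector over the shared n-gram vocabulary."""
    bags = [bag_of_ngrams(d, n) for d in docs]
    vocab = sorted(set(chain.from_iterable(bags)))
    return vocab, [[bag[g] for g in vocab] for bag in bags]


def sparsity(vectors):
    """Fraction of zero entries across all count vectors."""
    cells = [v for row in vectors for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)


# Illustrative documents only; not data from the survey.
docs = [
    "the dog bit the man",
    "the man bit the dog",
    "stock prices fell sharply today",
]

uni_vocab, uni_vecs = vectorize(docs, n=1)
bi_vocab, bi_vecs = vectorize(docs, n=2)

# Unigram counts cannot tell the first two sentences apart: word order is lost.
print("identical unigram vectors:", uni_vecs[0] == uni_vecs[1])
# Bigrams recover some local order, so the two vectors differ.
print("identical bigram vectors:", bi_vecs[0] == bi_vecs[1])
# Fraction of zero entries in each representation.
print("unigram sparsity:", sparsity(uni_vecs))
print("bigram sparsity:", sparsity(bi_vecs))
```

On a realistic corpus the vocabulary, and with it the vector dimension, runs to many thousands of n-grams while each document contains only a small fraction of them, which is the sparsity and dimensionality problem described above.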