Document Type-based Categorization using Machine Learning
Digital libraries store heterogeneous documents in digital format, such as eBooks, research papers, and theses. Once documents are stored, the main problem is retrieving them efficiently according to the user's needs; in particular, it is difficult to search for documents of a specific type. Documents in digital libraries are usually classified manually. Automated document classification based on machine learning can save considerable time and resources and improve retrieval performance. It also raises new challenges, such as identifying features from full-text documents and choosing an efficient algorithm. This paper discusses these issues and addresses them by discovering suitable features. Based on these features, eight machine learning models are trained and tested. Evaluation of the eight models shows that the decision tree algorithm produces the most promising results. Further experiments show that document categorization performed during indexing adds intelligence to search and significantly improves retrieval performance.
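To illustrate what categorization at indexing time could look like, the following minimal Python sketch stores a predicted document type next to each entry of a small inverted index, so that a search can be restricted to one type. It is not the paper's implementation: the names `predict_type` and `TypedIndex`, the keyword rules, and the three sample documents are all hypothetical, and `predict_type` is only a rule-based stand-in for a trained model such as a decision tree.

```python
from collections import defaultdict


def predict_type(text):
    """Hypothetical stand-in for a trained classifier (e.g. a decision tree).

    The keyword rules below are illustrative assumptions, not features
    taken from the paper.
    """
    lowered = text.lower()
    if "submitted in partial fulfillment" in lowered:
        return "thesis"
    if "abstract" in lowered and "references" in lowered:
        return "research_paper"
    return "ebook"


class TypedIndex:
    """Tiny inverted index that records a predicted type per document."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.doc_type = {}                # document id -> predicted type

    def add(self, doc_id, text):
        # Categorize once, at indexing time, and keep the label with the entry.
        self.doc_type[doc_id] = predict_type(text)
        for token in text.lower().split():
            term = token.strip(".,;:()")
            if term:
                self.postings[term].add(doc_id)

    def search(self, term, doc_type=None):
        # Look up the term, then optionally keep only documents of one type.
        hits = self.postings.get(term.lower(), set())
        if doc_type is not None:
            hits = {d for d in hits if self.doc_type[d] == doc_type}
        return sorted(hits)


# Made-up sample documents, for illustration only.
index = TypedIndex()
index.add("d1", "Abstract. We study retrieval in libraries. References follow.")
index.add("d2", "A thesis on retrieval, submitted in partial fulfillment of the degree.")
index.add("d3", "Chapter 1. An introduction to retrieval for general readers.")
```

Here `index.search("retrieval")` matches all three sample documents, while `index.search("retrieval", doc_type="thesis")` narrows the result to `d2`. Because the label is computed once when a document is added, the type filter costs only a dictionary lookup per hit at query time.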