Performance Analysis of Text Document Clustering Technique for Gurmukhi Script using Punjabi Monolingual Text Corpus (ILCI-II) Data-Set

Mukesh Kumar, Amandeep Verma

Abstract

Text document clustering is an unsupervised machine learning technique that seeks to identify homogeneous groups of documents based on their attributes. Each homogeneous group is known as a cluster: its data elements are more similar to one another than to the data elements of other clusters. This paper presents a performance analysis of the proposed hybrid clustering technique for Gurmukhi script, i.e. the “Fuzzy Term Weight based document clustering technique for Gurmukhi script”. The proposed hybrid text document clustering technique is tested on the standard data-set “Punjabi Monolingual text corpus (ILCI-II)” and shows better performance in terms of the creation of meaningful cluster titles and the expected F-score.

An efficient clustering approach must be insensitive to outliers as well as to the order of the input data. It is therefore necessary to define how clusters of similar text documents are formed and to find regularities in them. It is even more important to ascertain how a technique is measured for its performance and its clustering accuracy. In this experimental analysis, the outcomes of the proposed clustering technique and of the existing clustering technique are evaluated using the standard validation measures, namely F-measure, Precision, and Recall.
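
To make the validation measures concrete, the following is a minimal sketch of how Precision, Recall, and F-measure can be computed for a clustering result against reference categories. The text above does not state which variant of the F-measure is used, so this sketch assumes a commonly used class-weighted form: for each reference class, the best F-measure over all clusters is taken and weighted by the class size. The function name `clustering_f_measure` and the sample labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-weighted F-measure of a clustering (an assumed, common variant).

    For every reference class, precision and recall are computed against
    every cluster; the best F-measure per class is weighted by class size.
    """
    n_docs = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    # Number of documents shared by each (class, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_labels))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best_f = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue  # no shared documents, so F is 0 for this pair
            precision = n_both / n_clu  # share of the cluster that is in the class
            recall = n_both / n_cls     # share of the class found in the cluster
            f = 2 * precision * recall / (precision + recall)
            best_f = max(best_f, f)
        total += (n_cls / n_docs) * best_f
    return total


# Hypothetical example: six documents, two reference categories, two clusters.
truth = ["sports", "sports", "sports", "health", "health", "health"]
clusters = [0, 0, 1, 1, 1, 1]
score = clustering_f_measure(truth, clusters)
```

A score of 1.0 means every reference class is reproduced exactly by some cluster. Lower values indicate that classes are split across clusters or merged with other classes.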