Analysis of Similarity Measures with WordNet Based and Enhanced Feature Selection in Text Document Clustering

Venkata Nagaraju Thatha, A. Sudhir Babu, D. Haritha

Abstract

Large volumes of text are now generated from sources such as digitized libraries and the World Wide Web. As this text continues to grow, it must be organized according to the needs of the user, and text document clustering addresses this need by partitioning a large collection into smaller, manageable clusters. Traditional approaches cluster documents using word forms and statistical features, so the documents within a cluster are not necessarily conceptually similar to one another. To overcome this problem, we model document clustering so that documents are grouped together based on the similarity of their concepts, and an ontology is used for that purpose. The proposed model first identifies the concepts present in each document. Using semantic similarity and WordNet, the problems of synonymy and polysemy are handled by identifying the suitable meaning of each word based on its context. In this paper, three popular similarity measures, namely cosine similarity, the Jaccard coefficient, and the Pearson correlation coefficient, are compared using an enhanced k-means algorithm on both frequency count and DFS-EM representations of the documents.

Keywords: similarity, DFS-EM, k-means, ontology, WordNet, semantic similarity.
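
As a brief illustration of the three similarity measures named in the abstract, the following is a minimal sketch using their standard textbook definitions on term-frequency vectors. The function names, the extended (Tanimoto) form of the Jaccard coefficient, and the two toy vectors are assumptions made for illustration only; the sketch does not cover the DFS-EM representation or the enhanced k-means algorithm, and it may differ from the exact formulation used in the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Extended (Tanimoto) Jaccard coefficient for real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        return 0.0  # both vectors are all zeros
    return dot / denom


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two vectors of equal length."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_a * var_b)


# Hypothetical term-frequency vectors over a shared four-term vocabulary.
doc1 = [2, 1, 0, 1]
doc2 = [1, 1, 0, 0]

print(round(cosine_similarity(doc1, doc2), 4))
print(round(jaccard_coefficient(doc1, doc2), 4))
print(round(pearson_correlation(doc1, doc2), 4))
```

Cosine similarity and the Jaccard coefficient lie in [0, 1] for non-negative frequency vectors, while the Pearson correlation coefficient lies in [-1, 1], so a clustering algorithm that expects a distance would need to convert each measure accordingly.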