Similarity Search based on Text Embedding Model for detection of Near Duplicates

Asha Rani Mishra, V.K Panchal, Pawan Kumar

Abstract

Large amounts of information in the form of text data are acquired from various sources and stored for future use. There is an urgent need for methods that extract important information from this data quickly and effectively. One of the challenging issues that arises when integrating data from multiple sources is the detection of near-duplicate text, which hinders efficient retrieval of textual information. Near-duplicate text content leads to inefficient use of resources in terms of storage and time. Similarity measurement in textual data plays an important role in many text-oriented tasks such as information retrieval, text classification, text clustering, and text summarization. In this paper, we evaluate various text similarity measurement techniques for calculating a similarity score between two texts for near-duplicate detection.