Comparative Analysis of Similarity Measures for Extraction of Parallel Data
Abstract
Similarity and distance measures reduce the comparison of two documents or sentences to a single numeric value that indicates their degree of parallelism or their distance from one another. Researchers have used a number of such measures, but their relative effectiveness is not clear. Selecting the right similarity measure is crucial to the performance of translation tasks and to the extraction of parallel data. In this paper, we analyze and compare the performance of four similarity and distance measures. Specifically, we carry out an empirical analysis of Cosine Similarity, Jaccard Coefficient, Hamming Distance, and Euclidean Distance.
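
As a rough illustration of the four measures named above, the following Python sketch computes each one for a pair of tokenized sentences. It is not the experimental setup of this paper: the whitespace tokenization, the term-count vectors over a shared vocabulary, the presence/absence form of the Hamming distance, and all function names are assumptions made only for this example.

```python
import math
from collections import Counter


def vectorize(a_tokens, b_tokens):
    """Build term-count vectors for both token lists over their shared vocabulary."""
    vocab = sorted(set(a_tokens) | set(b_tokens))
    count_a, count_b = Counter(a_tokens), Counter(b_tokens)
    return [count_a[t] for t in vocab], [count_b[t] for t in vocab]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_coefficient(a_tokens, b_tokens):
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(a_tokens), set(b_tokens)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty inputs are treated as identical
    return len(set_a & set_b) / len(union)


def hamming_distance(u, v):
    """Number of vocabulary positions where term presence/absence differs."""
    return sum((x > 0) != (y > 0) for x, y in zip(u, v))


def euclidean_distance(u, v):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical sentence pair, tokenized on whitespace.
sent_a = "the cat sat on the mat".split()
sent_b = "the cat lay on the mat".split()
vec_a, vec_b = vectorize(sent_a, sent_b)

print("cosine   :", cosine_similarity(vec_a, vec_b))
print("jaccard  :", jaccard_coefficient(sent_a, sent_b))
print("hamming  :", hamming_distance(vec_a, vec_b))
print("euclidean:", euclidean_distance(vec_a, vec_b))
```

Note that Cosine Similarity and Jaccard Coefficient grow as the sentences become more alike, whereas Hamming Distance and Euclidean Distance shrink, so the two groups must be read in opposite directions when compared.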