Threshold calculation and its evaluation for finding out similar documents and frequent terms

  • Prafulla B. Bafna, Jatinderkumar R. Saini

Abstract

Choosing the correct threshold for a particular application is significant because it determines the correctness of the accepted results. The proposed approach identifies threshold values for constructing the Document Term Matrix for Marathi (DTMM) and the Document Synset Matrix Marathi (DSMM), and for the cosine measure used to decide the similarity between Marathi text documents. Marathi is an inflectional, free-word-order language, so it is difficult to perform natural language processing (NLP) activities on Marathi text. The corpus considered comprises more than 1300 documents, including nearly 550 prose texts and 750 verses, with nearly 3 lakh (300,000) tokens in total.
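As a rough illustration of the two kinds of threshold mentioned above, the following Python sketch keeps only terms whose corpus frequency reaches a minimum count and then compares two documents by cosine similarity. It is not the authors' implementation: the function names, the raw-count weighting, and the example values `min_count=2` and `threshold=0.5` are assumptions chosen only for illustration, and the input is assumed to be already tokenized.

```python
import math
from collections import Counter


def frequent_terms(tokenized_docs, min_count=2):
    """Return the terms whose total count across the corpus is >= min_count.

    min_count is a hypothetical frequency threshold, not a value from the paper.
    """
    totals = Counter()
    for tokens in tokenized_docs:
        totals.update(tokens)
    return {term for term, count in totals.items() if count >= min_count}


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def are_similar(tokens_a, tokens_b, threshold=0.5):
    """Accept a document pair as similar when the cosine score reaches the threshold.

    threshold=0.5 is an assumed placeholder, not the value reported by the authors.
    """
    return cosine_similarity(tokens_a, tokens_b) >= threshold
```

Raising either threshold makes the corresponding decision stricter: fewer terms enter the matrix, or fewer document pairs are accepted as similar.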

A total of 261 confusion matrices are constructed to choose the right threshold. Evaluation parameters such as accuracy, precision, F1-measure and the Matthews correlation coefficient (MCC), which is considered to be the best for binary classification, are used to demonstrate the efficacy of the thresholds.
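The evaluation measures named above can all be computed from the four cells of a binary confusion matrix. The sketch below uses the standard definitions of those measures and then picks, from a list of candidate thresholds, the one with the highest MCC. The helper names, the choice of MCC as the selection criterion, and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration, not details taken from the paper.

```python
import math


def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN, TN when a pair is predicted similar if score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, F1 and MCC from confusion-matrix counts.

    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "f1": f1, "mcc": mcc}


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold whose confusion matrix gives the highest MCC."""
    return max(
        candidates,
        key=lambda t: evaluate(*confusion_counts(scores, labels, t))["mcc"],
    )
```

Each candidate threshold yields one confusion matrix, so sweeping a grid of candidates produces one matrix per grid point to compare.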

Published
2020-06-06
How to Cite
Prafulla B. Bafna, Jatinderkumar R. Saini. (2020). Threshold calculation and its evaluation for finding out similar documents and frequent terms. International Journal of Advanced Science and Technology, 29(04), 4024 -. Retrieved from http://sersc.org/journals/index.php/IJAST/article/view/24571