Automated Stop-Word List Generation for Dogri Corpus

Sonam Gandotra, Bhavna Arora

Abstract

Stop-word identification and elimination is an important pre-processing step in natural language processing. Stop-words carry no significant information and do not contribute to the informative index of a document. Index terms play a significant role in text-based applications such as information retrieval, question answering, and summarization, and the presence of stop-words makes these index terms harder to identify. Hence, stop-words need to be eliminated. In this paper, a simple frequency-based approach is proposed for identifying stop-words in the Dogri language. The proposed algorithm uses the frequency of occurrence of words. To improve the correctness of retrieval and to minimize the probability of index terms appearing in the generated list, a named-entity list is also used. The resulting stop-word list, presented in the paper, consists of 155 stop-words.

Keywords: Dogri Language, Stop-word, Term-frequency, Statistical Approach, Named-entity, Index terms, Information Retrieval, Frequency-based approach
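
The frequency-based idea summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function name candidate_stop_words, the top_k cut-off, and the placeholder tokens are all assumptions made for illustration, and the abstract does not state how tokens are obtained or how the frequency threshold is chosen. The sketch counts word frequencies, drops any word found in a named-entity list, and keeps the most frequent remaining words as stop-word candidates.

```python
from collections import Counter


def candidate_stop_words(tokens, named_entities, top_k):
    """Return up to top_k most frequent tokens that are not named entities.

    Hypothetical helper: the name, signature, and the use of a fixed
    top_k cut-off are assumptions, not taken from the paper.
    """
    counts = Counter(tokens)
    excluded = set(named_entities)
    # most_common() yields (word, count) pairs sorted by descending count.
    ranked = [word for word, _ in counts.most_common() if word not in excluded]
    return ranked[:top_k]


# Placeholder tokens stand in for a tokenized corpus; "NE1" stands in
# for a frequent named entity that should not be treated as a stop-word.
tokens = ["w1", "w2", "w1", "w3", "w1", "w2", "NE1", "NE1", "NE1", "w4"]
result = candidate_stop_words(tokens, named_entities=["NE1"], top_k=2)
print(result)
```

Filtering against the named-entity list before applying the cut-off keeps frequent proper nouns, which are likely index terms, out of the candidate list.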