Construction of Paraphrasing Dataset for Punjabi: A Deep Learning Approach

Arwinder Singh, Gurpreet Singh Josan

Abstract

Paraphrasing is an emerging area of Natural Language Processing, but it is held back by a lack of parallel corpora. Various paraphrasing datasets are available, but none exists for Punjabi. The aim of this paper is to create a publicly available paraphrasing dataset for Punjabi. We present an automatic approach that collects large-scale phrasal and sentential paraphrases from newspapers that report the same event on the same date. Our hypothesis is that two phrases or sentences can be treated as paraphrases when the distance between their vectors is minimal. Accordingly, headlines and articles are processed with a neural network approach that represents phrases and sentences as vectors, and similarity metrics are applied to these vectors to find similar pairs. Manual evaluation of the collected pairs shows 88% accuracy for phrasal paraphrases and 70% for sentential paraphrases. The proposed approach gathers 1,14,459 phrasal and 75,411 sentential paraphrases. The resulting dataset can be used for paraphrase generation, paraphrase detection, or text summarization.
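As an illustration of the hypothesis that paraphrases lie close together in vector space, the following minimal sketch pairs each item from one newspaper with its nearest item from another and keeps the pair only if the similarity is high enough. The use of cosine similarity, the 0.9 threshold, the function names, and the toy three-dimensional vectors are all assumptions made for this example; the abstract does not specify the metric, the threshold, or the embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_candidate_pairs(source_a, source_b, threshold=0.9):
    """For each text in source_a, keep its most similar text in source_b
    if the similarity reaches the (assumed) threshold."""
    pairs = []
    for text_a, vec_a in source_a.items():
        best_text, best_score = None, -1.0
        for text_b, vec_b in source_b.items():
            score = cosine_similarity(vec_a, vec_b)
            if score > best_score:
                best_text, best_score = text_b, score
        if best_text is not None and best_score >= threshold:
            pairs.append((text_a, best_text, best_score))
    return pairs


# Hypothetical embeddings of same-day headlines from two newspapers.
# In practice these would come from a trained neural model.
paper_a = {"headline A1": [0.9, 0.1, 0.0], "headline A2": [0.0, 0.2, 0.9]}
paper_b = {"headline B1": [0.8, 0.2, 0.1], "headline B2": [0.1, 0.9, 0.1]}

candidates = find_candidate_pairs(paper_a, paper_b, threshold=0.9)
for text_a, text_b, score in candidates:
    print(f"{text_a} <-> {text_b} (cosine = {score:.3f})")
```

With these toy vectors only "headline A1" and "headline B1" are paired; "headline A2" has no sufficiently close counterpart and is dropped. A threshold of this kind trades recall for precision, which is one reason candidate pairs would still need the manual evaluation described above.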