Development of Multilingual Social Media Data Corpus: Development and Evaluation

Fitrah Rumaisa; Halizah Basiron; Zurina Saaya; Noorli Khamis

Abstract

This study manually annotates a corpus for Bahasa Indonesia and Bahasa Melayu. Many researchers have built corpora for both languages, but this research focuses only on words that appear in both vocabularies yet have very different meanings. The data were obtained from social media, so they include informal words. From 2100 words per language, random selection yielded 300 words that share the same form but differ in meaning. The objective of this study is to confirm that this condition can influence sentiment polarity results, and the end of the paper reports how it affects sentiment polarity in the two languages. Following the manual annotation, an annotation agreement test was carried out with three Bahasa Indonesia annotators and three Bahasa Melayu annotators. The annotation shows that 63 of the 300 words (21%) differ in polarity. The agreement scores for each language indicate good agreement among the annotators during the annotation process.
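As an illustration of how agreement among three annotators can be scored, the following is a minimal sketch that assumes Fleiss' kappa as the measure. The abstract does not name the agreement statistic used, so the choice of metric, the function name `fleiss_kappa`, the label set, and the toy ratings are all assumptions made for illustration and are not the study's data or method.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items each labeled by the same number of raters.

    ratings: list of items, each a list of labels (one per rater).
    categories: all labels that raters could assign.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # Count how many raters chose each label, per item.
    counts = [Counter(item) for item in ratings]

    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement by chance, from the overall label proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)

    if p_expected == 1.0:
        # Every rater used the same single label everywhere.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: four words, three annotators each (not from the study).
LABELS = ["positive", "negative", "neutral"]
toy_ratings = [
    ["positive", "positive", "positive"],
    ["negative", "negative", "negative"],
    ["positive", "positive", "negative"],
    ["negative", "neutral", "negative"],
]
kappa = fleiss_kappa(toy_ratings, LABELS)
print(f"Fleiss' kappa: {kappa:.3f}")
```

Under this assumption, the score would be computed once for the three Bahasa Indonesia annotators and once for the three Bahasa Melayu annotators, with values closer to 1 indicating stronger agreement.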