On the Development of Novel Arabic Documents Classifier

Dr. Fouad Jameel Ibrahim Al Azzawi, Dr. Boumedyen Shannaq

Abstract

Document classification is a prominent field in machine learning and data mining. As a major world language, Arabic has received its share of research and development effort. In this work, a new Arabic document classifier is developed and evaluated. The main objective is to improve the efficiency of Arabic document classification. The proposed classifier implements the probability function of inductive construction (CSV) on the basis of the Naïve Bayes approach. An Arabic dataset was collected and filtered from the ‘Corpus of Contemporary Arabic (CCA)’, which contains over 843,000 tokens in 416 documents covering a broad range of 43 categories. Two experiments were designed and executed. The first shows that the Naïve Bayes classifier outperforms the seven other classifiers used by researchers in the literature. The second shows that the updated approach (the proposed classifier) outperforms the Naïve Bayes classifier, with 40% correctly classified instances and the optimal time taken to build the model. The proposed approach improves on the Naïve Bayes classifier to reach a 93% average classification accuracy rate. It was also found that the accuracy of the Naïve Bayes classifier remains stable without text preprocessing. The results are of great benefit for improving the classification of Arabic corpora and search engine performance.
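For readers unfamiliar with the baseline, the following is a minimal sketch of a standard multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. It illustrates only the baseline on which the proposed classifier is built, not the proposed CSV-based method itself. The function names `train_naive_bayes` and `classify`, the two category labels, and the four toy documents are assumptions made for illustration; they are not taken from the CCA or from the experiments reported here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Fit a multinomial Naive Bayes model.

    docs: list of (tokens, label) pairs, where tokens is a list of strings.
    """
    class_counts = Counter()             # number of documents per class
    token_counts = defaultdict(Counter)  # per-class token frequencies
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_counts": class_counts,
        "token_counts": token_counts,
        "vocab": vocab,
        "n_docs": len(docs),
    }


def classify(model, tokens):
    """Return the class with the highest log-posterior score."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_class in model["class_counts"].items():
        # Log prior: share of training documents in this class.
        score = math.log(n_class / model["n_docs"])
        total = sum(model["token_counts"][label].values())
        for token in tokens:
            if token not in model["vocab"]:
                continue  # tokens never seen in training are ignored
            count = model["token_counts"][label][token]
            # Add-one smoothing keeps unseen class/token pairs non-zero.
            score += math.log((count + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus (illustrative only, not drawn from the CCA).
toy_docs = [
    (["كرة", "فريق", "مباراة"], "sports"),
    (["فريق", "لاعب", "كرة"], "sports"),
    (["سوق", "بنك", "اقتصاد"], "economy"),
    (["بنك", "سعر", "سوق"], "economy"),
]
model = train_naive_bayes(toy_docs)
predicted = classify(model, ["فريق", "مباراة"])
```

In this sketch the tokens are used as they appear, with no stemming or stop-word removal, so it corresponds to the setting without text preprocessing.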