Credibility Analysis of Online Information Contents Using Ensemble Methods: Machine Learning Approaches

Farhin Bano, Priyanka Meel

Abstract

News on social media has two sides, like a coin. On one side, it is easy to access and spreads quickly; on the other, the sheer volume of news articles makes the information in them difficult to verify. Anyone can spread propaganda through the internet, leading readers to believe that what they are reading is fact rather than misinformation or disinformation. During natural calamities or disasters, such fake news further fuels the anger and anxiety of the public. Three aspects of fake news need to be considered in order to assess its accuracy: the semantics of the spurious news, the rate at which it spreads (i.e., the network undercurrents), and the section of people affected by it. To tackle this problem, we propose an architecture built on ensemble learning models: bagging methods, comprising the Random Forest Classifier and Extra Tree Classifier; boosting methods, comprising AdaBoost and Gradient Boosting; and voting methods, covering hard and soft voting, in which we combine three classifiers: Naïve Bayes, Random Forest, and Extra Tree. Apart from the ensemble learning approaches, we also test two standalone classifiers: the Decision Tree Classifier and the Naïve Bayes Classifier. We use three data corpora in our experiments: Kaggle’s Fake News Detection and real_or_fake datasets, along with all_data. We run each classifier on each dataset and compare their accuracies in classifying news articles. The best classifier differs by dataset: the AdaBoost classifier achieved an accuracy of 97.67% on the Fake News Detection dataset, whereas the Voting classifier (with soft voting) achieved accuracies of 90.68% and 91.81% on the real_or_fake and all_data datasets, respectively.
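
The difference between the hard and soft voting schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `hard_vote` and `soft_vote`, the class labels, and the probability values are all hypothetical and chosen only to show how the two schemes can disagree on the same article.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Hard voting: return the label predicted by the most classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(class_probabilities):
    """Soft voting: average per-class probabilities, return the top class."""
    n = len(class_probabilities)
    classes = class_probabilities[0].keys()
    averaged = {c: sum(p[c] for p in class_probabilities) / n for c in classes}
    return max(averaged, key=averaged.get)


# Hypothetical probability outputs of three base classifiers for one article
# (illustrative values only, not taken from the paper).
probabilities = [
    {"fake": 0.45, "real": 0.55},  # e.g. Naive Bayes
    {"fake": 0.40, "real": 0.60},  # e.g. Random Forest
    {"fake": 0.95, "real": 0.05},  # e.g. Extra Trees
]

# Each classifier's hard prediction is its most probable class.
labels = [max(p, key=p.get) for p in probabilities]

hard_result = hard_vote(labels)          # two of three classifiers say "real"
soft_result = soft_vote(probabilities)   # one very confident vote shifts the mean
```

In this made-up case, hard voting follows the two-to-one majority, while soft voting lets a single highly confident classifier outweigh two weakly confident ones. This is one plausible reason a soft-voting ensemble can behave differently from a hard-voting ensemble built from the same base classifiers.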