Data Collision of incomplete unlabelled ambiguous big data investigation

G. Senthil Velan, D. Syed Ali, T. Yuvarani, P. Dineshkumar, C. Tamilselvi

Abstract

The internet is now flooded with data. Massive volumes of data from social media platforms such as
Facebook and Twitter, along with large numbers of financial reports, are collected every day, but
there is no assurance that this data is free from noise. Such data is often incomplete and
inconsistent, which leads to uncertainty, and as the amount of data grows, so does the uncertainty
arising from it. This uncertainty may arise around any of the five big data characteristics
(the 5 V’s, namely variability, viscosity, validity, viability, and volume). Recent
artificial intelligence techniques such as machine learning, natural language processing, and
computational intelligence offer more efficient and scalable analysis than traditional big data
analytic techniques, but they too suffer from uncertainty in data sets. To overcome this, a
summary statistics method is followed, which calculates the mean and variance of the distributed
sample data, thereby reducing the uncertainty caused by large amounts of vague data. A
distribution-based decision tree method is also applied, which reduces the problem of uncertainty to a
hug