Design and Implementation of Web Scraping Platform Using Spring Framework Based on Distributed Hadoop Ecosystem

Jung Sang Yoo; Myeong Ho Lee

Abstract

Background/Objectives: The recent rapid growth of general and informal data, together with the speed at which data are generated in every field, has greatly affected how data are utilized. Demand is increasing for the use of Big Data in important organizational decisions, by interpreting the various patterns in large volumes of heterogeneous data and predicting the future.

Methods/Statistical analysis: It is also necessary to provide services quickly on the basis of up-to-date information. Most research, however, has focused on the analysis of data and has paid insufficient attention to its collection, loading, and processing.

Findings: This research therefore collects data matching search keywords through the Spring Framework, using next-generation web standards, and through Web scraping based on the Hadoop 2.0 Ecosystem, and loads the collected data onto the Hadoop Distributed File System (HDFS) and HBase. It then designs and implements a Big Data utilization system that extracts contents and nouns from the loaded data with a Twitter morpheme analyzer and visualizes, as a word cloud, the results of the keyword, title, contents, and morpheme analysis.

Improvements/Applications: This research aims to provide a platform reference model, applicable to enterprise groupware, that combines the Distributed Hadoop Ecosystem with the Spring Framework under next-generation web standards.

Keywords: Big Data, Spring Framework, Web Scraping, HDFS, HBase, Distributed Hadoop Ecosystem.