Implementation of a Dynamic Crawling System for Extracting News Articles in a Korean Portal Site

Menghok Heak; Sang-Hyun Choi

Abstract

The main purpose of this study is to implement a system that automatically crawls Korean news data from various sources through a Korean portal site, Naver News. Because the volume of internet news in Korea is large and its sources are varied, collecting such data manually is very challenging. A crawling system, which collects data automatically, is a more effective way to gather it. This paper describes the processes and related tools for implementing a web-based crawling system that collects news data automatically from various sources through the Naver website. The application is developed in the Java programming language as a web application on the Spring Framework and is divided into three main components, Data Crawler, Data Processing, and Data Manipulation, each responsible for a different task. The Data Crawler, the major component of the application, is implemented with the jsoup crawling library, which is also described in this paper. The application outputs downloadable CSV files of two kinds, summary content and full content, which can be used for further research. The system helps researchers collect data for their research effectively and with minimal time cost.
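To make the two output kinds concrete, the following is a minimal sketch of how crawled articles might be serialized into a summary-content file and a full-content file. It is an illustration under stated assumptions, not the system's actual code: the class `NewsCsvExport`, the `Article` holder and its fields, the column names, and the choice of RFC 4180 quoting are all hypothetical, and the crawling step itself is omitted.

```java
import java.util.List;
import java.util.stream.Collectors;

public class NewsCsvExport {

    /** Hypothetical article holder; the field names are assumptions. */
    static final class Article {
        final String title;
        final String source;
        final String summary;
        final String fullText;

        Article(String title, String source, String summary, String fullText) {
            this.title = title;
            this.source = source;
            this.summary = summary;
            this.fullText = fullText;
        }
    }

    /** Quotes a field (RFC 4180 style) when it contains a comma, quote, or line break. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Builds the CSV text; includeFullText switches between the two output kinds. */
    static String toCsv(List<Article> articles, boolean includeFullText) {
        // Assumed column layout: title, source, then either summary or full content.
        String header = "title,source," + (includeFullText ? "content" : "summary");
        String rows = articles.stream()
                .map(a -> String.join(",",
                        escape(a.title),
                        escape(a.source),
                        escape(includeFullText ? a.fullText : a.summary)))
                .collect(Collectors.joining("\r\n"));
        return rows.isEmpty() ? header + "\r\n" : header + "\r\n" + rows + "\r\n";
    }

    public static void main(String[] args) {
        List<Article> articles = List.of(
                new Article("Sample headline, with comma", "Example Daily",
                        "Short summary.", "Full body with a \"quoted\" phrase."));
        System.out.print(toCsv(articles, false)); // summary-content file
        System.out.print(toCsv(articles, true));  // full-content file
    }
}
```

Sharing one writer between both files keeps the quoting rules identical, so a headline containing a comma or a body containing quotation marks is handled the same way in either output.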