Developed non-text contents automatic extractor for effective utilization and sharing of national R&D reports

  • Il-Kwon Lim
  • Kwang-Nam Choi
  • Ji-Seung Son
  • YongJu Shin

Abstract

At KISTI, R&D reports, which are the results of national R&D projects, are collected from the project management (specialized) agencies with clearance from the ministry, built into a high-quality database, and made available to researchers. For database construction, the collected reports are converted from standardized PDF to XML format. In addition, non-text contents must be collected to support a supplementary table and figure service for the R&D reports. Accordingly, we developed a system that automatically extracts the titles and contents of non-text items such as tables and figures during conversion to XML.

When non-text contents are collected, extraction is driven mainly by the captions and objects that correspond to tables and figures. In practice, because each report is processed from the cover to the end, a large number of unwanted tables and figures are extracted, and these require inspection and supplementation as additional work. To improve the system, we restricted the extraction range for non-text contents using the report's table of contents for tables and figures together with the captions in the report. We also developed an API that makes it more convenient to check and supplement the extracted contents and allows the contents to be shared and used in other systems.

Apart from reports whose tables and figures are of poor technical quality, the amount of non-text content extracted from reports with many attachments at the end of the appendix dropped significantly, which reduces the effort required for database construction. The API will first be applied to the search results on the homepage of NTIS (National Science & Technology Information Service), and will later be used in NDSL (National Digital Science Library) and by external organizations.
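
The idea of restricting extraction to items named in the report's list of tables and figures can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the list entries and the extracted captions are already available as plain strings, and that captions use English labels such as "Table 1" or "Figure 2-1" (real reports may use other labels). The names `LABEL_RE`, `caption_label`, and `filter_by_listing` are hypothetical.

```python
import re

# Hypothetical caption pattern: "Table 3", "Figure 2-1", "Fig. 4.2", etc.
LABEL_RE = re.compile(
    r"^\s*(table|figure|fig\.?)\s*(\d+(?:[-.]\d+)*)", re.IGNORECASE
)


def caption_label(caption):
    """Return a normalized (kind, number) label for a caption, or None."""
    m = LABEL_RE.match(caption)
    if m is None:
        return None
    kind = "table" if m.group(1).lower() == "table" else "figure"
    return (kind, m.group(2))


def filter_by_listing(extracted, listing):
    """Keep only extracted items whose caption label appears in the listing.

    extracted: list of dicts, each with a "caption" string (assumed shape).
    listing:   entries from the list of tables and figures, as strings.
    """
    allowed = {caption_label(entry) for entry in listing}
    allowed.discard(None)  # ignore listing lines that carry no label
    return [
        item for item in extracted
        if caption_label(item["caption"]) in allowed
    ]
```

Under these assumptions, items with no recognizable caption label, or with a label that does not appear in the list of tables and figures (for example, forms attached in an appendix), would be dropped before inspection.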

Published
2019-09-27
How to Cite
Lim, I.-K., Choi, K.-N., Son, J.-S., & Shin, Y. (2019). Developed non-text contents automatic extractor for effective utilization and sharing of national R&D reports. International Journal of Advanced Science and Technology, 28(3), 08 - 16. Retrieved from http://sersc.org/journals/index.php/IJAST/article/view/258
Section
Articles