A Comparison of HDFS File Formats: Avro, Parquet and ORC

  • Ashok Shivayogappa, Supreeth S.

Abstract

Hadoop is one of the standard platforms for storing and managing Big Data in distributed systems. However, a shortage of developers skilled in writing MapReduce programs has pushed the adoption of SQL-based query systems in the Hadoop ecosystem, in an attempt to reuse the skills built around traditional relational database systems, especially for business processes and intelligent analytics. On top of the Hadoop environment, a new framework, Apache Hive, has emerged as the standard data warehouse engine. This has challenged industry-leading developers to work on improvements in both query execution and data storage paradigms.

In this work, structured file formats such as Avro, Parquet, and ORC are compared with text file formats to evaluate storage optimization and database query performance. Various kinds of query patterns have been evaluated. The results show that the ORC and Parquet file formats take up less storage space than the Avro and text file formats, because of the binary data formats and compression techniques they use. Furthermore, aggregate queries on ORC and Parquet data are quicker than on Avro or text files, because the former two formats are well suited to column-based queries.
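
The column-based advantage for aggregate queries can be illustrated with a minimal sketch. It is a toy model rather than the study's benchmark: the sample table, the function names, and the "values read" counter are all assumptions made for illustration, and real Avro, Parquet, and ORC files add encoding, compression, and metadata that are not modeled here.

```python
# Toy illustration only: real Avro/Parquet/ORC files add encoding,
# compression, and metadata that are not modeled here.

# Hypothetical sample table with three columns (not data from the study).
ROWS = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Delhi", "amount": 80},
    {"id": 3, "city": "Pune", "amount": 50},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_layout(rows, column):
    """Sum one column in a row layout; every field of every row is read."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)  # the whole record is touched
        total += row[column]
    return total, values_read


def sum_column_layout(columns, column):
    """Sum one column in a columnar layout; only that column is read."""
    values = columns[column]
    return sum(values), len(values)


row_total, row_reads = sum_row_layout(ROWS, "amount")
col_total, col_reads = sum_column_layout(to_columns(ROWS), "amount")
print(row_total, row_reads)  # 250 9
print(col_total, col_reads)  # 250 3
```

Both layouts return the same sum, but the columnar layout touches only the values of the aggregated column, which is the intuition behind the faster aggregate queries reported for ORC and Parquet.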

Published
2020-06-06
How to Cite
Ashok Shivayogappa, Supreeth S. (2020). A Comparison of HDFS File Formats: Avro, Parquet and ORC. International Journal of Advanced Science and Technology, 29(04), 4665 -. Retrieved from http://sersc.org/journals/index.php/IJAST/article/view/24879