Comparison of various HDFS file formats

I’ve been working with Big data since 2017. As I’m from data warehousing background, it was easier for me to understand what’s what, and build an analogy between DWH & big data frameworks. However, the various file formats used in HDFS always caught me off guard.

In DWH, I never considered how the files are stored in DB, it’s managed by the database, maybe DBA might know how its done at the backend, as a DB developer it never bothered me. However, in HDFS we have several types of formats to choose from - Avro, ORC & Parquet. The best way to understand something is to spend time with it. So, decided to see how the different files behave for same data.

To see the performance of the file formats, I decided to load the same data in different formats. Identifying the dataset to work with took lots of time. While looking for the data came across a wonderful article to discover free dataset for data science related purpose in Dataquest’s blog. I was looking for a csv file with approximately 1 GB of size. I couldn’t finalize on the dataset and then my friend suggested movie lens data. Size wise its not close to my requirement, but 500+ MB is good enough.

We got the data, next ingredient to the dish would be HDFS cluster - AWS EMR. The biggest fear in choosing AWS paid service is the fear of overwhelming billing by the book seller. Hiding the fear, took a smallest possible EMR cluster1. With all the queries2 made ready before spinning up the cluster, I did went ahead with the operation.

All things set, after bootstrapping, the EMR is ready to be connected. And then, connection time out error embraced me while SSHing from EC2. After resolving the issue3, the entire usage of EMR went for 28 mins.

Observations

File TypeFile Size (in MB)Time taken to build table (in secs)Time taken to calculate count() (in secs)Time taken to SELECT sample records4 (in secs)
CSV508.720.27117.0610.085
Avro489.854.64225.3560.257
ORC129.6109.974.2860.075
Parquet307.758.87317.7090.091

*Note:* The entire metrics are obtained while executing queries on Hive with tez engine.


Footnotes 1. 1 node of Master & Core is selected with Spot instead of On Demand instance. Below are the cluster details.

MasterCoreSpot Pricing
m5.xlargem5.xlarge$0.064 per Hr

2. Queries are available for reference in Git repo -

3. In the security group of EMR, inbound SSH connections were blocked. On allowing SSH, the connection is established.

4. Sample data for SELECT is limited to 11 records.

Page source

Page last updated on: 2025-01-17 00:23:28 +0530 +0530
Git commit: 734c6b0