I’ve been working with big data since 2017. Coming from a data warehousing background, it was easy for me to understand what’s what and draw analogies between DWH concepts and big data frameworks. However, the various file formats used in HDFS always caught me off guard.
In DWH, I never had to consider how files are stored in the database; that’s managed by the database itself. A DBA might know how it’s done at the backend, but as a DB developer it never bothered me. In HDFS, however, we have several formats to choose from: Avro, ORC, and Parquet. The best way to understand something is to spend time with it, so I decided to see how the different formats behave for the same data.