Thursday, March 06, 2025

Data formats: Apache Parquet vs ORC vs Avro

Why Parquet vs. ORC: An In-depth Comparison of File Formats | by Ankush Singh | Medium

Apache Parquet is a columnar storage file format available to any project in the Hadoop ecosystem. It’s designed for efficiency and performance, and it’s particularly well-suited for running complex queries on large datasets.

Parquet is an excellent choice when dealing with large, complex, and nested data structures, especially for read-heavy workloads or when you want to perform analytics using tools like Apache Spark or Apache Arrow. Its columnar storage approach also makes it a natural fit for data warehousing solutions where aggregation queries are common.
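
As a minimal sketch of that columnar, read-optimized behaviour, here is a small example using the pyarrow library (the file name and column names are invented for illustration):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small in-memory table and persist it as Parquet.
    table = pa.table({
        "user_id": [1, 2, 3],
        "country": ["DE", "US", "IT"],
        "revenue": [10.5, 3.2, 7.8],
    })
    pq.write_table(table, "events.parquet")

    # The columnar layout pays off on reads: fetch only the columns
    # an aggregation query actually needs, skipping the rest on disk.
    subset = pq.read_table("events.parquet", columns=["country", "revenue"])
    print(subset.to_pydict())

Column pruning like this is one reason aggregation-heavy warehouse queries tend to run well on Parquet.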

Parquet

Apache Parquet - Wikipedia

GitHub - apache/parquet-format: Apache Parquet Format


ORC (Optimized Row Columnar) is another popular file format in the Hadoop ecosystem. It's a self-describing, type-aware columnar file format designed for Hadoop workloads.

ORC is commonly used where high-speed writing is necessary, particularly with Hive-based frameworks. It is also a good fit when your use case requires data modifications (updates and deletes), since it underpins Hive's ACID transaction support. Lastly, ORC is a good choice when working with complex and nested data types.
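
As a minimal sketch of basic ORC read/write, again using pyarrow (note that the ACID update/delete support mentioned above lives in Hive transactional tables, not at this file level; names here are invented for illustration):

    import pyarrow as pa
    from pyarrow import orc

    # Build a small table and persist it as ORC.
    table = pa.table({
        "user_id": [1, 2, 3],
        "status": ["active", "churned", "active"],
    })
    orc.write_table(table, "users.orc")

    # ORC files are self-describing: the schema and type information
    # are recovered from the file itself on read.
    print(orc.read_table("users.orc").schema)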

While Parquet is a columnar data format, Avro is often used for row-level storage; besides storing data efficiently in a binary format, it also embeds a data schema.

Apache Avro - Wikipedia

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded.


Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.
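
A minimal sketch of that self-describing behaviour, using the fastavro library (the record schema and field names are invented for illustration):

    from io import BytesIO
    from fastavro import writer, reader, parse_schema

    # Avro schemas are declared in JSON; this one is hypothetical.
    schema = parse_schema({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "email", "type": "string"},
        ],
    })

    records = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "b@example.com"},
    ]

    # The writer embeds the schema in the container file, so each
    # record carries no per-value schema overhead.
    buf = BytesIO()
    writer(buf, schema, records)

    # The reader recovers the schema from the data itself; no
    # external schema argument is required.
    buf.seek(0)
    for record in reader(buf):
        print(record)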