DraganSr: 2025-03-30

Sunday, March 30, 2025

data: RecordIO-Protobuf vs Parquet

when processing large amounts of data, for example for ML/AL
it is important to store data in a efficient and compact format.

JSON is not efficient, and CSV is also quite inefficient.

Parquet is a columnar format,
RecordIO-protobuf is used for binary record-level serialization.

amazon web services - Parquet vs. RecordIO - Stack Overflow

Parquet is great for analytics data due to its small file size and allows you to scan only the columns of interest.

RecordIO format is typically used for training machine learning models so that the data that the model needs is presented only when needed.

Leveraging RecordIO for Efficient Training and Cost Reduction in Amazon SageMaker | LinkedIn

RecordIO: It’s a streaming data format that organizes a file as a series of length-prefixed binary records. The “length-prefix” means that every chunk of data is preceded by 4 or 8 bytes that tell you how long the next record is.

Protobuf is a serialization format (like JSON or CSV but binary).
Protobuf encodes structured data (like Python objects) into an efficient, compact binary format.

RecordIO-wrapped Protobuf means you have a file where each Protobuf-encoded message is wrapped with a size prefix (via RecordIO). Each record starts with a 4-byte integer indicating its size, followed by the actual Protobuf message.

Common Data Formats for Training - Amazon SageMaker AI

Differences between LibSVM and RecordIO/Protobuf in the context of Machine Learning | by Manasa | Medium

Apache Mesos - RecordIO Data Format

xeno14/recordio: multiple Protocol Buffers in a single binary file: forked from google/or-tools @GitHub

eclesh/recordio: recordio implements a file format for a sequence of records in Go, @GitHub

Using Protobuf with TypeScript for data serialization - LogRocket Blog