When processing large amounts of data, for example for ML/AI,
it is important to store data in an efficient and compact format.
JSON is not efficient, and CSV is also quite inefficient: both are text formats, so they are bulky to store and slow to parse.
- Parquet is a columnar storage format.
- RecordIO-protobuf is a binary, record-level serialization format.
Parquet is great for analytics data: its files are small, and a query can scan only the columns of interest instead of reading every full row.
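To make the "scan only the columns of interest" point concrete, here is a toy sketch of row-oriented versus column-oriented layout in plain Python. It is not the Parquet format itself, and the field names (`user_id`, `country`, `amount`) are made up for illustration.

```python
# Toy illustration of row-oriented vs column-oriented layout.
# This is NOT Parquet; the field names are hypothetical.
rows = [
    {"user_id": 1, "country": "DE", "amount": 9.5},
    {"user_id": 2, "country": "US", "amount": 20.0},
    {"user_id": 3, "country": "DE", "amount": 0.5},
]

# Column-oriented: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_amount_row_layout(rows):
    # Every full record is visited even though only one field is used.
    return sum(row["amount"] for row in rows)


def total_amount_column_layout(columns):
    # Only the "amount" column is touched; the other columns are never read.
    return sum(columns["amount"])
```

Both functions return the same total; the difference is how much data has to be touched to get there, which is what matters once the columns live on disk.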
RecordIO format is typically used for training machine learning models: records can be streamed to the model as they are needed, rather than loading the whole dataset up front.
Leveraging RecordIO for Efficient Training and Cost Reduction in Amazon SageMaker | LinkedIn
RecordIO is a streaming data format that organizes a file as a series of length-prefixed binary records. “Length-prefixed” means that every record is preceded by 4 or 8 bytes that tell you how long the record that follows is.
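A minimal sketch of length-prefixed framing, using only the standard library. It assumes a 4-byte little-endian unsigned length; real RecordIO variants may choose a different width or byte order, or add further header fields. The function names are illustrative, not taken from any RecordIO library.

```python
import io
import struct

# Assumption: 4-byte little-endian unsigned length prefix.
LENGTH_PREFIX = struct.Struct("<I")


def write_records(stream, records):
    """Write each payload preceded by its length."""
    for payload in records:
        stream.write(LENGTH_PREFIX.pack(len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield payloads one at a time, without loading the whole stream."""
    while True:
        header = stream.read(LENGTH_PREFIX.size)
        if not header:
            return  # clean end of stream
        if len(header) < LENGTH_PREFIX.size:
            raise ValueError("truncated length prefix")
        (length,) = LENGTH_PREFIX.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


buffer = io.BytesIO()
write_records(buffer, [b"first", b"", b"third record"])
buffer.seek(0)
decoded = list(read_records(buffer))
```

Because the reader only needs the prefix to know where a record ends, it can hand records to a consumer one by one, which is what makes the format suitable for streaming.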
Protobuf is a serialization format (like JSON or CSV but binary).
Protobuf encodes structured data (like Python objects) into an efficient, compact binary format.
RecordIO-wrapped Protobuf means a file in which each Protobuf-encoded message is wrapped with a size prefix (via RecordIO). In the variant described here, each record starts with a 4-byte integer indicating its size, followed by the actual Protobuf message.
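The wrapping can be sketched by hand-encoding a tiny Protobuf message and putting a 4-byte size in front of it. Several things here are assumptions for illustration: the message schema (`Example` with `id = 1` and `label = 2`) is hypothetical, the prefix is taken to be little-endian, and real code would normally use classes generated by `protoc` rather than encoding bytes manually. Some RecordIO variants also add header fields beyond the length.

```python
import struct


def encode_varint(value):
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_example(example_id, label):
    # Hypothetical schema: message Example { uint64 id = 1; string label = 2; }
    # Each field starts with a tag byte: (field_number << 3) | wire_type.
    label_bytes = label.encode("utf-8")
    return (
        bytes([(1 << 3) | 0]) + encode_varint(example_id)      # wire type 0: varint
        + bytes([(2 << 3) | 2]) + encode_varint(len(label_bytes)) + label_bytes  # wire type 2: length-delimited
    )


def frame(message):
    # Assumption: 4-byte little-endian size prefix, then the message bytes.
    return struct.pack("<I", len(message)) + message


message = encode_example(150, "cat")
record = frame(message)
```

Here the message is 8 bytes and the framed record is 12 bytes, compared with 26 bytes for the compact JSON text `{"id":150,"label":"cat"}`.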
Apache Mesos - RecordIO Data Format
xeno14/recordio: multiple Protocol Buffers in a single binary file: forked from google/or-tools @GitHub
eclesh/recordio: recordio implements a file format for a sequence of records in Go, @GitHub
Using Protobuf with TypeScript for data serialization - LogRocket Blog