DraganSr: data formats: CSV vs Apache Parquet

Wednesday, March 22, 2023

data formats: CSV vs Apache Parquet

Different types of data formats CSV, Parquet, and Feather | by Som | MLearning.ai | Medium

Parquet is a lightweight format for saving data frames. It uses efficient compression and encoding schemes for fast data storage and retrieval. With "gzip" compression (for storage), Parquet is slightly faster to export than plain .csv, and much faster if the CSV also needs to be zipped. Importing is about 2x faster than CSV. The compressed file is around 22% of the original file size, which is about the same as a zipped CSV file.

The Feather format is more efficient than Parquet for data retrieval. Although it takes up more space than Parquet, storing data in Feather ensures efficient retrieval.

Apache Parquet - Wikipedia

Apache Parquet is comparable to RCFile and Optimized Row Columnar (ORC) file formats — all three fall under the category of columnar data storage within the Hadoop ecosystem. They all have better compression and encoding with improved read performance at the cost of slower writes. In addition to these features, Apache Parquet supports limited schema evolution, i.e., the schema can be modified according to the changes in the data. It also provides the ability to add new columns and merge schemas that do not conflict.

Apache Arrow is designed as an in-memory complement to on-disk columnar formats like Parquet and ORC. The Arrow and Parquet projects include libraries that allow for reading and writing between the two formats.

apache/parquet-format: Apache Parquet @GitHub

Java

Feather File Format — Apache Arrow v11.0.0

Feather is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the Arrow IPC format internally. Feather was created early in the Arrow project as a proof of concept for fast, language-agnostic data frame storage for Python (pandas) and R.

parquetjs - npm: JavaScript

This package contains a fully asynchronous, pure JavaScript implementation of the Parquet file format. The implementation conforms with the Parquet specification and is tested for compatibility with Apache's Java reference implementation.

What is Parquet?: Parquet is a column-oriented file format; it allows you to write a large amount of structured data to a file, compress it and then read parts of it back out efficiently. The Parquet format is based on Google's Dremel paper.

GoLang

parquet package - github.com/segmentio/parquet-go - Go Packages

parquet package - github.com/apache/arrow/go/parquet - Go Packages

segmentio/parquet-go: Go library to read/write Parquet files

xitongsys/parquet-go: pure golang library for reading/writing parquet file @GitHub
