Sunday, December 14, 2025

parquet data format, libs, databases

a very efficient and widely supported way to store data, in particular for analytics and columnar access patterns
(repeated values within a column compress and encode well)

Parquet

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.

Apache Parquet - Wikipedia

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem inspired by Google Dremel interactive ad-hoc query system for analysis of read-only nested data. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides data compression and encoding schemes with enhanced performance to handle complex data in bulk.


Reading and Writing the Apache Parquet Format — Apache Arrow v22.0.0

The Apache Parquet project provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop with systems like Apache Drill, Apache Hive, Apache Impala, and Apache Spark adopting it as a shared standard for high performance data IO.

Apache Arrow is an ideal in-memory transport layer for data that is being read or written with Parquet files. We have been concurrently developing the C++ implementation of Apache Parquet, which includes a native, multithreaded C++ adapter to and from in-memory Arrow data. PyArrow includes Python bindings to this code, which thus enables reading and writing Parquet files with pandas as well.

TypeScript/JavaScript

the lib itself is not large, but its dependencies might be?


"a fully asynchronous, pure JavaScript implementation of the Parquet file format. The implementation conforms with the Parquet specification and is tested for compatibility with Apache's Java reference implementation.

What is Parquet?: Parquet is a column-oriented file format; it allows you to write a large amount of structured data to a file, compress it and then read parts of it back out efficiently. The Parquet format is based on Google's Dremel paper.
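
a minimal write-then-read round trip in the style of the parquetjs README (schema and file name are illustrative; the maintained @dsnp/parquetjs fork keeps essentially the same API):

    // sketch following the parquetjs README; npm install parquetjs
    const parquet = require('parquetjs');

    async function main() {
      // declare a schema: each field maps to one Parquet column
      const schema = new parquet.ParquetSchema({
        name: { type: 'UTF8' },
        quantity: { type: 'INT64' },
        price: { type: 'DOUBLE' },
      });

      // write a few rows; rows are buffered and flushed as row groups
      const writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
      await writer.appendRow({ name: 'apple', quantity: 10, price: 2.5 });
      await writer.appendRow({ name: 'banana', quantity: 20, price: 1.5 });
      await writer.close();

      // read records back with a cursor
      const reader = await parquet.ParquetReader.openFile('fruits.parquet');
      const cursor = reader.getCursor();
      let record;
      while ((record = await cursor.next())) {
        console.log(record);
      }
      await reader.close();
    }

    main().catch(console.error);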

duckdb - npm

"This package provides a Node.js API for DuckDB, the "SQLite for Analytics". The API for this client is somewhat compliant to the SQLite Node.js client for easier transition (and transition you must eventually)."


Python


parquet-python is a pure-python implementation (currently with only read-support) of the parquet format. It comes with a script for reading parquet files and outputting the data to stdout as JSON or TSV (without the overhead of JVM startup). Performance has not yet been optimized, but it’s useful for debugging and quick viewing of data in files.


AI workflows with Temporal

a tool for code-defined workflows, with SDKs in many languages (TypeScript, Python, Go, C#, Ruby...)

limited visualization options, unlimited capabilities;
the "opposite" of visual-workflow tools like n8n

can integrate with AI tools and agents
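
a minimal sketch of the TypeScript SDK's workflow/activity split (the summarize activity here is hypothetical, standing in for an LLM call; timeouts and retry policy are illustrative):

    // activities.ts — plain functions; non-deterministic work (LLM/API calls) lives here
    export async function summarize(text: string): Promise<string> {
      // call your LLM of choice here; stubbed for the sketch
      return text.slice(0, 100);
    }

    // workflows.ts — deterministic orchestration code, replayed by Temporal after failures
    import { proxyActivities } from '@temporalio/workflow';
    import type * as activities from './activities';

    const { summarize } = proxyActivities<typeof activities>({
      startToCloseTimeout: '2 minutes',
      retry: { maximumAttempts: 3 },
    });

    export async function summarizeWorkflow(text: string): Promise<string> {
      // each activity result is durably recorded; a crashed worker resumes from history
      return await summarize(text);
    }

a Worker process registers the workflow and activities, and a Client starts summarizeWorkflow; because every step is persisted, retries and long waits survive process restarts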

Temporal for AI | Temporal

temporal.io @GitHub




Temporal - YouTube

web UI rewritten from React to Svelte, to improve performance and simplicity