parquet-wasm - npm
From Claude.ai
JavaScript/TypeScript Parquet Libraries Comparison
| Feature | parquet-wasm | hyparquet | @dsnp/parquetjs | parquetjs |
|---|---|---|---|---|
| Version | 0.7.1 | 1.23.3 | 1.8.7 | 0.11.2 |
| Last Updated | 2 months ago | 1 week ago | 6 months ago | >1 year ago |
| Dependencies | 0 | 0 | 13 | 8 |
| Bundle Size | ~1.2MB (WASM) | ~189KB | ~6.9MB | ~219KB |
| Implementation | Rust/WASM | Pure JS | Pure JS/TS | Pure JS |
| TypeScript | ✅ Native | ✅ Native | ✅ Native | ❌ (types available) |
Read/Write Support
| Capability | parquet-wasm | hyparquet | @dsnp/parquetjs | parquetjs |
|---|---|---|---|---|
| Read | ✅ | ✅ | ✅ | ✅ |
| Write | ✅ | ✅ (separate pkg) | ✅ | ✅ |
| Streaming Read | ✅ | ✅ | ✅ | ✅ |
| Remote/HTTP | ✅ | ✅ (range requests) | ✅ | ✅ |
| S3 Direct | ❌ | ❌ | ✅ | ❌ |
Compression Codecs
| Codec | parquet-wasm | hyparquet | @dsnp/parquetjs | parquetjs |
|---|---|---|---|---|
| Uncompressed | ✅ | ✅ | ✅ | ✅ |
| Snappy | ✅ | ✅ | ✅ | ✅ |
| GZIP | ✅ | ➕ (addon) | ✅ | ✅ |
| Brotli | ✅ | ➕ (addon) | ✅ | ✅ |
| ZSTD | ✅ | ➕ (addon) | ❌ | ❌ |
| LZ4 | ✅ | ➕ (addon) | ❌ | ❌ |
| LZO | ✅ | ➕ (addon) | ❌ | ✅ |
Environment Support
| Environment | parquet-wasm | hyparquet | @dsnp/parquetjs | parquetjs |
|---|---|---|---|---|
| Node.js | ✅ | ✅ | ✅ | ✅ |
| Browser | ✅ | ✅ | ✅ | ❌ |
| Edge/Workers | ✅ | ✅ | ⚠️ Limited | ❌ |
Performance Characteristics
| Library | Best For |
|---|---|
| parquet-wasm | Large files, high performance, Arrow integration |
| hyparquet | Browser-first, small bundle, low latency |
| @dsnp/parquetjs | Node.js apps needing full read/write, S3 integration |
| parquetjs | Legacy projects (unmaintained) |
Recommendations
For high-performance/large files → parquet-wasm
- Uses Rust's battle-tested parquet crate compiled to WASM
- Best compression codec support (including ZSTD, LZ4)
- Outputs Apache Arrow format (great for analytics)
- Larger initial bundle, but fastest for big data (a hedged write sketch follows this list)
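For illustration, a minimal write sketch, assuming the parquet-wasm 0.6+ ESM API (`initWasm` default export, `Table.fromIPCStream`, `WriterPropertiesBuilder`, `Compression`); check the package README for the exact entry point in your environment:

```js
// Hedged sketch: the names below (initWasm, Table.fromIPCStream, WriterPropertiesBuilder,
// Compression) follow the parquet-wasm 0.6+ ESM examples and may differ per version/entry point.
import { tableFromArrays, tableToIPC } from 'apache-arrow';
import initWasm, { Table, writeParquet, WriterPropertiesBuilder, Compression } from 'parquet-wasm';

await initWasm(); // instantiate the WASM module (the Node entry point may not need this)

// Build an Arrow JS table, then move it into WASM memory as an IPC stream
const jsTable = tableFromArrays({
  id: Int32Array.from([1, 2, 3]),
  value: Float64Array.from([0.1, 0.2, 0.3]),
});
const wasmTable = Table.fromIPCStream(tableToIPC(jsTable, 'stream'));

// One global compression setting for the whole file (contrast with @dsnp/parquetjs)
const props = new WriterPropertiesBuilder()
  .setCompression(Compression.ZSTD)
  .build();

const parquetBytes = writeParquet(wasmTable, props); // Uint8Array to write to disk or upload
```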
For browser-first/lightweight → hyparquet
- Zero dependencies, tiny bundle
- Great for viewing/reading parquet in browser
- HTTP range request support (partial file reads)
- Writing requires the separate `hyparquet-writer` package (a read sketch follows this list)
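A minimal browser-read sketch, assuming hyparquet's documented `asyncBufferFromUrl` and `parquetReadObjects` helpers (names taken from the hyparquet README); the URL and column names are placeholders:

```js
// Hedged sketch: asyncBufferFromUrl / parquetReadObjects are from the hyparquet README;
// verify the signatures against the version you install. URL and columns are placeholders.
import { asyncBufferFromUrl, parquetReadObjects } from 'hyparquet';

// Wraps the URL in an AsyncBuffer that fetches byte ranges on demand (HTTP Range requests)
const file = await asyncBufferFromUrl({ url: 'https://example.com/data.parquet' });

// Reads only the requested columns/rows, returning plain JS objects
const rows = await parquetReadObjects({
  file,
  columns: ['id', 'name'],
  rowStart: 0,
  rowEnd: 100,
});
console.log(rows[0]);
```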
For Node.js full-featured → @dsnp/parquetjs
- Most complete pure-JS implementation
- Built-in S3 support
- Actively maintained fork of original parquetjs
- Good TypeScript support
- ⚠️ Compression must be set per-column (as you discovered)
Avoid → parquetjs (original)
- Unmaintained (>1 year without updates)
- Use `@dsnp/parquetjs` instead
For your 268MB use case with compression needs, I'd suggest either parquet-wasm (best compression options, highest performance) or sticking with @dsnp/parquetjs but ensuring compression is set on each column in the schema.
dsnp/parquetjs compression
With 268MB and 100K row chunks, that should definitely compress. Something else is going on.
Checking how the compression option actually gets applied, and how the schema builds the column options, makes the issue clear: compression must be set on each column in the schema, NOT as a writer option. There is no global compression option — @dsnp/parquetjs does NOT support a global compression setting on the writer. You must specify compression on each column in the schema.
This won't work:
```js
// ❌ This does nothing - the compression option is ignored
const writer = await parquet.ParquetWriter.openFile(schema, 'output.parquet', {
  compression: 'GZIP'
});
```
This is how you must do it:
```js
// ✅ Compression must be set on each column
const schema = new parquet.ParquetSchema({
  id: { type: 'INT64', compression: 'GZIP' },
  name: { type: 'UTF8', compression: 'GZIP' },
  value: { type: 'DOUBLE', compression: 'GZIP' }
});
```
If you have many columns and want to apply compression to all of them programmatically:
```js
const fields = {
  id: { type: 'INT64' },
  name: { type: 'UTF8' },
  value: { type: 'DOUBLE' }
};

// Add compression to all fields
for (const key of Object.keys(fields)) {
  fields[key].compression = 'GZIP';
}

const schema = new parquet.ParquetSchema(fields);
```
streaming / appending to parquet files
Checking the parquet-wasm API for that capability: no, parquet-wasm does not support appending multiple row groups incrementally to a single file.
Looking at the API, parquet-wasm has a simple writeParquet(table, writerProperties) function that takes an entire Arrow Table and writes it out at once. There's no streaming writer or incremental row group API.
Here's the comparison for your use case (a @dsnp/parquetjs sketch follows the table):
| Library | Incremental Row Group Writing |
|---|---|
| @dsnp/parquetjs | ✅ Yes - ParquetWriter buffers rows and writes row groups automatically based on rowGroupSize |
| parquet-wasm | ❌ No - single-shot writeParquet() only |
| hyparquet-writer | ❌ No - single-shot API |
| PyArrow (Python) | ✅ Yes - ParquetWriter.write_table() can be called multiple times |
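For completeness, a sketch of the @dsnp/parquetjs incremental path using the standard parquetjs-style API (`openFile`, `setRowGroupSize`, `appendRow`, `close`); the schema, file name, and row counts are illustrative:

```js
// Sketch using the parquetjs-style API exposed by @dsnp/parquetjs;
// schema, file name, and row counts are illustrative.
import parquet from '@dsnp/parquetjs';

const schema = new parquet.ParquetSchema({
  id: { type: 'INT64', compression: 'GZIP' },    // remember: compression is per-column
  value: { type: 'DOUBLE', compression: 'GZIP' },
});

const writer = await parquet.ParquetWriter.openFile(schema, 'output.parquet');
writer.setRowGroupSize(100_000); // flush a row group every 100K buffered rows

for (let i = 0; i < 1_000_000; i++) {
  await writer.appendRow({ id: i, value: Math.random() }); // buffered until the row group is full
}

await writer.close(); // writes the final row group and the Parquet footer
```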
Python vs TypeScript for Parquet
Yes, for Parquet writing, Python is significantly better. The JS ecosystem is pretty immature here.
PyArrow gives you:
- Streaming writer with multiple row groups
- Global compression settings (not per-column)
- Append mode to existing files
- All compression codecs (ZSTD, LZ4, etc.)
- Battle-tested, production-grade
- Much better performance
```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ('id', pa.int64()),
    ('name', pa.string()),
])

with pq.ParquetWriter('output.parquet', schema, compression='zstd') as writer:
    for chunk in chunks:
        table = pa.Table.from_pydict(chunk, schema=schema)
        writer.write_table(table)  # Each call can create a new row group
```
That's it. Compression just works globally.
If you must stay in Node.js, your options are:
- Spawn Python - call a Python script from Node for the Parquet writing
- Use DuckDB - the `duckdb` npm package can write Parquet with proper compression
- Fix @dsnp/parquetjs - add compression to every column (tedious but works)
DuckDB in Node might actually be your best hybrid option:
```js
import { Database } from 'duckdb-async';

const db = await Database.create(':memory:');
await db.run(`
  COPY (SELECT * FROM read_json_auto('data.json'))
  TO 'output.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
`);
```
If Parquet I/O is a significant part of your workflow, Python is the right tool. The JS libraries are mostly community efforts, while PyArrow is backed by the Arrow project itself.
From Gemini.ai
Quick Verdict
- Best for Performance (Heavy Data): `parquet-wasm` (fastest throughput, returns Arrow data).
- Best for Web/Browser (Low Latency): `hyparquet` (tiny size, fast startup, no WASM overhead).
- Best for Analytics/SQL: `duckdb-wasm` (full SQL engine, fast queries, heavy load).
- Legacy/Node.js: `parquetjs` (use the `@dsnp` fork).
Comparison Matrix
| Library | Type | Speed | Bundle Size | Popularity | Support |
|---|---|---|---|---|---|
| parquet-wasm | Rust (WASM) | 🚀 Excellent (Throughput) | Heavy (~1.2MB / ~450KB read-only) | Rising (Niche high-perf) | ✅ Active (Kyle Barron) |
| hyparquet | Pure JS | ⚡ Good (Startup/Latency) | 🪶 Tiny (~10KB min) | Low but growing (~730 stars) | ✅ Active (Single maintainer) |
| duckdb-wasm | SQL Engine | 🐢 Startup / 🚀 Query | 🐘 Huge (20MB+) | 🔥 High (~400k weekly) | ✅ Excellent (DuckDB Team) |
| apache-arrow | JS / TS | 🔸 Moderate (Reading) | 📦 Large (~5MB unpacked) | 👑 Massive (~1M weekly) | ✅ Excellent (Apache) |
| parquetjs | Pure JS | 🐌 Slow | 🔸 Medium (~220KB) | 📉 Legacy (~370k monthly) | ❌ Abandoned (Use forks) |
Detailed Breakdown
1. parquet-wasm
- Best For: Heavy data processing, integration with Apache Arrow, and scenarios where throughput matters more than initial load time.
- Pros:
  - Speed: Uses Rust's high-performance `parquet` crate compiled to WebAssembly. It is significantly faster than pure JS libraries for parsing large files.
  - Arrow Integration: Reads data directly into Apache Arrow tables (efficient, zero-copy-like architecture).
  - Features: Supports all compression codecs (Snappy, Gzip, Brotli, ZSTD, etc.) and writing capabilities.
- Cons:
  - Size: The WASM bundle is heavy. Even the "read-only" build is ~450KB, which can hurt "Time to Interactive" on slower networks.
  - Complexity: Requires handling async WASM initialization and memory management.
2. hyparquet
- Best For: Browser-based viewers, dashboards, and apps where fast startup (low latency) is critical.
- Pros:
  - Size: Extremely lightweight (~10KB). It loads instantly compared to WASM libraries.
  - Async/Streaming: Optimized for HTTP Range requests. It can read metadata and specific chunks without downloading the whole file, making it feel "instant" for large remote files.
  - Simplicity: Pure JavaScript. No WASM compile steps, no complex bundler configurations, and easy to debug.
- Cons:
  - Throughput: Slower than `parquet-wasm` if you need to parse every row of a massive file into memory at once.
  - Output: Returns standard JS objects (or arrays), which use more memory than Arrow buffers.
3. duckdb-wasm
- Best For: Complex analytics where you need to run SQL queries (Group By, Join, Filter) on Parquet files client-side.
- Pros:
  - Query Performance: Once loaded, it is incredibly fast. You can run complex SQL queries directly on the Parquet file without loading it all into JS memory.
  - Versatility: Handles CSV, JSON, and Arrow formats as well.
- Cons:
  - Startup Cost: The "elephant" in the room. It downloads a massive WASM binary and data files. It is not suitable for quick "page load" interactions.
  - Overhead: Overkill if you just need to read a few rows or simple data.
4. parquetjs (and forks)
- Best For: Legacy Node.js applications or simple scripts where WASM is difficult to deploy.
- Status: The original `ironSource/parquetjs` is abandoned (6+ years old).
- Recommendation: If you must use this API, use the maintained fork: `@dsnp/parquetjs`.
- Pros:
  - Familiarity: Simple, row-based API (Iterators).
  - Pure JS: Easy to use in Node.js without WASM compatibility issues.
- Cons:
  - Performance: Significantly slower than WASM-based solutions.
  - Maintenance: Even forks receive only sporadic updates compared to the Arrow/DuckDB ecosystem.
Summary Recommendation
- Use `hyparquet` if you are building a web UI and need the page to load fast. It's the modern, lightweight choice for browsers.
- Use `parquet-wasm` if you are building a heavy data tool (like a BI tool in the browser) and plan to use Apache Arrow for data manipulation.
- Use `duckdb-wasm` if you need SQL capabilities client-side.
- Use `@dsnp/parquetjs` only if you are in a Node.js environment where WASM is not an option and performance is not critical.
===
The relationship between parquet-wasm and apache-arrow is complementary.
They represent the two halves of a modern data pipeline: Storage vs. Memory.
You did not see apache-arrow in the detailed list of Parquet libraries because apache-arrow (the NPM package) cannot read Parquet files by itself.
Here is the specific relationship:
1. The Separation of Concerns
- Parquet is a Storage Format (On Disk). It is highly compressed and encoded to save space on your hard drive or network.
- Arrow is a Memory Format (In RAM). It is uncompressed and laid out specifically for the CPU to process extremely fast (Zero-Copy).
2. The Bridge (parquet-wasm)
Since the standard apache-arrow JavaScript library does not have a native Parquet parser, you need a "loader" to convert the compressed file into the Arrow memory format.
- `parquet-wasm` acts as this bridge.
- It takes the compressed Parquet bytes -> decompresses them in WebAssembly (Rust) -> and outputs Apache Arrow tables.
3. The Workflow
In a typical high-performance web application (2025), you use them together like this:
- Fetch: You download a file `data.parquet`.
- Decode (`parquet-wasm`): You pass the file to `parquet-wasm`. It decompresses the Snappy/Gzip data and organizes it into Arrow Buffers inside WASM memory.
- Analyze (`apache-arrow`): `parquet-wasm` hands back an Arrow `Table`. You then import the `apache-arrow` library to actually interact with that table (e.g., `table.getChild('price')`, `table.filter(...)`, or iterators). A code sketch of this workflow follows the list.
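A hedged sketch of that workflow, assuming the parquet-wasm 0.6+ API (`initWasm`, `readParquet`, `intoIPCStream`) and apache-arrow's `tableFromIPC`; the URL and the `price` column are placeholders:

```js
// Hedged sketch of fetch → decode → analyze; API names follow the parquet-wasm 0.6+
// README and apache-arrow. The URL and the 'price' column are placeholders.
import { tableFromIPC } from 'apache-arrow';
import initWasm, { readParquet } from 'parquet-wasm';

await initWasm(); // instantiate the WASM module once (browser/ESM build)

// 1. Fetch: download the compressed Parquet bytes
const resp = await fetch('https://example.com/data.parquet');
const parquetBytes = new Uint8Array(await resp.arrayBuffer());

// 2. Decode: parquet-wasm decompresses and parses into Arrow buffers inside WASM memory
const wasmTable = readParquet(parquetBytes);

// 3. Analyze: hand the Arrow IPC stream to apache-arrow and use its Table API
const table = tableFromIPC(wasmTable.intoIPCStream());
console.log(table.numRows, table.getChild('price')?.toArray());
```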
Summary Table
| Library | Role | Analogy |
|---|---|---|
| parquet-wasm | The Chef (Prepares the food) | Takes raw ingredients (Parquet file), cooks them (decompresses/parses), and puts them on a plate (Arrow Table). |
| apache-arrow | The Cutlery (Consumes the food) | The fork and knife you use to actually eat (read/manipulate) the data once it is on the plate. |
Why parquet-wasm is the winner for speed: Because it outputs data directly into the Arrow format, parquet-wasm allows you to hand that data off to other tools (like DuckDB, Arquero, or WebGL visualization libraries like Deck.gl) with Zero-Copy. The data doesn't need to be converted into slow JavaScript objects (like hyparquet does); it stays as a binary blob that the computer reads instantly.
===
duckdb-wasm
1. Is duckdb-wasm Standalone?
Yes and No.
Yes, it is a standalone Engine: `duckdb-wasm` is a complete SQL database engine running in your browser. It does not rely on `parquet-wasm` or `apache-arrow` to function. It has its own internal C++ code (compiled to WASM) that handles Parquet parsing, query execution, and data management. You can drop it into a page, give it a Parquet URL, and run SQL without installing anything else.
No, it is part of the "Modern Stack": While it can work alone, it is designed to plug directly into the Apache Arrow ecosystem.
- Input: It can read standard JS objects, but it is much faster if you feed it Apache Arrow tables.
- Output: When you run a query (`SELECT * FROM ...`), the most efficient way to get results out is as an Apache Arrow table.
2. Relationship with other libraries
| Library | Relationship to duckdb-wasm |
|---|---|
| parquet-wasm | Competitor (sort of). Both libraries can read Parquet files. • Use duckdb-wasm if you need to run SQL queries (filter, join, group).• Use parquet-wasm if you just want to convert Parquet -> Arrow/JS as fast as possible without the overhead of a full SQL engine. |
| apache-arrow | Best Friend. duckdb-wasm uses Arrow as its "data interchange" layer. You will almost always install the apache-arrow library alongside duckdb-wasm so you can actually read the results of your SQL queries efficiently. |
3. License
It is extremely permissive.
- License: MIT License
- What this means: It is free for commercial use, modification, and distribution. You do not need to pay or open-source your own code to use it. This is the same license as React, Angular, and many other standard web tools.
Summary: The "Data Stack" Architecture
If you are building a serious data tool in 2025, your stack often looks like this:
- Storage: Parquet File (Server/S3)
- Engine: `duckdb-wasm` (downloads the file, runs SQL, outputs Arrow)
- Visualization: `apache-arrow` (reads the Arrow output from DuckDB to render charts/tables; see the sketch after this list)
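A hedged sketch of that stack in the browser, following the jsDelivr bundle pattern from the duckdb-wasm docs; the Parquet URL and the query are placeholders:

```js
// Hedged sketch: follows the duckdb-wasm jsDelivr bundle pattern from its docs;
// the Parquet URL and query are placeholders.
import * as duckdb from '@duckdb/duckdb-wasm';

// 1. Engine: pick and instantiate the WASM bundle this browser supports
const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
const workerUrl = URL.createObjectURL(
  new Blob([`importScripts("${bundle.mainWorker}");`], { type: 'text/javascript' })
);
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

// 2. Storage + SQL: query the remote Parquet file directly
const conn = await db.connect();
const result = await conn.query(`
  SELECT category, avg(price) AS avg_price
  FROM 'https://example.com/data.parquet'
  GROUP BY category
`);

// 3. Visualization: the result is an apache-arrow Table, ready for charts/tables
console.log(result.toArray());
await conn.close();
```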