parquet-wasm - npm
From Claude.ai
JavaScript/TypeScript Parquet Libraries Comparison
| Feature | parquet-wasm | hyparquet | @dsnp/parquetjs | parquetjs |
|---|---|---|---|---|
| Version | 0.7.1 | 1.23.3 | 1.8.7 | 0.11.2 |
| Last Updated | 2 months ago | 1 week ago | 6 months ago | >1 year ago |
| Dependencies | 0 | 0 | 13 | 8 |
| Bundle Size | ~1.2MB (WASM) | ~189KB | ~6.9MB | ~219KB |
| Implementation | Rust/WASM | Pure JS | Pure JS/TS | Pure JS |
| TypeScript | ✅ Native | ✅ Native | ✅ Native | ❌ (types available) |
Read/Write Support
| Capability | parquet-wasm | hyparquet | @dsnp/parquetjs | parquetjs |
|---|---|---|---|---|
| Read | ✅ | ✅ | ✅ | ✅ |
| Write | ✅ | ✅ (separate pkg) | ✅ | ✅ |
| Streaming Read | ✅ | ✅ | ✅ | ✅ |
| Remote/HTTP | ✅ | ✅ (range requests) | ✅ | ✅ |
| S3 Direct | ❌ | ❌ | ✅ | ❌ |
Compression Codecs
| Codec | parquet-wasm | hyparquet | @dsnp/parquetjs | parquetjs |
|---|---|---|---|---|
| Uncompressed | ✅ | ✅ | ✅ | ✅ |
| Snappy | ✅ | ✅ | ✅ | ✅ |
| GZIP | ✅ | ➕ (addon) | ✅ | ✅ |
| Brotli | ✅ | ➕ (addon) | ✅ | ✅ |
| ZSTD | ✅ | ➕ (addon) | ❌ | ❌ |
| LZ4 | ✅ | ➕ (addon) | ❌ | ❌ |
| LZO | ✅ | ➕ (addon) | ❌ | ✅ |
Environment Support
| Environment | parquet-wasm | hyparquet | @dsnp/parquetjs | parquetjs |
|---|---|---|---|---|
| Node.js | ✅ | ✅ | ✅ | ✅ |
| Browser | ✅ | ✅ | ✅ | ❌ |
| Edge/Workers | ✅ | ✅ | ⚠️ Limited | ❌ |
Performance Characteristics
| Library | Best For |
|---|---|
| parquet-wasm | Large files, high performance, Arrow integration |
| hyparquet | Browser-first, small bundle, low latency |
| @dsnp/parquetjs | Node.js apps needing full read/write, S3 integration |
| parquetjs | Legacy projects (unmaintained) |
Recommendations
For high-performance/large files → parquet-wasm
- Uses Rust's battle-tested parquet crate compiled to WASM
- Best compression codec support (including ZSTD, LZ4)
- Outputs Apache Arrow format (great for analytics)
- Larger initial bundle, but fastest for big data (a hedged write sketch follows this list)
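For illustration, a minimal write sketch, assuming the parquet-wasm 0.6+ ESM API (`initWasm` default export, `Table.fromIPCStream`, `WriterPropertiesBuilder`, `Compression`); check the package README for the exact entry point in your environment:

```js
// Hedged sketch: the names below (initWasm, Table.fromIPCStream, WriterPropertiesBuilder,
// Compression) follow the parquet-wasm 0.6+ ESM examples and may differ per version/entry point.
import { tableFromArrays, tableToIPC } from 'apache-arrow';
import initWasm, { Table, writeParquet, WriterPropertiesBuilder, Compression } from 'parquet-wasm';

await initWasm(); // instantiate the WASM module (the Node entry point may not need this)

// Build an Arrow JS table, then move it into WASM memory as an IPC stream
const jsTable = tableFromArrays({
  id: Int32Array.from([1, 2, 3]),
  value: Float64Array.from([0.1, 0.2, 0.3]),
});
const wasmTable = Table.fromIPCStream(tableToIPC(jsTable, 'stream'));

// One global compression setting for the whole file (contrast with @dsnp/parquetjs)
const props = new WriterPropertiesBuilder()
  .setCompression(Compression.ZSTD)
  .build();

const parquetBytes = writeParquet(wasmTable, props); // Uint8Array to write to disk or upload
```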
For browser-first/lightweight → hyparquet
- Zero dependencies, tiny bundle
- Great for viewing/reading parquet in browser
- HTTP range request support (partial file reads)
- Writing requires the separate `hyparquet-writer` package (a read sketch follows this list)
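A minimal browser-read sketch, assuming hyparquet's documented `asyncBufferFromUrl` and `parquetReadObjects` helpers (names taken from the hyparquet README); the URL and column names are placeholders:

```js
// Hedged sketch: asyncBufferFromUrl / parquetReadObjects are from the hyparquet README;
// verify the signatures against the version you install. URL and columns are placeholders.
import { asyncBufferFromUrl, parquetReadObjects } from 'hyparquet';

// Wraps the URL in an AsyncBuffer that fetches byte ranges on demand (HTTP Range requests)
const file = await asyncBufferFromUrl({ url: 'https://example.com/data.parquet' });

// Reads only the requested columns/rows, returning plain JS objects
const rows = await parquetReadObjects({
  file,
  columns: ['id', 'name'],
  rowStart: 0,
  rowEnd: 100,
});
console.log(rows[0]);
```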
For Node.js full-featured → @dsnp/parquetjs
- Most complete pure-JS implementation
- Built-in S3 support
- Actively maintained fork of original parquetjs
- Good TypeScript support
- ⚠️ Compression must be set per-column (as you discovered)
Avoid → parquetjs (original)
- Unmaintained (>1 year without updates)
- Use `@dsnp/parquetjs` instead
For your 268MB use case with compression needs, I'd suggest either parquet-wasm (best compression options, highest performance) or sticking with @dsnp/parquetjs but ensuring compression is set on each column in the schema.
dsnp/parquetjs compression
With 268MB and 100K row chunks, that should definitely compress. Something else is going on.
Checking how the compression option actually gets applied, and how the schema builds the column options, makes the issue clear: compression must be set on each column in the schema, NOT as a writer option. There is no global compression option — @dsnp/parquetjs does NOT support a global compression setting on the writer. You must specify compression on each column in the schema.
This won't work:
```js
// ❌ This does nothing - the compression option is ignored
const writer = await parquet.ParquetWriter.openFile(schema, 'output.parquet', {
  compression: 'GZIP'
});
```
This is how you must do it:
```js
// ✅ Compression must be set on each column
const schema = new parquet.ParquetSchema({
  id: { type: 'INT64', compression: 'GZIP' },
  name: { type: 'UTF8', compression: 'GZIP' },
  value: { type: 'DOUBLE', compression: 'GZIP' }
});
```
If you have many columns and want to apply compression to all of them programmatically:
```js
const fields = {
  id: { type: 'INT64' },
  name: { type: 'UTF8' },
  value: { type: 'DOUBLE' }
};

// Add compression to all fields
for (const key of Object.keys(fields)) {
  fields[key].compression = 'GZIP';
}

const schema = new parquet.ParquetSchema(fields);
```
streaming / appending to parquet files
Checking the parquet-wasm API for that capability: no, parquet-wasm does not support appending multiple row groups incrementally to a single file.
Looking at the API, parquet-wasm has a simple writeParquet(table, writerProperties) function that takes an entire Arrow Table and writes it out at once. There's no streaming writer or incremental row group API.
Here's the comparison for your use case (a @dsnp/parquetjs sketch follows the table):
| Library | Incremental Row Group Writing |
|---|---|
| @dsnp/parquetjs | ✅ Yes - ParquetWriter buffers rows and writes row groups automatically based on rowGroupSize |
| parquet-wasm | ❌ No - single-shot writeParquet() only |
| hyparquet-writer | ❌ No - single-shot API |
| PyArrow (Python) | ✅ Yes - ParquetWriter.write_table() can be called multiple times |
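For completeness, a sketch of the @dsnp/parquetjs incremental path using the standard parquetjs-style API (`openFile`, `setRowGroupSize`, `appendRow`, `close`); the schema, file name, and row counts are illustrative:

```js
// Sketch using the parquetjs-style API exposed by @dsnp/parquetjs;
// schema, file name, and row counts are illustrative.
import parquet from '@dsnp/parquetjs';

const schema = new parquet.ParquetSchema({
  id: { type: 'INT64', compression: 'GZIP' },    // remember: compression is per-column
  value: { type: 'DOUBLE', compression: 'GZIP' },
});

const writer = await parquet.ParquetWriter.openFile(schema, 'output.parquet');
writer.setRowGroupSize(100_000); // flush a row group every 100K buffered rows

for (let i = 0; i < 1_000_000; i++) {
  await writer.appendRow({ id: i, value: Math.random() }); // buffered until the row group is full
}

await writer.close(); // writes the final row group and the Parquet footer
```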
Python vs TypeScript for Parquet
Yes, for Parquet writing, Python is significantly better. The JS ecosystem is pretty immature here.
PyArrow gives you:
- Streaming writer with multiple row groups
- Global compression settings (not per-column)
- Append mode to existing files
- All compression codecs (ZSTD, LZ4, etc.)
- Battle-tested, production-grade
- Much better performance
```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ('id', pa.int64()),
    ('name', pa.string()),
])

with pq.ParquetWriter('output.parquet', schema, compression='zstd') as writer:
    for chunk in chunks:
        table = pa.Table.from_pydict(chunk, schema=schema)
        writer.write_table(table)  # Each call can create a new row group
```
That's it. Compression just works globally.
If you must stay in Node.js, your options are:
- Spawn Python - call a Python script from Node for the Parquet writing
- Use DuckDB - the `duckdb` npm package can write Parquet with proper compression
- Fix @dsnp/parquetjs - add compression to every column (tedious but works)
DuckDB in Node might actually be your best hybrid option:
```js
import { Database } from 'duckdb-async';

const db = await Database.create(':memory:');
await db.run(`
  COPY (SELECT * FROM read_json_auto('data.json'))
  TO 'output.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
`);
```
If Parquet I/O is a significant part of your workflow, Python is the right tool. The JS libraries are mostly community efforts, while PyArrow is backed by the Arrow project itself.
From Gemini.ai
Quick Verdict
- Best for Performance (Heavy Data): `parquet-wasm` (fastest throughput, returns Arrow data).
- Best for Web/Browser (Low Latency): `hyparquet` (tiny size, fast startup, no WASM overhead).
- Best for Analytics/SQL: `duckdb-wasm` (full SQL engine, fast queries, heavy load).
- Legacy/Node.js: `parquetjs` (use the `@dsnp` fork).
Comparison Matrix
| Library | Type | Speed | Bundle Size | Popularity | Support |
|---|---|---|---|---|---|
| parquet-wasm | Rust (WASM) | 🚀 Excellent (Throughput) | Heavy (~1.2MB / ~450KB read-only) | Rising (Niche high-perf) | ✅ Active (Kyle Barron) |
| hyparquet | Pure JS | ⚡ Good (Startup/Latency) | 🪶 Tiny (~10KB min) | Low but growing (~730 stars) | ✅ Active (Single maintainer) |
| duckdb-wasm | SQL Engine | 🐢 Startup / 🚀 Query | 🐘 Huge (20MB+) | 🔥 High (~400k weekly) | ✅ Excellent (DuckDB Team) |
| apache-arrow | JS / TS | 🔸 Moderate (Reading) | 📦 Large (~5MB unpacked) | 👑 Massive (~1M weekly) | ✅ Excellent (Apache) |
| parquetjs | Pure JS | 🐌 Slow | 🔸 Medium (~220KB) | 📉 Legacy (~370k monthly) | ❌ Abandoned (Use forks) |
Detailed Breakdown
1. parquet-wasm
- Best For: Heavy data processing, integration with Apache Arrow, and scenarios where throughput matters more than initial load time.
- Pros:
  - Speed: Uses Rust's high-performance `parquet` crate compiled to WebAssembly. It is significantly faster than pure JS libraries for parsing large files.
  - Arrow Integration: Reads data directly into Apache Arrow tables (efficient, zero-copy-like architecture).
  - Features: Supports all compression codecs (Snappy, Gzip, Brotli, ZSTD, etc.) and writing capabilities.
- Cons:
  - Size: The WASM bundle is heavy. Even the "read-only" build is ~450KB, which can hurt "Time to Interactive" on slower networks.
  - Complexity: Requires handling async WASM initialization and memory management.
2. hyparquet
- Best For: Browser-based viewers, dashboards, and apps where fast startup (low latency) is critical.
- Pros:
  - Size: Extremely lightweight (~10KB). It loads instantly compared to WASM libraries.
  - Async/Streaming: Optimized for HTTP Range requests. It can read metadata and specific chunks without downloading the whole file, making it feel "instant" for large remote files.
  - Simplicity: Pure JavaScript. No WASM compile steps, no complex bundler configurations, and easy to debug.
- Cons:
  - Throughput: Slower than `parquet-wasm` if you need to parse every row of a massive file into memory at once.
  - Output: Returns standard JS objects (or arrays), which use more memory than Arrow buffers.
3. duckdb-wasm
- Best For: Complex analytics where you need to run SQL queries (Group By, Join, Filter) on Parquet files client-side.
- Pros:
  - Query Performance: Once loaded, it is incredibly fast. You can run complex SQL queries directly on the Parquet file without loading it all into JS memory.
  - Versatility: Handles CSV, JSON, and Arrow formats as well.
- Cons:
  - Startup Cost: The "elephant" in the room. It downloads a massive WASM binary and data files. It is not suitable for quick "page load" interactions.
  - Overhead: Overkill if you just need to read a few rows or simple data.
4. parquetjs (and forks)
- Best For: Legacy Node.js applications or simple scripts where WASM is difficult to deploy.
- Status: The original `ironSource/parquetjs` is abandoned (6+ years old).
- Recommendation: If you must use this API, use the maintained fork: `@dsnp/parquetjs`.
- Pros:
  - Familiarity: Simple, row-based API (Iterators).
  - Pure JS: Easy to use in Node.js without WASM compatibility issues.
- Cons:
  - Performance: Significantly slower than WASM-based solutions.
  - Maintenance: Even forks receive only sporadic updates compared to the Arrow/DuckDB ecosystem.
Summary Recommendation
- Use `hyparquet` if you are building a web UI and need the page to load fast. It's the modern, lightweight choice for browsers.
- Use `parquet-wasm` if you are building a heavy data tool (like a BI tool in the browser) and plan to use Apache Arrow for data manipulation.
- Use `duckdb-wasm` if you need SQL capabilities client-side.
- Use `@dsnp/parquetjs` only if you are in a Node.js environment where WASM is not an option and performance is not critical.
===
The relationship between parquet-wasm and apache-arrow is complementary.
They represent the two halves of a modern data pipeline: Storage vs. Memory.
You did not see apache-arrow in the detailed list of Parquet libraries because apache-arrow (the NPM package) cannot read Parquet files by itself.
Here is the specific relationship:
1. The Separation of Concerns
- Parquet is a Storage Format (On Disk). It is highly compressed and encoded to save space on your hard drive or network.
- Arrow is a Memory Format (In RAM). It is uncompressed and laid out specifically for the CPU to process extremely fast (Zero-Copy).
2. The Bridge (parquet-wasm)
Since the standard apache-arrow JavaScript library does not have a native Parquet parser, you need a "loader" to convert the compressed file into the Arrow memory format.
- `parquet-wasm` acts as this bridge.
- It takes the compressed Parquet bytes -> decompresses them in WebAssembly (Rust) -> and outputs Apache Arrow tables.
3. The Workflow
In a typical high-performance web application (2025), you use them together like this:
- Fetch: You download a file `data.parquet`.
- Decode (`parquet-wasm`): You pass the file to `parquet-wasm`. It decompresses the Snappy/Gzip data and organizes it into Arrow Buffers inside WASM memory.
- Analyze (`apache-arrow`): `parquet-wasm` hands back an Arrow `Table`. You then import the `apache-arrow` library to actually interact with that table (e.g., `table.getChild('price')`, `table.filter(...)`, or iterators). A code sketch of this workflow follows the list.
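A hedged sketch of that workflow, assuming the parquet-wasm 0.6+ API (`initWasm`, `readParquet`, `intoIPCStream`) and apache-arrow's `tableFromIPC`; the URL and the `price` column are placeholders:

```js
// Hedged sketch of fetch → decode → analyze; API names follow the parquet-wasm 0.6+
// README and apache-arrow. The URL and the 'price' column are placeholders.
import { tableFromIPC } from 'apache-arrow';
import initWasm, { readParquet } from 'parquet-wasm';

await initWasm(); // instantiate the WASM module once (browser/ESM build)

// 1. Fetch: download the compressed Parquet bytes
const resp = await fetch('https://example.com/data.parquet');
const parquetBytes = new Uint8Array(await resp.arrayBuffer());

// 2. Decode: parquet-wasm decompresses and parses into Arrow buffers inside WASM memory
const wasmTable = readParquet(parquetBytes);

// 3. Analyze: hand the Arrow IPC stream to apache-arrow and use its Table API
const table = tableFromIPC(wasmTable.intoIPCStream());
console.log(table.numRows, table.getChild('price')?.toArray());
```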
Summary Table
| Library | Role | Analogy |
|---|---|---|
| parquet-wasm | The Chef (Prepares the food) | Takes raw ingredients (Parquet file), cooks them (decompresses/parses), and puts them on a plate (Arrow Table). |
| apache-arrow | The Cutlery (Consumes the food) | The fork and knife you use to actually eat (read/manipulate) the data once it is on the plate. |
Why parquet-wasm is the winner for speed: Because it outputs data directly into the Arrow format, parquet-wasm allows you to hand that data off to other tools (like DuckDB, Arquero, or WebGL visualization libraries like Deck.gl) with Zero-Copy. The data doesn't need to be converted into slow JavaScript objects (like hyparquet does); it stays as a binary blob that the computer reads instantly.
===
duckdb-wasm
1. Is duckdb-wasm Standalone?
Yes and No.
Yes, it is a standalone Engine: `duckdb-wasm` is a complete SQL database engine running in your browser. It does not rely on `parquet-wasm` or `apache-arrow` to function. It has its own internal C++ code (compiled to WASM) that handles Parquet parsing, query execution, and data management. You can drop it into a page, give it a Parquet URL, and run SQL without installing anything else.
No, it is part of the "Modern Stack": While it can work alone, it is designed to plug directly into the Apache Arrow ecosystem.
- Input: It can read standard JS objects, but it is much faster if you feed it Apache Arrow tables.
- Output: When you run a query (`SELECT * FROM ...`), the most efficient way to get results out is as an Apache Arrow table.
2. Relationship with other libraries
| Library | Relationship to duckdb-wasm |
|---|---|
| parquet-wasm | Competitor (sort of). Both libraries can read Parquet files. • Use duckdb-wasm if you need to run SQL queries (filter, join, group).• Use parquet-wasm if you just want to convert Parquet -> Arrow/JS as fast as possible without the overhead of a full SQL engine. |
| apache-arrow | Best Friend. duckdb-wasm uses Arrow as its "data interchange" layer. You will almost always install the apache-arrow library alongside duckdb-wasm so you can actually read the results of your SQL queries efficiently. |
3. License
It is extremely permissive.
- License: MIT License
- What this means: It is free for commercial use, modification, and distribution. You do not need to pay or open-source your own code to use it. This is the same license as React, Angular, and many other standard web tools.
Summary: The "Data Stack" Architecture
If you are building a serious data tool in 2025, your stack often looks like this:
- Storage: Parquet File (Server/S3)
- Engine: `duckdb-wasm` (downloads the file, runs SQL, outputs Arrow)
- Visualization: `apache-arrow` (reads the Arrow output from DuckDB to render charts/tables; see the sketch after this list)
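A hedged sketch of that stack in the browser, following the jsDelivr bundle pattern from the duckdb-wasm docs; the Parquet URL and the query are placeholders:

```js
// Hedged sketch: follows the duckdb-wasm jsDelivr bundle pattern from its docs;
// the Parquet URL and query are placeholders.
import * as duckdb from '@duckdb/duckdb-wasm';

// 1. Engine: pick and instantiate the WASM bundle this browser supports
const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
const workerUrl = URL.createObjectURL(
  new Blob([`importScripts("${bundle.mainWorker}");`], { type: 'text/javascript' })
);
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

// 2. Storage + SQL: query the remote Parquet file directly
const conn = await db.connect();
const result = await conn.query(`
  SELECT category, avg(price) AS avg_price
  FROM 'https://example.com/data.parquet'
  GROUP BY category
`);

// 3. Visualization: the result is an apache-arrow Table, ready for charts/tables
console.log(result.toArray());
await conn.close();
```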