duckdb/duckdb: DuckDB is an analytical in-process SQL database management system @GitHub
C++, MIT license
DuckDB is available as a standalone CLI application and has clients for Python, R, Java, Wasm, etc., with deep integrations with packages such as pandas and dplyr.
For more information on using DuckDB, please refer to the DuckDB documentation.
DuckDB — Embedded Analytical SQL Engine
What It Is
DuckDB is an embedded, in-process analytical database — "SQLite for analytics." No separate server process, no external dependencies. Runs inside the host application.
Key difference from SQLite: columnar-vectorized query execution (processes batches of values) optimized for OLAP (analytical workloads), not OLTP (transactional).
Official Language Bindings
| Language | Status | Notes |
|---|---|---|
| C/C++ | Official | Native API, primary interface |
| Python | Official | Most popular binding |
| Node.js/JS | Official | npm package |
| Java | Official | JDBC driver |
| Go | Official | duckdb-go-bindings repo |
| Rust | Official | duckdb-rs crate |
| R | Official | |
| WASM | Official | @duckdb/duckdb-wasm — full DuckDB in browser |
All maintained by the DuckDB Foundation.
Not Officially Supported (DS3 codegen targets)
- C# — no official binding (community wrappers may exist, or use C API via P/Invoke)
- Swift — no official binding (use C API via bridging header)
- Kotlin — can use Java JDBC driver
- Dart — no official binding (use C API via FFI)
Deployment Modes
1. Embedded / In-Process (primary)
Runs inside the host application. No server, no IPC overhead. The generated program links against DuckDB and queries locally.
import duckdb
conn = duckdb.connect()
conn.sql("SELECT * FROM read_csv_auto('data.csv') WHERE age > 30")
2. HTTP Server (experimental)
Community httpserver extension — not first-class:
SELECT httpserve_start('0.0.0.0', 9999, 'user:pass');
Includes embedded web UI. Experimental/unstable.
3. Stdio / Unix Pipeline Tool
Works as a Unix command-line tool for piping:
cat data.csv | duckdb :memory: "SELECT * FROM read_csv_auto('/dev/stdin')"
duckdb :memory: "COPY (SELECT ...) TO '/dev/stdout' WITH (FORMAT csv, HEADER)"
Not designed as a persistent stdio server — no request/response loop like UCI or MCP. It's query-execute-exit, not a protocol.
4. WASM (Browser)
Full DuckDB compiled to WebAssembly via @duckdb/duckdb-wasm:
- Runs in Chrome, Firefox, Safari, and Node.js
- Reads Parquet, CSV, JSON from Filesystem APIs or HTTP
- Speaks Apache Arrow natively
- Client-side analytics without sending data to a server
- Live demo:
shell.duckdb.org
Native File Format Support
DuckDB can query these directly (no import step):
- CSV —
read_csv_auto('file.csv') - Parquet —
read_parquet('file.parquet')— columnar, compressed, very fast - JSON —
read_json_auto('file.json') - Excel — via
spatialextension - HTTP/S3 — remote files via
httpfsextension
This is exactly why DuckDB exploded in popularity — people need to query flat files with SQL.
DuckDB vs AWS Athena
AWS Athena uses Trino (formerly PrestoSQL) under the hood — not DuckDB.
Athena v1 was built on Presto (Facebook's distributed SQL engine). Athena v2 migrated to Trino (the open-source fork after the Presto community split). Athena v3 (2023+) added Apache Spark integration for non-SQL workloads.
They solve the same problem (SQL over files) at vastly different scales:
| Aspect | Athena (Trino) | DuckDB |
|---|---|---|
| Architecture | Distributed, multi-node | Single-process, embedded |
| Scale | Petabytes across S3 | Gigabytes, local |
| Pricing | Pay per query (data scanned) | Free, open source |
| Setup | Managed service | Zero setup |
| File support | Parquet, CSV, JSON, ORC, Avro on S3 | Same formats, local or HTTP |
| Latency | Seconds (cluster spin-up) | Milliseconds |
Athena is the "enterprise cloud" answer; DuckDB is the "just query this file on my laptop" answer.
MotherDuck is the cloud/managed version of DuckDB — "DuckDB as a service," closer to what Athena does but with DuckDB's engine. Supports hybrid local+cloud queries.
No comments:
Post a Comment