Friday, April 24, 2026

DuckDB: in-process SQL

duckdb/duckdb: DuckDB is an analytical in-process SQL database management system @GitHub
C++, MIT license

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs, maps), and several extensions designed to make SQL easier to use.

DuckDB is available as a standalone CLI application and has clients for Python, R, Java, Wasm, etc., with deep integrations with packages such as pandas and dplyr.

For more information on using DuckDB, please refer to the DuckDB documentation.

DuckDB — Embedded Analytical SQL Engine

What It Is

DuckDB is an embedded, in-process analytical database — "SQLite for analytics." No separate server process, no external dependencies. Runs inside the host application.

Key difference from SQLite: columnar-vectorized query execution (processes batches of values) optimized for OLAP (analytical workloads), not OLTP (transactional).

Official Language Bindings

LanguageStatusNotes
C/C++OfficialNative API, primary interface
PythonOfficialMost popular binding
Node.js/JSOfficialnpm package
JavaOfficialJDBC driver
GoOfficialduckdb-go-bindings repo
RustOfficialduckdb-rs crate
ROfficial
WASMOfficial@duckdb/duckdb-wasm — full DuckDB in browser

All maintained by the DuckDB Foundation.

Not Officially Supported (DS3 codegen targets)

  • C# — no official binding (community wrappers may exist, or use C API via P/Invoke)
  • Swift — no official binding (use C API via bridging header)
  • Kotlin — can use Java JDBC driver
  • Dart — no official binding (use C API via FFI)

Deployment Modes

1. Embedded / In-Process (primary)

Runs inside the host application. No server, no IPC overhead. The generated program links against DuckDB and queries locally.

import duckdb
conn = duckdb.connect()
conn.sql("SELECT * FROM read_csv_auto('data.csv') WHERE age > 30")

2. HTTP Server (experimental)

Community httpserver extension — not first-class:

SELECT httpserve_start('0.0.0.0', 9999, 'user:pass');

Includes embedded web UI. Experimental/unstable.

3. Stdio / Unix Pipeline Tool

Works as a Unix command-line tool for piping:

cat data.csv | duckdb :memory: "SELECT * FROM read_csv_auto('/dev/stdin')"
duckdb :memory: "COPY (SELECT ...) TO '/dev/stdout' WITH (FORMAT csv, HEADER)"

Not designed as a persistent stdio server — no request/response loop like UCI or MCP. It's query-execute-exit, not a protocol.

4. WASM (Browser)

Full DuckDB compiled to WebAssembly via @duckdb/duckdb-wasm:

  • Runs in Chrome, Firefox, Safari, and Node.js
  • Reads Parquet, CSV, JSON from Filesystem APIs or HTTP
  • Speaks Apache Arrow natively
  • Client-side analytics without sending data to a server
  • Live demo: shell.duckdb.org

Native File Format Support

DuckDB can query these directly (no import step):

  • CSV — read_csv_auto('file.csv')
  • Parquet — read_parquet('file.parquet') — columnar, compressed, very fast
  • JSON — read_json_auto('file.json')
  • Excel — via spatial extension
  • HTTP/S3 — remote files via httpfs extension

This is exactly why DuckDB exploded in popularity — people need to query flat files with SQL.


DuckDB vs AWS Athena

AWS Athena uses Trino (formerly PrestoSQL) under the hood — not DuckDB.

Athena v1 was built on Presto (Facebook's distributed SQL engine). Athena v2 migrated to Trino (the open-source fork after the Presto community split). Athena v3 (2023+) added Apache Spark integration for non-SQL workloads.

They solve the same problem (SQL over files) at vastly different scales:

AspectAthena (Trino)DuckDB
ArchitectureDistributed, multi-nodeSingle-process, embedded
ScalePetabytes across S3Gigabytes, local
PricingPay per query (data scanned)Free, open source
SetupManaged serviceZero setup
File supportParquet, CSV, JSON, ORC, Avro on S3Same formats, local or HTTP
LatencySeconds (cluster spin-up)Milliseconds

Athena is the "enterprise cloud" answer; DuckDB is the "just query this file on my laptop" answer.

MotherDuck is the cloud/managed version of DuckDB — "DuckDB as a service," closer to what Athena does but with DuckDB's engine. Supports hybrid local+cloud queries.

No comments: