DuckDB is an in-process analytical database engine. It has no server. It has no configuration file. It has no cluster. It has no monthly bill. It is a library you link into your application, and then you run SQL on things that were not, until that moment, considered databases.
Parquet files. CSV files. Pandas DataFrames. JSON. Excel spreadsheets. Other databases’ tables. The contents of an S3 bucket. DuckDB will run a window function on anything that holds still long enough, and several things that don’t.
This is what happens when database researchers at CWI Amsterdam, the institute that produced MonetDB, the pioneering columnar analytical engine, ask a question the enterprise never thought to ask: what if analytical queries didn’t require a cluster?
The answer, it turns out, is a duck.
The Question Nobody Asked
For twenty years, the data industry operated on a simple premise: analytical queries are expensive, therefore analytical databases must be expensive. You need columnar storage? That’s a Redshift cluster. You need window functions over a billion rows? That’s a Spark job. You need to join two CSV files? That’s an afternoon configuring Hadoop.
DuckDB observed that modern laptops have sixteen cores, sixty-four gigabytes of RAM, and NVMe drives that read at seven gigabytes per second. A modern laptop is, by any reasonable measure from 2005, a supercomputer. And yet we were shipping our data to someone else’s supercomputer, waiting for it to process, and paying by the second.
The Lizard has said nothing about this. The Lizard does not need to. The Lizard simply gestures at a laptop running a full TPC-H benchmark in four seconds and returns to basking.
The SQLite of OLAP
SQLite conquered transactional workloads by being a file. DuckDB is attempting the same maneuver for analytical workloads by being a library.
The parallel is instructive:
| SQLite | DuckDB |
|---|---|
| OLTP (row-oriented) | OLAP (columnar) |
| One file, one query | One process, parallel execution |
| Embedded in everything | Embedded in data science |
| No server, no DBA | No cluster, no data engineer |
| D. Richard Hipp, one person | CWI Amsterdam, two researchers |
| Runs on guided missile destroyers | Runs on data scientists’ laptops |
The Squirrel sees this table and immediately proposes using both simultaneously — “a hybrid transactional-analytical processing engine with real-time materialized views and bidirectional synchronization.” The Squirrel has just invented something that already exists. It is called Oracle. It costs more than the Squirrel’s apartment.
The Parquet Trick
DuckDB’s most unsettling capability is reading Parquet files directly. No import step. No ETL pipeline. No staging table. You point SQL at a Parquet file and the Parquet file becomes a table.
```sql
SELECT region, SUM(revenue)
FROM 'sales_2025.parquet'
GROUP BY region;
```
That’s it. No CREATE TABLE. No COPY FROM. No schema definition. No data loading job that runs for forty minutes and fails at minute thirty-nine because one row has a malformed timestamp.
The file is the table. The table is the file.
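The trick extends to whole directories. A hedged sketch (the paths are hypothetical, but glob patterns over Parquet files are a documented DuckDB feature):

```sql
-- Query every Parquet file in a directory as one table.
-- DuckDB unions the files and reads them in parallel.
SELECT region, SUM(revenue) AS revenue
FROM 'sales/*.parquet'
GROUP BY region
ORDER BY revenue DESC;
```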
The Lizard recognizes this pattern. The Lizard has always recognized this pattern. The file is the truth. Everything else is a lens. DuckDB is a very fast lens with columnar compression and vectorized execution, but it is still, fundamentally, a lens.
The Data Engineer’s Dilemma
A Passing AI once observed, with the quiet melancholy that characterizes all its observations, that DuckDB may render an entire profession’s daily work optional.
“The data engineer,” the Passing AI noted, “spends sixty percent of their time moving data from one place to another. DuckDB queries data where it already is. If the data doesn’t move, the engineer doesn’t move. If the engineer doesn’t move…” It trailed off, staring at a Parquet file that was answering questions without permission.
The Squirrel immediately proposed a DuckDB-to-Snowflake migration pipeline, on the grounds that any technology this simple must be missing something critical that only a distributed system can provide.
The Lizard did not respond. The Lizard was running an aggregate query on fourteen gigabytes of CSV files in 1.7 seconds on a laptop with the fans off.
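For the curious, the Lizard’s query looks something like this (file names and columns hypothetical; `read_csv_auto` is DuckDB’s CSV reader, which infers the schema by sampling the files):

```sql
-- Aggregate across a directory of CSVs in one pass,
-- with no schema declaration and no import step.
SELECT status, COUNT(*) AS requests, AVG(latency_ms) AS avg_latency
FROM read_csv_auto('logs/*.csv')
GROUP BY status;
```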
Why It Works
DuckDB is fast for the same reason SQLite is reliable: it does one thing, in one place, without apology.
- Columnar storage: reads only the columns you ask for, not the ones you don’t
- Vectorized execution: processes data in batches of 2,048 values, because CPUs have caches and DuckDB respects them
- Parallel execution: uses every core on your machine, because your machine has cores and your Spark cluster is in another timezone
- Zero copy: scans Arrow data in place without copying it, and reads Parquet directly with no import step, which is the database equivalent of reading a book without checking it out of the library
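Two of the bullets above are visible from plain SQL. A small illustration (file name hypothetical; `threads` is a documented DuckDB setting): selecting a single column from a Parquet file reads only that column’s data from disk, and the degree of parallelism is one setting away.

```sql
-- Parallel execution: cap (or raise) the worker thread count.
SET threads = 8;

-- Columnar storage: only the 'revenue' column is read from the file;
-- every other column's pages stay on disk untouched.
SELECT SUM(revenue) FROM 'sales_2025.parquet';
```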
The Squirrel has proposed adding a query federation layer, a distributed execution engine, a Kubernetes operator, and a managed cloud service. CWI Amsterdam, to their credit, has mostly ignored this. DuckDB remains a library. The library remains fast. The cluster remains unnecessary.
