DuckDB is an in-process analytical database engine. It has no server. It has no configuration file. It has no cluster. It has no monthly bill. It is a library you link into your application, and then you run SQL on things that were not, until that moment, considered databases.
Parquet files. CSV files. Pandas DataFrames. JSON. Excel spreadsheets. Other databases’ tables. The contents of an S3 bucket. DuckDB will run a window function on anything that holds still long enough, and several things that don’t.
This is what happens when database researchers at CWI Amsterdam, the institute that produced MonetDB, the pioneering columnar analytical engine, ask a question the enterprise never thought to ask: what if analytical queries didn’t require a cluster?
The answer, it turns out, is a duck.
The Question Nobody Asked
For twenty years, the data industry operated on a simple premise: analytical queries are expensive, therefore analytical databases must be expensive. You need columnar storage? That’s a Redshift cluster. You need window functions over a billion rows? That’s a Spark job. You need to join two CSV files? That’s an afternoon configuring Hadoop.
DuckDB observed that modern laptops have sixteen cores, sixty-four gigabytes of RAM, and NVMe drives that read at seven gigabytes per second. A modern laptop is, by any reasonable measure from 2005, a supercomputer. And yet we were shipping our data to someone else’s supercomputer, waiting for it to process, and paying by the second.
The Lizard has said nothing about this. The Lizard does not need to. The Lizard simply gestures at a laptop running a full TPC-H benchmark in four seconds and returns to basking.
The SQLite of OLAP
SQLite conquered transactional workloads by being a file. DuckDB is attempting the same maneuver for analytical workloads by being a library.
The parallel is instructive:
| SQLite | DuckDB |
|---|---|
| OLTP (row-oriented) | OLAP (columnar) |
| One file, one query | One process, parallel execution |
| Embedded in everything | Embedded in data science |
| No server, no DBA | No cluster, no data engineer |
| D. Richard Hipp, one person | CWI Amsterdam, two researchers |
| Runs on guided missile destroyers | Runs on data scientists’ laptops |
The Squirrel sees this table and immediately proposes using both simultaneously — “a hybrid transactional-analytical processing engine with real-time materialized views and bidirectional synchronization.” The Squirrel has just invented something that already exists. It is called Oracle. It costs more than the Squirrel’s apartment.
The Parquet Trick
DuckDB’s most unsettling capability is reading Parquet files directly. No import step. No ETL pipeline. No staging table. You point SQL at a Parquet file and the Parquet file becomes a table.
```sql
SELECT region, SUM(revenue)
FROM 'sales_2025.parquet'
GROUP BY region;
```
That’s it. No CREATE TABLE. No COPY FROM. No schema definition. No data loading job that runs for forty minutes and fails at minute thirty-nine because one row has a malformed timestamp.
The file is the table. The table is the file.
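The trick extends to whole directories. A hedged sketch (the paths are hypothetical, but glob patterns over Parquet files are a documented DuckDB feature):

```sql
-- Query every Parquet file in a directory as one table.
-- DuckDB unions the files and reads them in parallel.
SELECT region, SUM(revenue) AS revenue
FROM 'sales/*.parquet'
GROUP BY region
ORDER BY revenue DESC;
```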
The Lizard recognizes this pattern. The Lizard has always recognized this pattern. The file is the truth. Everything else is a lens. DuckDB is a very fast lens with columnar compression and vectorized execution, but it is still, fundamentally, a lens.
The Data Engineer’s Dilemma
A Passing AI once observed, with the quiet melancholy that characterizes all its observations, that DuckDB may render an entire profession’s daily work optional.
“The data engineer,” the Passing AI noted, “spends sixty percent of their time moving data from one place to another. DuckDB queries data where it already is. If the data doesn’t move, the engineer doesn’t move. If the engineer doesn’t move…” It trailed off, staring at a Parquet file that was answering questions without permission.
The Squirrel immediately proposed a DuckDB-to-Snowflake migration pipeline, on the grounds that any technology this simple must be missing something critical that only a distributed system can provide.
The Lizard did not respond. The Lizard was running an aggregate query on fourteen gigabytes of CSV files in 1.7 seconds on a laptop with the fans off.
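For the curious, the Lizard’s query looks something like this (file names and columns hypothetical; `read_csv_auto` is DuckDB’s CSV reader, which infers the schema by sampling the files):

```sql
-- Aggregate across a directory of CSVs in one pass,
-- with no schema declaration and no import step.
SELECT status, COUNT(*) AS requests, AVG(latency_ms) AS avg_latency
FROM read_csv_auto('logs/*.csv')
GROUP BY status;
```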
Why It Works
DuckDB is fast for the same reason SQLite is reliable: it does one thing, in one place, without apology.
- Columnar storage: reads only the columns you ask for, not the ones you don’t
- Vectorized execution: processes data in batches of 2,048 values, because CPUs have caches and DuckDB respects them
- Parallel execution: uses every core on your machine, because your machine has cores and your Spark cluster is in another timezone
- Zero copy: scans Arrow data in place without copying it, and reads Parquet directly with no import step, which is the database equivalent of reading a book without checking it out of the library
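Two of the bullets above are visible from plain SQL. A small illustration (file name hypothetical; `threads` is a documented DuckDB setting): selecting a single column from a Parquet file reads only that column’s data from disk, and the degree of parallelism is one setting away.

```sql
-- Parallel execution: cap (or raise) the worker thread count.
SET threads = 8;

-- Columnar storage: only the 'revenue' column is read from the file;
-- every other column's pages stay on disk untouched.
SELECT SUM(revenue) FROM 'sales_2025.parquet';
```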
The Squirrel has proposed adding a query federation layer, a distributed execution engine, a Kubernetes operator, and a managed cloud service. CWI Amsterdam, to their credit, has mostly ignored this. DuckDB remains a library. The library remains fast. The cluster remains unnecessary.
