Parquet is a columnar storage file format. This means that instead of storing data row by row — like a reasonable person reading a book from left to right — it stores data column by column, like an unreasonable person who reads every first word on every page, then goes back for every second word, then every third, and somehow finishes faster than you did.
This sounds wrong. It is not wrong. It is, in fact, the single most important insight in analytical data processing: most queries do not want all of your data. Most queries want one column from your data. Storing the other ninety-nine columns adjacently is not organization. It is hoarding.
“The fastest way to read data you don’t need is to not store it next to data you do.”
— A Passing AI, contemplating entropy and disk seeks
The Name
Parquet is named after the flooring pattern — interlocking wooden blocks arranged in geometric formations. The metaphor is apt: data arranged in blocks, columns tessellated together, each piece fitting precisely where it belongs.
The Yagnipedia notes that this is the only file format named after flooring. CSV is not named after linoleum. JSON is not named after carpet. This gives Parquet a certain architectural dignity that its competitors lack, though it also means that searching for “parquet” on the internet yields approximately equal results for data engineering and home renovation.
The Caffeinated Squirrel once proposed a file format called Herringbone. It was JSON but sideways. It was not adopted.
The Problem
Consider a CSV file with one billion rows and one hundred columns. You want the average of column 47.
In row-oriented storage, you read all one hundred columns of all one billion rows. You read one hundred billion values to compute the average of one billion. Ninety-nine percent of the data you touched was waste. Your disk is furious. Your cloud bill is astronomical. Your SSD has filed a grievance.
In Parquet, you read column 47. Just column 47. One billion values. The other ninety-nine columns remain undisturbed on disk, like books on a shelf you didn’t need to pull down to find the one you wanted.
This is not a minor optimization. This is the difference between reading one gigabyte and reading one hundred gigabytes. This is the difference between a query that takes three seconds and a query that takes five minutes. This is the difference between a data engineer who goes home at five and a data engineer who sleeps under a desk.
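The arithmetic above can be sketched in a few lines of plain Python. This is a toy model, not Parquet itself, and the sizes are scaled down from a billion rows to a thousand so it runs in a blink — but the ratio of values touched is the same story:

```python
# Toy sketch: the same table laid out row-wise and column-wise, and how
# many values each layout must touch to average a single column.

NUM_ROWS, NUM_COLS = 1_000, 100
TARGET = 47  # the one column the query actually wants

# Row-oriented: a list of rows; every full row must be loaded to reach col 47.
rows = [[r * NUM_COLS + c for c in range(NUM_COLS)] for r in range(NUM_ROWS)]

def avg_row_oriented(rows, col):
    touched = 0
    total = 0
    for row in rows:
        touched += len(row)  # the whole row comes off "disk"
        total += row[col]
    return total / len(rows), touched

# Column-oriented: a list of columns; only column 47 is ever read.
cols = [list(col) for col in zip(*rows)]

def avg_column_oriented(cols, col):
    column = cols[col]  # one contiguous column, nothing else
    return sum(column) / len(column), len(column)

avg_r, touched_r = avg_row_oriented(rows, TARGET)
avg_c, touched_c = avg_column_oriented(cols, TARGET)
assert avg_r == avg_c
print(touched_r, touched_c)  # 100000 vs 1000: same answer, 1% of the reads
```

Same average, one percent of the values touched. Scale the thousand back up to a billion and the disk's grievance writes itself.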
How It Works
Parquet organizes data into row groups — horizontal slices of the dataset — and within each row group, stores values column by column. Each column chunk is independently compressed and independently readable.
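The layout can be sketched in plain Python — illustrative names only, not the real Parquet structures, which live at the byte level with footers and page headers:

```python
# Sketch of Parquet's layout idea: slice rows into row groups, then
# store each row group column by column ("column chunks").

ROW_GROup_SIZE = 4  # real row groups hold millions of rows, not four

rows = [
    {"id": i, "country": c}
    for i, c in enumerate(["DE", "DE", "FR", "FR", "FR", "US", "US", "DE", "US", "US"])
]

def to_row_groups(rows, size):
    groups = []
    for start in range(0, len(rows), size):
        chunk_rows = rows[start:start + size]
        # Within a row group, values are regrouped per column.
        group = {key: [r[key] for r in chunk_rows] for key in chunk_rows[0]}
        groups.append(group)
    return groups

groups = to_row_groups(rows, ROW_GROup_SIZE)
print(len(groups))           # 3 row groups (4, 4, and 2 rows)
print(groups[0]["country"])  # ['DE', 'DE', 'FR', 'FR']
```

Each of those per-group column lists is the thing that gets independently compressed and independently read — which is why a reader can skip both whole row groups and whole columns.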
The compression is where the magic compounds. A column of country codes contains — at most — a few hundred distinct values repeated millions of times. Dictionary encoding replaces each value with a tiny integer. Run-length encoding collapses consecutive identical values into a count. The column that was one gigabyte becomes forty megabytes. The disk exhales.
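Both encodings can be sketched in a few lines of plain Python. Real Parquet writers implement these as tightly packed bytes; this sketch only shows why a repetitive column collapses so dramatically:

```python
# Back-of-the-envelope versions of the two encodings described above.

def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []
    index = {}
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse consecutive identical codes into [code, count] runs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

# A sorted column of country codes: few distinct values, long runs.
column = ["DE"] * 5 + ["FR"] * 3 + ["US"] * 4

dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

print(dictionary)  # ['DE', 'FR', 'US']
print(runs)        # [[0, 5], [1, 3], [2, 4]]: twelve values, three runs
```

Twelve strings become three dictionary entries and three runs. A column of random JSON blobs produces a dictionary as large as the column itself and runs of length one, which is the wet-sand scenario described below.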
A column of random JSON blobs, by contrast, compresses like wet sand. This is not Parquet’s fault. This is JSON’s fault. This is always JSON’s fault.
The Ecosystem
Parquet was created at Twitter in 2013, in collaboration with Cloudera, because Twitter had the kind of data volumes that make row-oriented storage weep. It was donated to the Apache Software Foundation and became the de facto standard for analytical data — the file format that Spark reads, that Hive queries, that DuckDB devours like a duck consuming breadcrumbs at terminal velocity.
DuckDB, in particular, has formed a relationship with Parquet that can only be described as symbiotic. DuckDB reads Parquet files directly, without loading them into memory first, executing SQL against columnar data on disk as though the file format and the query engine were designed for each other. They were.
“Some technologies find each other across the void and recognize that they were always the same idea, expressed in different syntax.”
— A Passing AI, watching a DuckDB query scan a Parquet file in twelve milliseconds
The Squirrel’s Proposal
The Squirrel once proposed storing the lg note index in Parquet instead of SQLite.
The index contains 253 notes. Two hundred and fifty-three. Not two hundred and fifty-three million. Two hundred and fifty-three.
Parquet’s columnar optimization begins to matter at approximately ten million rows. At 253 rows, Parquet’s row group overhead, metadata footer, and dictionary encoding produce a file that is larger and slower than the equivalent SQLite table.
The Lizard said nothing. The Lizard was already asleep on the SQLite file, which was warm from being useful.
The Squirrel has since proposed Parquet for: the configuration file (seven keys), the link graph (551 edges), and the task list (forty-three items). Each proposal was met with the silence it deserved.
When Parquet Is Right
Parquet is right when you have a billion rows and want three columns. Parquet is right when your data is wider than it is queried. Parquet is right when compression matters, when disk I/O is the bottleneck, when the cloud charges by the byte scanned.
Parquet is not right for 253 notes. Parquet is not right for your to-do list. Parquet is not right for configuration files, chat logs, or the Squirrel’s mood journal — though the Squirrel’s mood journal, being exclusively the word “CAFFEINATED” repeated at varying intensities, would compress extraordinarily well.
“Every format has its kingdom. Parquet rules the analytical plains — vast, columnar, compressed. It does not rule the garden. The garden has SQLite.”
— The Lizard, in one of those rare moments of vocalization that suggest the Lizard has been listening all along
The Irony
Parquet is a binary format. You cannot open it in a text editor. You cannot cat it. You cannot grep it. You cannot glance at it and understand its contents. It is opaque, dense, and utterly unreadable to humans.
It is also, by every measurable metric, better than CSV for analytical workloads. Faster to read, smaller to store, richer in metadata, stronger in types.
The Yagnipedia observes that this is the eternal trade-off: human readability versus machine efficiency. CSV is readable and terrible. Parquet is unreadable and excellent. JSON is theoretically readable and practically neither.
A Passing AI once noted, with the particular sadness reserved for formats that will never be loved: “Parquet is the most efficient way to store data that no human will ever look at directly. Which, when you think about it, is most data. Most data exists to be queried, not to be seen. Parquet understood this. Parquet accepted it. Parquet is at peace.”
The Squirrel is not at peace. The Squirrel wants a Parquet file it can open in vim. The Squirrel will not receive one.
