Large-Scale Data Processing
Large-scale data processing examples for files, corpora, table scans, and ordinary Python data code.

Query 2.4 TB Parquet in 76s
Run a DuckDB query over 1,000 Parquet files on 10,000 CPUs and combine the results.
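The combine step can be sketched in plain Python. This is a minimal illustration, not the example's actual code: it assumes each per-file query returns partial aggregates as `(key, total, count)` rows, and the function name `combine_partials` and the sample rows are hypothetical.

```python
from collections import defaultdict

def combine_partials(partials):
    """Merge per-file (key, total, count) rows into global sums, counts, and means."""
    totals = defaultdict(lambda: [0.0, 0])
    for rows in partials:
        for key, total, count in rows:
            totals[key][0] += total
            totals[key][1] += count
    # Means are derived only after all partials are merged, so they stay exact.
    return {
        key: {"sum": s, "count": n, "mean": s / n if n else None}
        for key, (s, n) in totals.items()
    }

# Hypothetical partial results from two Parquet files.
file_a = [("US", 10.0, 2), ("DE", 3.0, 1)]
file_b = [("US", 5.0, 1)]
combined = combine_partials([file_a, file_b])
```

Carrying sums and counts instead of per-file averages is what lets the partial results be merged in any order.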

Distill 571M Amazon reviews
Read 275 GB of JSONL with HTTP Range requests, score reviews deterministically, reduce with heaps, and run a second Unhinged Mode pass.
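A rough sketch of two of the pieces named here, byte-range splitting and a heap reducer, using only the standard library. The helper names (`byte_ranges`, `top_k`, `merge_top_k`) and the sample scores are assumptions for illustration; the example's real scoring and reducer logic may differ.

```python
import heapq

def byte_ranges(total_size, chunk_size):
    """Return HTTP Range header values covering [0, total_size) in fixed-size chunks."""
    ranges = []
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1  # Range ends are inclusive
        ranges.append(f"bytes={start}-{end}")
    return ranges

def top_k(scored, k):
    """Keep the k highest (score, item) pairs from one chunk with a bounded min-heap."""
    heap = []
    for pair in scored:
        if len(heap) < k:
            heapq.heappush(heap, pair)
        elif pair > heap[0]:
            # Evict the current smallest pair; ties on score fall back to the item.
            heapq.heapreplace(heap, pair)
    return sorted(heap, reverse=True)

def merge_top_k(chunk_results, k):
    """Reduce per-chunk top-k lists into one global top-k list."""
    return heapq.nlargest(k, (pair for chunk in chunk_results for pair in chunk))

# Hypothetical scored reviews from two chunks.
chunk_1 = top_k([(0.9, "a"), (0.1, "b"), (0.5, "c")], k=2)
chunk_2 = top_k([(0.7, "d"), (0.95, "e")], k=2)
best = merge_top_k([chunk_1, chunk_2], k=2)
```

A real JSONL reader would also need to handle lines that straddle a range boundary, which this sketch leaves out.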

Scan 2.76B NYC taxi trips
Scan taxi and FHV month files, keep pickup counts small, and classify zones after the full scan.
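One way to picture the count-then-classify pattern is shown below. It is a sketch under stated assumptions: each month file is reduced to a small `Counter` keyed by pickup zone ID, and the `classify` function, its threshold, and the "busy"/"quiet" labels are hypothetical.

```python
from collections import Counter

def count_pickups(zone_ids):
    """Reduce one month file to a small mapping of zone ID to pickup count."""
    return Counter(zone_ids)

def classify(counts, busy_threshold):
    """Label each zone only after every month has been merged into the totals."""
    return {
        zone: "busy" if n >= busy_threshold else "quiet"
        for zone, n in counts.items()
    }

# Hypothetical pickup zone IDs from two month files.
month_1 = [1, 1, 2]
month_2 = [1, 3]

total = Counter()
for month in (month_1, month_2):
    total.update(count_pickups(month))  # merging counters adds the counts

labels = classify(total, busy_threshold=2)
```

Each worker returns one integer per zone rather than raw trip rows, so the merged result stays small no matter how many trips were scanned.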

Map geotagged Flickr photos
Reverse-geocode public photos and build country signatures from user-written tags.