Process 2.4TB of Parquet Files in 76s
With <30 lines of Python.
In this example we:
Generate 1,000 billion-row Parquet files (2.4TB) and store them in Google Cloud Storage.
Run a DuckDB query on each file in parallel using a cluster with 10,000 CPUs.
Combine the resulting DataFrames to produce the final result!
What is the trillion row challenge?
The trillion row challenge is an extension of the billion row challenge: the goal is to compute the min, max, and mean temperature per weather station, across 413 unique stations, from data stored as a collection of Parquet files in blob storage. The data looks like this (but with 1,000,000,000,000 rows):
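For illustration, here's a tiny sample of what the rows contain (the column names are assumptions; each row is simply a weather station paired with a temperature reading):

```python
import pandas as pd

# Tiny illustrative sample; the real dataset has 1,000,000,000,000 such rows
# spread across 1,000 Parquet files. Column names here are assumptions.
sample = pd.DataFrame({
    "station": ["Hamburg", "Istanbul", "Istanbul", "Hamburg", "Dallas"],
    "measurement": [12.0, 23.8, 23.2, 8.1, 26.4],
})
print(sample)
```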

Step 1: Boot some VMs
For this challenge we used a 125-node cluster with 80 CPUs and 320GB of RAM per node. Under the hood these are N4-standard-80 machines. The cluster settings look like this:

Once the settings look good we just hit "⏻ Start" on the front page. This cluster took 1min 47s to boot.
Step 2: Generate the dataset
We chose to split the 1T-row dataset into 1,000 Parquet files. With 10,000 CPUs, that works out to one file per 10-CPU worker.
Once generated, Parquet files are written to the ./shared folder. Anything placed in this folder is synchronized with a Google Cloud Storage bucket using GCSFuse. The dashboard has a built-in file manager where you can monitor, upload, and download files from the cluster (screenshot below).
Below, each call to generate_parquet runs in a python:3.12 Docker container (see settings ↑). This can be any Docker container, as long as your VM's service account is authorized to pull it.
Python packages your code uses that aren't in the container are detected and installed quickly at runtime. The code below took 3s to install pandas, numpy, and pyarrow in every container.
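The exact generation script isn't reproduced here, but a minimal sketch might look like the following. It assumes Burla's remote_parallel_map API; the column names, file names, and random data are illustrative stand-ins for whatever the real script generates:

```python
import numpy as np
import pandas as pd
from burla import remote_parallel_map

N_FILES = 1_000
ROWS_PER_FILE = 1_000_000_000   # 1,000 files x 1B rows = 1T rows total
N_STATIONS = 413

def generate_parquet(file_index: int):
    # Illustrative data: integer station ids stand in for station names,
    # temperatures are random floats.
    rng = np.random.default_rng(file_index)
    df = pd.DataFrame({
        "station": rng.integers(0, N_STATIONS, ROWS_PER_FILE, dtype=np.int32),
        "measurement": rng.normal(10, 15, ROWS_PER_FILE).astype(np.float32).round(1),
    })
    # Anything written to ./shared is synced to the GCS bucket via GCSFuse.
    df.to_parquet(f"./shared/measurements_{file_index}.parquet")

# One call per file, fanned out across the 10,000-CPU cluster.
remote_parallel_map(generate_parquet, list(range(N_FILES)))
```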
This code completed in 5m 48s! The final dataset is 2.4TB. After running, the files appear under the "Filesystem" tab (which is backed by a GCS bucket).

Step 3: Run the challenge!
This code runs a simple DuckDB query on all 1,000 Parquet files at the same time. Each query returns a pandas DataFrame with the min/mean/max per station within that file. The DataFrames are then concatenated and aggregated locally to produce the final min/mean/max temperature per weather station.
Like the data generation code above, every call to station_stats runs in a python:3.12 Docker container, and any missing Python packages are quickly installed at runtime. Parquet files read from the ./shared folder are actually being downloaded from Google Cloud Storage using GCSFuse.
To avoid using any cached files/data the cluster was rebooted before running the code below.
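The post's exact code isn't reproduced here; the sketch below shows the general shape, assuming Burla's remote_parallel_map (and that it returns each call's result in a list), plus the illustrative column and file names from the generation sketch above:

```python
import duckdb
import pandas as pd
from burla import remote_parallel_map

def station_stats(filepath: str) -> pd.DataFrame:
    # Per-file aggregation: min/mean/max temperature (and row count) per station.
    # ./shared is the GCSFuse mount, so this reads straight from the bucket.
    return duckdb.query(f"""
        SELECT station,
               min(measurement) AS min_temp,
               mean(measurement) AS mean_temp,
               max(measurement) AS max_temp,
               count(*) AS n
        FROM read_parquet('{filepath}')
        GROUP BY station
    """).to_df()

files = [f"./shared/measurements_{i}.parquet" for i in range(1_000)]
results = remote_parallel_map(station_stats, files)

# Combine the 1,000 partial results locally. The per-file row counts make the
# global mean exact (a plain mean-of-means would only be approximate).
combined = pd.concat(results)
combined["weighted_sum"] = combined["mean_temp"] * combined["n"]
final = combined.groupby("station").agg(
    min_temp=("min_temp", "min"),
    max_temp=("max_temp", "max"),
    weighted_sum=("weighted_sum", "sum"),
    n=("n", "sum"),
)
final["mean_temp"] = final["weighted_sum"] / final["n"]
print(final[["min_temp", "mean_temp", "max_temp"]])
```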
This took only 76s to finish, which I believe is technically a new record on the full 2.4TB dataset! However, as I discuss more below, I don't think pure runtime is what matters most. I also think that, with the right optimizations, this could be done in <5s!
Output:
Live Demo:
How expensive was this?
For this demo we used spot instances. These are spare machines rented at a discount that can be deleted ("preempted") at any time when Google needs them back. This isn't an issue for us because Burla jobs keep running even if some nodes are preempted mid-job.
Our cluster used 125 N4-standard-80 machines, took 1m 47s to boot, and was shut down 1.5 minutes later. To be safe, let's assume a billable runtime of 3.5 minutes per node; across 125 nodes that's about 7.3 node-hours.
At spot pricing, N4-standard-80 machines cost $1.22/hour, meaning this job cost about $8.91.
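As a quick sanity check on that arithmetic:

```python
nodes = 125
billable_minutes_per_node = 3.5     # generous estimate: boot + run + shutdown
spot_price_per_hour = 1.22          # N4-standard-80 spot price quoted above

node_hours = nodes * billable_minutes_per_node / 60
total_cost = node_hours * spot_price_per_hour
print(f"{node_hours:.1f} node-hours, about ${total_cost:.2f}")  # ~7.3 node-hours, ~$8.9
```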
What's the point?
76s is a respectable time, but the real question I'm trying to answer is: if I were in the office on a busy day and needed to process a bunch of data, what would I do? How long would it take, and how much would it cost?
Let's be honest, this isn't the most perfectly efficient solution in the world. Most of the time is spent downloading data, and while GCSFuse is convenient, it probably isn't maxing out the VM's network capacity.
But if I were in the office and you asked me for the min/mean/max per station (assuming I'd never heard of this challenge before), I'd have an answer for you about five minutes later, for less than $10. In my opinion, that's the real result, and I think it's an impressive one!
Not to mention, I'd do it all using an interface a beginner can understand!