
Find NOAA's rainiest day

In this example we:

  • Scan every GHCN-Daily year file from 1750 through today.

  • Keep daily precipitation records with clean quality flags.

  • Reduce 3,177,336,585 rows into a global rain leaderboard.

The top record the run found: 1,750.0 mm in a single day at Koumac, New Caledonia, on 1976-01-17.

Dataset: NOAA GHCN-Daily by-year files

NOAA publishes one gzip-compressed CSV per year, so the input list is just the range of years from START_YEAR through END_YEAR.

import csv
import gzip
import heapq
import io
import json
from datetime import date
from pathlib import Path

import requests
from burla import remote_parallel_map

BASE = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year"
PART_DIR = Path("/workspace/shared/ghcn-rain/parts")
FINAL_DIR = Path("/workspace/shared/ghcn-rain/final")
TOP_PER_YEAR = 100
START_YEAR = 1750
END_YEAR = date.today().year
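As a minimal sketch of that input list, the years can be expanded into one URL each. The `year_url` helper is hypothetical, and the `<year>.csv.gz` file naming is an assumption about the by-year directory layout rather than something stated on this page.

```python
from datetime import date

BASE = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year"
START_YEAR = 1750
END_YEAR = date.today().year


def year_url(year: int) -> str:
    # Hypothetical helper; assumes files are named <year>.csv.gz under BASE.
    return f"{BASE}/{year}.csv.gz"


# One input per year, inclusive of the current year.
years = list(range(START_YEAR, END_YEAR + 1))
urls = [year_url(y) for y in years]
```

Each element of `years` (or `urls`) would then be one unit of work for a worker.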

Step 1: Stream one year per worker

Each worker downloads one compressed year file and streams it row by row.
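A rough sketch of the streaming part, using only the standard library. The `iter_rows` name is hypothetical, and the sample row uses a made-up station ID; the point is that decompression and CSV parsing happen lazily, one row at a time.

```python
import csv
import gzip
import io
from typing import BinaryIO, Iterator


def iter_rows(raw: BinaryIO) -> Iterator[list[str]]:
    """Yield CSV rows from a gzip-compressed binary stream, one at a time."""
    with gzip.GzipFile(fileobj=raw) as unzipped:
        text = io.TextIOWrapper(unzipped, encoding="utf-8", newline="")
        yield from csv.reader(text)


# Demo on an in-memory file with made-up rows (station ID is hypothetical).
sample = (
    "XX000000001,19760117,PRCP,17500,,,S,\n"
    "XX000000001,19760117,TMAX,301,,,S,\n"
)
compressed = io.BytesIO(gzip.compress(sample.encode("utf-8")))
rows = list(iter_rows(compressed))
```

In a real worker the binary stream would presumably come from the HTTP response instead of `io.BytesIO`, for example `requests.get(url, stream=True).raw`.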

Step 2: Keep a heap, not the whole file

The worker filters precipitation rows, converts tenths of millimeters to millimeters, and keeps only the top records for that year.

The worker never holds a full year in memory. It holds a heap of at most TOP_PER_YEAR (100) records.
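The bounded heap can be sketched as follows. This is one common way to keep a top-N with `heapq`, not necessarily the exact code used in the run; the `(mm, station_id, yyyymmdd)` record shape and the `top_n` name are assumptions for illustration.

```python
import heapq
from typing import Iterable

Record = tuple[float, str, str]  # assumed shape: (mm, station_id, yyyymmdd)


def top_n(records: Iterable[Record], n: int) -> list[Record]:
    """Keep the n largest records using a min-heap of at most n items."""
    heap: list[Record] = []
    for rec in records:
        if len(heap) < n:
            heapq.heappush(heap, rec)
        elif rec > heap[0]:
            # heap[0] is the smallest kept record; swap it out for a larger one.
            heapq.heapreplace(heap, rec)
    return sorted(heap, reverse=True)


# Made-up records for illustration.
demo = [
    (12.0, "A", "19760101"),
    (1750.0, "B", "19760117"),
    (3.5, "C", "19760102"),
    (88.1, "D", "19760103"),
]
best_two = top_n(demo, 2)
```

Because the heap never grows past `n` items, memory stays constant no matter how many rows the year file contains.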

Step 3: Smoke test one year
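
Before fanning out to every year, it helps to run the worker on a single year and check its output. A minimal sketch of such a check, assuming a hypothetical worker that returns a list of `(mm, station_id, yyyymmdd)` tuples sorted from wettest to driest; `check_year_result` is an illustrative name, not part of the original code.

```python
TOP_PER_YEAR = 100

Record = tuple[float, str, str]  # assumed shape: (mm, station_id, yyyymmdd)


def check_year_result(year: int, records: list[Record]) -> None:
    """Raise AssertionError if one year's output looks wrong."""
    assert len(records) <= TOP_PER_YEAR, "heap grew past its bound"
    assert all(mm >= 0 for mm, _, _ in records), "negative precipitation"
    assert all(day.startswith(str(year)) for _, _, day in records), "wrong year"
    amounts = [mm for mm, _, _ in records]
    assert amounts == sorted(amounts, reverse=True), "not sorted wettest first"


# Made-up result for illustration.
check_year_result(1976, [(1750.0, "B", "19760117"), (88.1, "D", "19760103")])
```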

Step 4: Reduce the years

The reducer merges yearly heaps, joins station metadata, computes country-decade stats, and renders the map.
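A sketch of the first and third reducer tasks, under a few assumptions: records are `(mm, station_id, yyyymmdd)` tuples, the country is taken from the two-letter prefix of the station ID, and the country-decade statistic shown is simply the maximum. The function names and station IDs are hypothetical, and the metadata join and map rendering are left out.

```python
import heapq
from collections import defaultdict
from typing import Iterable

Record = tuple[float, str, str]  # assumed shape: (mm, station_id, yyyymmdd)


def merge_top(yearly: list[list[Record]], n: int) -> list[Record]:
    """Merge per-year top lists into one global top-n, wettest first."""
    return heapq.nlargest(n, (rec for year in yearly for rec in year))


def country_decade_max(records: Iterable[Record]) -> dict[tuple[str, int], float]:
    """Largest daily total per (country prefix, decade)."""
    best: dict[tuple[str, int], float] = defaultdict(float)
    for mm, station_id, day in records:
        key = (station_id[:2], int(day[:4]) // 10 * 10)
        best[key] = max(best[key], mm)
    return dict(best)


# Made-up yearly results for illustration.
yearly = [
    [(1750.0, "NC000000001", "19760117"), (200.0, "US000000001", "19760301")],
    [(900.0, "NC000000001", "19790105"), (50.0, "US000000001", "19810101")],
]
leaderboard = merge_top(yearly, 2)
stats = country_decade_max(rec for year in yearly for rec in year)
```

Since each yearly list holds at most 100 records, the reducer's input is tiny compared with the raw rows, so it can run on a single machine.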

What's the point?

Extreme weather questions punish shortcuts. A clean modern subset can miss the actual record. A gridded product can be better for averages, but it smears the point measurement you need here.

The important thing is that the filtering rule is code, not prose: element PRCP, values in tenths of millimeters, an empty quality flag, and no missing value. Those details decide the result.
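Written as code, that rule might look like the sketch below. It assumes the by-year column order is station ID, date, element, value, measurement flag, quality flag, source flag, observation time, and that `-9999` marks a missing value; `prcp_mm` is a hypothetical name and the sample rows are made up.

```python
from typing import Optional

MISSING = "-9999"  # assumed missing-value sentinel


def prcp_mm(row: list[str]) -> Optional[float]:
    """Return precipitation in mm for a clean PRCP row, else None."""
    if len(row) < 6:
        return None
    element, value, q_flag = row[2], row[3], row[5]
    if element != "PRCP":
        return None  # not a precipitation record
    if q_flag.strip():
        return None  # any quality flag means the value failed a check
    if value.strip() in ("", MISSING):
        return None  # missing value
    return int(value) / 10.0  # tenths of millimeters to millimeters


clean = ["XX000000001", "19760117", "PRCP", "17500", "", "", "S", ""]
flagged = ["XX000000001", "19760117", "PRCP", "17500", "", "X", "S", ""]
```

Here `prcp_mm(clean)` gives 1750.0, since 17,500 tenths of a millimeter is 1,750.0 mm, while `prcp_mm(flagged)` gives `None`. Dropping the quality-flag check alone would let a flagged value onto the leaderboard.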
