
Scan 2.76B NYC taxi trips

In this example we:

  • Download every public TLC monthly Parquet file from 2011 through 2024.

  • Count pickups for yellow, green, FHV, and HVFHS trips.

  • Build a zone-by-month matrix across all 264 taxi zones.

  • Classify zones by the shape of their own time series.

I wanted to find ghost neighborhoods, but that only makes sense once the ride-share data is included. Without it, you are mostly measuring the limits of yellow cab coverage.

Dataset: NYC TLC monthly trip files

The public files are already split by taxi type, year, and month. That is the work queue.

import io
from dataclasses import dataclass

import pandas as pd
import pyarrow.parquet as pq
import requests
from burla import remote_parallel_map

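# Root of the public TLC monthly Parquet files.
BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"
# File-name prefixes for the four feeds: yellow, green, FHV, and high-volume FHV.
TAXI_TYPES = ["yellow", "green", "fhv", "fhvhv"]
# 2011 through 2024 inclusive.
YEARS = range(2011, 2025)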

@dataclass(frozen=True)
class MonthJob:
    taxi_type: str
    year: int
    month: int

def monthly_url(taxi_type: str, year: int, month: int) -> str:
    return f"{BASE}/{taxi_type}_tripdata_{year}-{month:02d}.parquet"

Step 1: Make one task per month file

The client builds one input per possible file. Missing files are handled inside the worker, because not every taxi type exists for the full time range.
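A minimal sketch of that job list. It assumes the `MonthJob` dataclass and the constants from the setup code, repeated here so the block runs on its own. The helper name `build_jobs` is illustrative and not taken from the original.

```python
from dataclasses import dataclass
from itertools import product

TAXI_TYPES = ["yellow", "green", "fhv", "fhvhv"]
YEARS = range(2011, 2025)


# Repeated from the setup code so this block is self-contained.
@dataclass(frozen=True)
class MonthJob:
    taxi_type: str
    year: int
    month: int


def build_jobs(taxi_types=TAXI_TYPES, years=YEARS):
    """Return one job per (taxi type, year, month), whether or not the file exists."""
    return [
        MonthJob(taxi_type, year, month)
        for taxi_type, year, month in product(taxi_types, years, range(1, 13))
    ]


jobs = build_jobs()
print(len(jobs))  # 4 feeds x 14 years x 12 months = 672 candidate files
```

Generating every combination keeps the client simple: it never needs to know when each feed started, because the worker reports which files were actually there.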

Step 2: Count pickups for one file

Each worker downloads one Parquet file and returns pickup counts by zone. It does not send raw trips back to the client.

The missing-file behavior matters. A public corpus this old has gaps: some months have no file for a given taxi type, and the schema is not guaranteed to be identical across feeds and years.
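One way such a worker might look, as a sketch under stated assumptions rather than the original implementation. The assumptions: the pickup zone column is named `PULocationID` up to letter case, a missing file comes back as HTTP 403 or 404, and each result is a small dict with a `status` field. The names `pickup_column` and `count_pickups` are hypothetical.

```python
from dataclasses import dataclass

BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"


# Repeated from the setup code so this block is self-contained.
@dataclass(frozen=True)
class MonthJob:
    taxi_type: str
    year: int
    month: int


def pickup_column(column_names):
    """Find the pickup zone column, ignoring case. Returns None if absent.

    Assumption: every feed calls it PULocationID, possibly with different casing.
    """
    for name in column_names:
        if name.lower() == "pulocationid":
            return name
    return None


def count_pickups(job):
    """Download one monthly file and return pickup counts by zone, not raw trips."""
    # Imported inside the worker so the helper above stays dependency-free.
    import io

    import pyarrow.parquet as pq
    import requests

    result = {"taxi_type": job.taxi_type, "year": job.year, "month": job.month}
    url = f"{BASE}/{job.taxi_type}_tripdata_{job.year}-{job.month:02d}.parquet"

    response = requests.get(url, timeout=300)
    # Assumption: the host answers 403 or 404 when a month has no file.
    if response.status_code in (403, 404):
        return {**result, "status": "missing", "counts": {}}
    response.raise_for_status()

    column = pickup_column(pq.read_schema(io.BytesIO(response.content)).names)
    if column is None:
        return {**result, "status": "no_zone_column", "counts": {}}

    # Read only the one column that is needed, then count in pandas.
    table = pq.read_table(io.BytesIO(response.content), columns=[column])
    zones = table.column(column).to_pandas().dropna().astype(int)
    counts = {int(zone): int(n) for zone, n in zones.value_counts().items()}
    return {**result, "status": "ok", "counts": counts}
```

Returning a status instead of raising on a missing file means one absent month does not abort the whole scan, and the client can later tell "no file" apart from "file with zero pickups".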

Step 3: Smoke test a few months

Run a mixed slice first so the code sees both old and new formats.
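A sketch of what a mixed slice could look like. The specific months are illustrative picks, not the ones used originally, and the result shape (a dict with a `status` key) is an assumption. `remote_parallel_map` is called with just a function and a list of inputs; `worker` stands for whatever per-file function you wrote in Step 2.

```python
from collections import Counter
from dataclasses import dataclass


# Repeated from the setup code so this block is self-contained.
@dataclass(frozen=True)
class MonthJob:
    taxi_type: str
    year: int
    month: int


# Illustrative slice: one month per feed, old and new, plus one month that
# should have no file because the high-volume feed starts later than 2011.
SMOKE_JOBS = [
    MonthJob("yellow", 2011, 1),
    MonthJob("green", 2014, 6),
    MonthJob("fhv", 2016, 1),
    MonthJob("fhvhv", 2021, 6),
    MonthJob("fhvhv", 2011, 1),
]


def summarize(results):
    """Tally results by status so a failed slice is obvious at a glance."""
    return Counter(result["status"] for result in results)


def run_smoke_test(worker):
    """Run the slice remotely and return the status tally."""
    # Imported lazily: this call needs a working Burla setup.
    from burla import remote_parallel_map

    results = remote_parallel_map(worker, SMOKE_JOBS)
    return summarize(results)
```

If the tally shows anything other than the statuses you expect, fix the worker before paying for the full run.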

Then scan the full corpus.

Step 4: Build the time series

The client reduces the monthly counts into a zone-by-month matrix and classifies the shape of each zone.
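The reduction can be sketched as below. It assumes each worker result is a dict with `year`, `month`, `status`, and a `counts` mapping of zone ID to pickups; that shape is an assumption, not something the original specifies. Plain dictionaries are used here to keep the sketch dependency-free; a pandas pivot would do the same job.

```python
from collections import defaultdict


def month_index(year, month, first_year=2011):
    """Map (year, month) to a 0-based column index in the matrix."""
    return (year - first_year) * 12 + (month - 1)


def build_matrix(results, first_year=2011, last_year=2024):
    """Sum pickups from all feeds into {zone: [count per month]}."""
    n_months = (last_year - first_year + 1) * 12
    matrix = defaultdict(lambda: [0] * n_months)
    for result in results:
        if result["status"] != "ok":
            continue  # missing or unreadable files contribute nothing
        column = month_index(result["year"], result["month"], first_year)
        for zone, pickups in result["counts"].items():
            # Adding across feeds is the point: yellow, green, FHV, and
            # high-volume FHV pickups in the same zone and month are combined.
            matrix[zone][column] += pickups
    return dict(matrix)
```

With 14 years of data each zone gets a 168-element series, so the whole matrix is small enough to keep in memory on the client.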

The classification happens after the scan, when every feed and every month is visible.
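The original does not spell out its classification rules, so the following is only a hypothetical scheme to show the idea. It compares a zone's first and last 12 months against its own busiest 12-month window. The labels, the `window` length, and the `floor` threshold are all assumptions made for illustration.

```python
def classify(series, window=12, floor=0.1):
    """Label one zone's monthly pickup series by its overall shape.

    Hypothetical labels:
      empty     - no pickups in any month
      faded     - active early, near zero at the end (a "ghost" candidate)
      emerged   - near zero early, active at the end
      transient - near zero at both ends, active in between
      steady    - active at both ends
    "Near zero" means below `floor` times the zone's busiest window.
    """
    if sum(series) == 0:
        return "empty"
    window = min(window, len(series))
    first = sum(series[:window])
    last = sum(series[-window:])
    # Busiest consecutive run of `window` months for this zone.
    peak = max(
        sum(series[start:start + window])
        for start in range(len(series) - window + 1)
    )
    early_quiet = first < floor * peak
    late_quiet = last < floor * peak
    if early_quiet and late_quiet:
        return "transient"
    if late_quiet:
        return "faded"
    if early_quiet:
        return "emerged"
    return "steady"
```

Because each zone is measured against its own peak, a quiet outer-borough zone and a busy Manhattan zone are judged by the same rule. Changing `window` or `floor` only re-runs this function over the in-memory matrix.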

What's the point?

Mobility data is full of traps. Yellow cabs, green cabs, app-based FHVs, and high-volume FHVs do not appear in the public data at the same time. If you scan one feed, you can mistake a reporting change for a neighborhood change.

This version keeps the question honest. Count every month, keep the output small, and do the interpretation after the scan. Then you can change the ghost definition without redownloading 2.76 billion trips.
