Page cover

Parallelize pandas apply

In this example we:

  • Split a Parquet event table by user id.

  • Run normal df.apply(..., axis=1) on each worker.

  • Write enriched Parquet chunks to shared storage.

  • Concatenate the chunk outputs into one final dataset.

Sometimes the ugly pandas row function is the business logic. Rewriting it into Spark just to make it run faster can quietly change the answer.

Dataset: event rows in Parquet

Assume the source dataset has user_id, url, and other event fields.

from pathlib import Path

import pandas as pd
import pyarrow.dataset as ds
from burla import remote_parallel_map

DATASET = "s3://my-bucket/events/"
OUT_DIR = Path("/workspace/shared/pandas-apply/enriched")
N_CHUNKS = 1_200

Step 1: Pick a partition key

Here we stripe user ids across 1,200 chunks. The important part is that every row belongs to exactly one chunk.

If you already have a user manifest, use that instead. The point is to create stable, non-overlapping inputs.

Step 2: Write the pandas function

The worker reads its slice, applies the same row function you would run locally, writes one Parquet chunk, and returns a small report.

The row function stays ugly if the real business logic is ugly. The scaling change is around it, not inside it.

Step 3: Smoke test one chunk

Run one chunk before launching 1,200 workers.

Then run the full job.

Step 4: Combine the enriched chunks

The client reads the output files after the parallel work finishes.

What's the point?

A lot of useful transforms are full of regexes, JSON parsing, weird buckets, and old product rules. They are not beautiful. They are just correct.

This lets you keep that code and change the amount of hardware underneath it. For a one-off backfill or migration, that is often the most honest thing to do. The worker owns one chunk, the output is inspectable, and the final combine step is ordinary pandas.

Last updated