
Parallelize pandas apply
Dataset: event rows in Parquet
from pathlib import Path
import pandas as pd
import pyarrow.dataset as ds
from burla import remote_parallel_map
DATASET = "s3://my-bucket/events/"
OUT_DIR = Path("/workspace/shared/pandas-apply/enriched")
N_CHUNKS = 1_200

Step 1: Pick a partition key
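One way to turn a partition key into chunk numbers is a stable hash taken modulo `N_CHUNKS`. The sketch below is only an illustration: the helper `chunk_id` and the sample key values are hypothetical, and the page does not say which column serves as the key.

```python
import zlib

N_CHUNKS = 1_200  # same value as in the setup above


def chunk_id(key: str, n_chunks: int = N_CHUNKS) -> int:
    """Map a partition-key value to a stable chunk number in [0, n_chunks)."""
    # crc32 gives the same result in every process, unlike the built-in
    # hash(), which is salted per interpreter run for strings.
    return zlib.crc32(key.encode("utf-8")) % n_chunks


# Hypothetical key values: rows that share a key land in the same chunk.
keys = ["user-1", "user-2", "user-1"]
ids = [chunk_id(k) for k in keys]
```

Keeping all rows for one key in a single chunk matters when the pandas function needs to see every row for that key together; if the function is purely row-wise, any even split works.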
Step 2: Write the pandas function
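A minimal sketch of what such a function might look like, assuming each worker receives one chunk as a DataFrame. The column `duration_ms`, the 1,000 ms threshold, and the names `label_row` and `enrich_chunk` are all hypothetical stand-ins for the real enrichment logic.

```python
import pandas as pd


def label_row(row: pd.Series) -> str:
    """Row-level logic; the column name and threshold are illustrative only."""
    return "long" if row["duration_ms"] >= 1_000 else "short"


def enrich_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Apply the row-level function to one chunk and return an enriched copy."""
    out = chunk.copy()  # leave the caller's frame untouched
    out["duration_label"] = out.apply(label_row, axis=1)
    return out


# Tiny in-memory stand-in for one chunk of event rows.
sample = pd.DataFrame({"user_id": ["a", "b"], "duration_ms": [250, 4_000]})
enriched = enrich_chunk(sample)
```

Because `enrich_chunk` takes a plain DataFrame and returns one, it can be run on a small local sample first and then handed unchanged to whatever maps it over the chunks.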
Step 3: Smoke test one chunk
Step 4: Combine the enriched chunks
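Combining can be as simple as concatenating the per-chunk frames. The sketch below uses small in-memory frames so that it runs anywhere; `combine_chunks` and the column names are hypothetical, and in practice the frames would presumably be loaded from the files written under `OUT_DIR` (for example with `pd.read_parquet`).

```python
import pandas as pd


def combine_chunks(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Stack enriched chunks into one frame with a fresh 0..n-1 index."""
    return pd.concat(frames, ignore_index=True)


# Stand-ins for two enriched chunks.
chunk_a = pd.DataFrame({"user_id": ["a"], "duration_label": ["short"]})
chunk_b = pd.DataFrame({"user_id": ["b", "c"], "duration_label": ["long", "short"]})
combined = combine_chunks([chunk_a, chunk_b])
```

`ignore_index=True` discards each chunk's own row index, which would otherwise repeat across chunks.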
What's the point?