Process thousands of files quickly.

A pattern for processing many files in parallel.

If you have a lot of files, the fastest pattern is usually:

  1. give each file to one function call

  2. run many function calls in parallel

  3. save each output to /workspace/shared

This keeps your script simple and makes it easy to scale up.

Before you start

Make sure you have already:

  1. installed Burla: pip install burla

  2. connected your machine: burla login

  3. started your cluster in the Burla dashboard

For shared filesystem details, read Read and Write GCS Files.

Step 1: Build a list of file paths

Start with a list of input files.

from pathlib import Path

input_file_paths = [str(path) for path in Path("/workspace/shared/logs/raw").glob("*.txt")]

Each path in the list becomes one parallel function call.

Step 2: Write one file-processing function

This function reads one input file and writes one output file.
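A minimal sketch of such a function, assuming the input files live under /workspace/shared/logs/raw as in Step 1. The line-count logic and the output directory name are placeholders for illustration; swap in your own processing.

```python
from pathlib import Path

def process_one_file(input_path, output_dir="/workspace/shared/logs/processed"):
    """Read one raw log file and write one output file.

    The line count and output directory are illustrative placeholders;
    replace them with your real processing and destination.
    """
    path = Path(input_path)
    line_count = sum(1 for _ in path.open())  # example "processing"
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{path.stem}.count"
    out_path.write_text(str(line_count))
    return str(out_path)
```

Keeping the output directory as a parameter also makes the function easy to test locally before you run it on the cluster.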

Step 3: Run all files in parallel

Pass your processing function and the list of file paths to remote_parallel_map. Burla runs one function call per path, spread across the machines in your cluster.
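A sketch of the call, assuming Burla is installed (pip install burla), you are logged in, and a cluster is running. The burla import is deferred inside a small wrapper so the file can be loaded and read without any of those in place.

```python
from pathlib import Path

def run_all(func, inputs):
    """Fan `func` out across the cluster, one call per input.

    Assumes `pip install burla`, `burla login`, and a running cluster;
    the import is deferred so this sketch loads without any of them.
    """
    from burla import remote_parallel_map
    return remote_parallel_map(func, inputs)

if __name__ == "__main__":
    input_file_paths = [
        str(p) for p in Path("/workspace/shared/logs/raw").glob("*.txt")
    ]
    # Replace `len` with your file-processing function from Step 2:
    results = run_all(len, input_file_paths)
```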

Step 4: Combine the per-file outputs into one report

Because every parallel call saved its output to /workspace/shared, the combine step can be a simple local loop over those output files.
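One way to write that loop, assuming the per-file outputs landed in /workspace/shared/logs/processed as in the Step 2 sketch. Both paths are assumptions; point them at wherever your outputs actually live.

```python
from pathlib import Path

def combine_outputs(output_dir="/workspace/shared/logs/processed",
                    report_path="/workspace/shared/logs/report.txt"):
    """Merge every per-file output into one report, one line per file."""
    lines = [
        f"{f.name}: {f.read_text().strip()}"
        for f in sorted(Path(output_dir).glob("*"))
    ]
    Path(report_path).write_text("\n".join(lines) + "\n")
    return len(lines)  # number of per-file outputs combined
```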

Step 5: Scale up safely

Before you run thousands of files, test with a small subset first.

When the small test works, run the full list.
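The test-then-scale pattern can be sketched as a tiny helper. Here `run` stands for whatever launches your parallel jobs (for example, a wrapper around remote_parallel_map), and the sample size of 10 is an arbitrary starting point, not a recommendation from Burla.

```python
def run_in_stages(run, inputs, sample_size=10):
    """Process a small sample first; run the full list only if it works.

    `run` is whatever launches your parallel jobs; any failure on the
    sample raises here, before the expensive full run starts.
    """
    sample = inputs[:sample_size]
    run(sample)
    return run(inputs)
```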

What to do next

If you have one very large file instead of many small files, continue with Process one giant file quickly.