
Process thousands of files

In this example we:

  • Read a folder of raw log files from Burla shared storage.

  • Run one Python function call per file.

  • Write one compact JSON report per file.

  • Combine the per-file reports into a single error summary.

This is the first pattern I would reach for when the input is already split into files. Do not make the worker aware of the whole dataset. Give it one file, make it produce one small report, then reduce those reports after the parallel work is done.

Dataset: raw application logs

Assume a daily log export has already been uploaded to:

/workspace/shared/logs/raw/

Every worker can read that folder. Anything the workers write back under /workspace/shared is visible to the client and to later workers.

import json
from pathlib import Path

from burla import remote_parallel_map

RAW_DIR = Path("/workspace/shared/logs/raw")
REPORT_DIR = Path("/workspace/shared/logs/reports")
FINAL_DIR = Path("/workspace/shared/logs/final")

Step 1: Build the work list

The client does the cheap planning step. Each path in input_paths becomes one function call.

If this list is empty, fix the upload or path before thinking about parallelism.
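As a minimal sketch of this planning step, the helper below lists the files in a folder and returns them as sorted string paths. The function name `build_work_list` and the `*.log` pattern are assumptions made for illustration; adjust the pattern to match how your export names its files.

```python
from pathlib import Path


def build_work_list(raw_dir, pattern="*.log"):
    """Return the sorted paths of regular files in raw_dir that match pattern.

    The "*.log" default is an assumption about how the export is named.
    """
    # Skip directories so every entry in the list is something a worker can open.
    # Sorting keeps the list stable between runs, which makes reruns easier to compare.
    return sorted(str(path) for path in Path(raw_dir).glob(pattern) if path.is_file())
```

With the constants defined earlier, `input_paths = build_work_list(RAW_DIR)` followed by `print(len(input_paths))` shows how many function calls the job will make. A count of zero points to the upload or the path, not to the cluster.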

Step 2: Process one file

The worker reads one file, counts the lines that matter, writes a small JSON report, and returns the report path.

Return dictionaries for small metadata. Write larger outputs to shared storage and return paths.
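One way the per-file worker could look is sketched below. The name `process_file`, the report fields, and the rule that a line "matters" when it contains `ERROR` or `WARN` are all assumptions; the real log format decides what to count. The report directory defaults to the same location as `REPORT_DIR`.

```python
import json
from pathlib import Path

# Same location as REPORT_DIR in the setup code; repeated so the sketch stands alone.
DEFAULT_REPORT_DIR = "/workspace/shared/logs/reports"


def process_file(input_path, report_dir=DEFAULT_REPORT_DIR):
    """Count lines in one log file, write a JSON report, and return the report path."""
    source = Path(input_path)
    total_lines = error_lines = warning_lines = 0

    # Stream the file so memory use does not depend on file size.
    # errors="replace" keeps a single bad byte from failing the whole file.
    with source.open("r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            total_lines += 1
            # Assumption: the log level appears as a plain token in the line.
            if "ERROR" in line:
                error_lines += 1
            elif "WARN" in line:
                warning_lines += 1

    report = {
        "input": str(source),
        "total_lines": total_lines,
        "error_lines": error_lines,
        "warning_lines": warning_lines,
    }

    out_dir = Path(report_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # One report per input file, named after the input so it is easy to trace back.
    report_path = out_dir / f"{source.name}.json"
    report_path.write_text(json.dumps(report), encoding="utf-8")
    return str(report_path)
```

The function takes a single path and knows nothing about the rest of the dataset, so it can be called locally on one file before it ever runs on a worker.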

Step 3: Smoke test a few files

Run a small slice first. This catches path mistakes, encoding problems, package issues, and bad assumptions about the file format.

If the reports look right, launch the full file list.
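A smoke test along these lines might be sketched as follows. The helper `smoke_test` is hypothetical, and it takes the map function as an argument so the same code can be exercised locally with a plain stand-in before `remote_parallel_map` is passed in. It assumes the worker returns the path of a JSON report that the client can read from shared storage.

```python
import json
from pathlib import Path


def smoke_test(map_fn, worker, input_paths, sample_size=5):
    """Run worker over the first few inputs and load the reports it wrote.

    map_fn is called as map_fn(worker, inputs) and must return one
    report path per input.
    """
    sample = list(input_paths[:sample_size])
    report_paths = list(map_fn(worker, sample))
    if len(report_paths) != len(sample):
        raise RuntimeError(
            f"expected {len(sample)} reports, got {len(report_paths)}"
        )
    # Loading each report confirms it exists and contains valid JSON.
    return [json.loads(Path(p).read_text(encoding="utf-8")) for p in report_paths]
```

Assuming a worker function named `process_file`, `smoke_test(remote_parallel_map, process_file, input_paths)` would run the sample on the cluster. Once the returned reports look right, the full run is the same call shape without the slice: `report_paths = remote_parallel_map(process_file, input_paths)`.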

Step 4: Reduce the reports

The reduce step runs locally because the result list is small.
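The reduce could be as small as the sketch below. It assumes each report is a JSON object with `input`, `total_lines`, and `error_lines` fields; those field names, the function name `reduce_reports`, and the output filename `error_summary.json` are illustrative choices, not part of the original example.

```python
import json
from pathlib import Path


def reduce_reports(report_paths, final_dir, top_n=10):
    """Combine per-file reports into one summary and write it to final_dir."""
    reports = [
        json.loads(Path(p).read_text(encoding="utf-8")) for p in report_paths
    ]
    # Rank files by error count so the summary points at the worst inputs first.
    ranked = sorted(reports, key=lambda r: r["error_lines"], reverse=True)
    summary = {
        "files": len(reports),
        "total_lines": sum(r["total_lines"] for r in reports),
        "error_lines": sum(r["error_lines"] for r in reports),
        "worst_files": [
            {"input": r["input"], "error_lines": r["error_lines"]}
            for r in ranked[:top_n]
        ],
    }

    out_dir = Path(final_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "error_summary.json").write_text(
        json.dumps(summary, indent=2), encoding="utf-8"
    )
    return summary
```

Given the list of report paths returned by the parallel run, `reduce_reports(report_paths, FINAL_DIR)` writes the summary next to the other outputs on shared storage and returns it for inspection.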

What's the point?

The useful abstraction is not "run logs on a cluster." It is one file per function call, one report per file, one small reduce at the end.

That shape is easy to debug. If one report looks wrong, you know exactly which input produced it. If one file fails, rerun that file. If the job grows from 1,000 files to 100,000 files, the function body does not change.

If you have one very large file instead of many small files, continue with Process one giant file quickly.
