Process one giant file

In this example we:

  • Split one giant newline-delimited text file into chunk files.

  • Run one worker per chunk.

  • Write one compact JSON report per chunk.

  • Combine chunk reports into a final result.

One giant file is usually not a distributed-systems problem. It is a splitting problem. Once the file is chunked, the rest of the job looks like the many-files example.

Dataset: one large JSONL export

Assume the input has already been uploaded to Burla shared storage:

/workspace/shared/giant/events.jsonl

The file is newline-delimited JSON. Each line is one event row.

import json
from pathlib import Path

from burla import remote_parallel_map

INPUT_PATH = Path("/workspace/shared/giant/events.jsonl")
CHUNK_DIR = Path("/workspace/shared/giant/chunks")
REPORT_DIR = Path("/workspace/shared/giant/reports")
FINAL_DIR = Path("/workspace/shared/giant/final")
LINES_PER_CHUNK = 50_000

Step 1: Split without loading the file into memory

The client streams the input file once and writes chunk files. This is intentionally boring code, because the split step should be easy to inspect.

If this step is slow, that is fine. It runs once. The expensive part is processing every chunk.
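A minimal sketch of the streaming split, assuming a hypothetical helper named `split_into_chunks` and a `chunk_00000.jsonl` naming scheme (neither comes from the Burla API). It holds one line in memory at a time and returns the chunk paths as strings so they can be sent to workers later.

```python
from pathlib import Path


def split_into_chunks(input_path, chunk_dir, lines_per_chunk):
    """Stream input_path once and write numbered chunk files into chunk_dir.

    Hypothetical helper: the name and the chunk file naming are assumptions.
    """
    input_path, chunk_dir = Path(input_path), Path(chunk_dir)
    chunk_dir.mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    out = None
    try:
        with input_path.open("r", encoding="utf-8") as src:
            # Iterating the file object reads one line at a time.
            for line_number, line in enumerate(src):
                if line_number % lines_per_chunk == 0:
                    # Start a new chunk file every lines_per_chunk lines.
                    if out is not None:
                        out.close()
                    path = chunk_dir / f"chunk_{len(chunk_paths):05d}.jsonl"
                    out = path.open("w", encoding="utf-8")
                    chunk_paths.append(str(path))
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

With the constants defined earlier, the call would look like `chunk_paths = split_into_chunks(INPUT_PATH, CHUNK_DIR, LINES_PER_CHUNK)`.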

Step 2: Process one chunk

Each worker reads one chunk and returns only a small report.
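One way the worker could look is sketched below. The `amount` field, the report keys, and the choice to place reports in a `reports` directory next to the chunk directory are all assumptions made for illustration; swap in whatever your events actually contain.

```python
import json
from pathlib import Path


def process_chunk(chunk_path):
    """Scan one chunk, write a compact JSON report, and return it as a dict.

    Assumptions: events may carry a numeric "amount" field, and reports live
    in a "reports" directory that is a sibling of the chunk directory.
    """
    chunk_path = Path(chunk_path)
    report_dir = chunk_path.parent.parent / "reports"
    report_dir.mkdir(parents=True, exist_ok=True)

    rows = 0
    malformed = 0
    amount_sum = 0.0
    with chunk_path.open("r", encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                malformed += 1  # count bad rows instead of failing the chunk
                continue
            rows += 1
            amount = event.get("amount") if isinstance(event, dict) else None
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(amount, (int, float)) and not isinstance(amount, bool):
                amount_sum += amount

    report = {
        "chunk": chunk_path.name,
        "rows": rows,
        "malformed": malformed,
        "amount_sum": amount_sum,
    }
    report_path = report_dir / f"{chunk_path.stem}.json"
    report_path.write_text(json.dumps(report), encoding="utf-8")
    return report
```

The function takes a single string path so that each element of the chunk list maps to exactly one call.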

Step 3: Test one chunk, then run all chunks

Run the first chunk before launching the whole file.

Then send the full chunk list.
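The two-phase run can be sketched as a small wrapper. `run_in_two_phases` is a hypothetical helper, and `parallel_map` stands for any callable that takes `(function, inputs)` and returns the results; this sketch assumes `remote_parallel_map` can be passed in that role.

```python
def run_in_two_phases(worker, inputs, parallel_map):
    """Run worker on the first input alone, then on every input.

    parallel_map is any callable with the shape parallel_map(function, inputs).
    """
    if not inputs:
        return []

    # Phase 1: a single-input run surfaces import, path, and parsing
    # problems before the full job is launched.
    first_results = list(parallel_map(worker, inputs[:1]))
    print("first chunk report:", first_results[0])

    # Phase 2: send the full list. The first input is processed again,
    # so the worker should be safe to rerun on the same chunk.
    return list(parallel_map(worker, inputs))
```

On a cluster the call might read `reports = run_in_two_phases(process_chunk, chunk_paths, remote_parallel_map)`. For a local dry run, a stand-in such as `lambda f, xs: [f(x) for x in xs]` works in place of `remote_parallel_map`.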

Step 4: Reduce the chunk reports

The workers do the row-level scan. The client only combines counts and sums.
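A sketch of the reduce step, assuming each chunk report is a dict with the hypothetical keys `rows`, `malformed`, and `amount_sum`. The helper names and the `summary.json` file name are also assumptions.

```python
import json
from pathlib import Path


def combine_reports(reports):
    """Add up per-chunk counts and sums into one summary dict."""
    total = {"chunks": 0, "rows": 0, "malformed": 0, "amount_sum": 0.0}
    for report in reports:
        total["chunks"] += 1
        total["rows"] += report["rows"]
        total["malformed"] += report["malformed"]
        total["amount_sum"] += report["amount_sum"]
    return total


def write_summary(total, final_dir):
    """Write the combined summary as JSON and return the path written."""
    final_dir = Path(final_dir)
    final_dir.mkdir(parents=True, exist_ok=True)
    summary_path = final_dir / "summary.json"
    summary_path.write_text(json.dumps(total, indent=2), encoding="utf-8")
    return summary_path
```

Because the reports hold only counts and sums, this step stays cheap on the client no matter how many rows the workers scanned.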

What's the point?

The split is not a hack. It is the part that turns one awkward input into a clean parallel job.

Once the chunk files exist, every worker has the same contract: read one chunk, write one report, return a small dict. That makes failures localized and reruns cheap. If chunk 37 has malformed JSON, you do not have to reason about a cluster. You inspect chunk 37.
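Inspecting a single chunk can be as simple as the sketch below. `find_malformed_lines` is a hypothetical helper, not part of Burla; it reports the line numbers within one chunk file that fail to parse as JSON.

```python
import json


def find_malformed_lines(chunk_path, limit=10):
    """Return up to `limit` (line_number, error_message) pairs for bad rows."""
    bad = []
    with open(chunk_path, "r", encoding="utf-8") as src:
        for line_number, line in enumerate(src, start=1):
            if not line.strip():
                continue  # blank lines are not treated as malformed
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                bad.append((line_number, exc.msg))
                if len(bad) >= limit:
                    break
    return bad
```

Line numbers here are relative to the chunk. With a fixed number of lines per chunk and zero-based chunk indexes, the position in the original file is `chunk_index * lines_per_chunk + line_number`.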

If you need help choosing the right input shape, continue with "Decide how to split your work".