
Process database rows

In this example we:

  • Split an indexed PostgreSQL table into non-overlapping ID ranges.

  • Run one worker per range.

  • Return small aggregate reports instead of raw rows.

  • Use max_parallelism so the database remains the constraint, not the cluster.

This is the pattern I would use for a backfill where the source of truth is still the database. The goal is not to replace SQL. The goal is to run ordinary Python over many row ranges without turning one script into a queueing system.

Dataset: an orders table

Assume the table has an indexed integer id column, plus status, amount, and updated_at columns.

The workers need a database URL they can reach from the Burla cluster. Do not use localhost unless the database actually runs on the worker itself: on a worker, localhost resolves to the worker, not to your machine.

import os
from dataclasses import dataclass

import psycopg2
from burla import remote_parallel_map

DATABASE_URL = os.environ["DATABASE_URL"]
ROWS_PER_RANGE = 25_000
MAX_DB_CONNECTIONS = 20
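The localhost warning above can be turned into a cheap client-side check before any worker starts. This is only a sketch: `is_loopback_url` and `LOOPBACK_HOSTS` are hypothetical names, and it assumes the database URL is in URL form (`postgresql://...`) rather than a key=value DSN.

```python
from urllib.parse import urlparse

# Hosts that point back at whichever machine resolves them.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_loopback_url(database_url):
    """Return True if the URL names a loopback host or no host at all."""
    host = urlparse(database_url).hostname
    # A URL without a host falls back to a local default, so treat it as local.
    return host is None or host in LOOPBACK_HOSTS
```

Calling `is_loopback_url(DATABASE_URL)` on the client and refusing to continue when it returns True fails fast, instead of failing once per worker.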

Step 1: Build ID ranges

First ask the database for the minimum and maximum id you intend to scan. Then split that range into jobs.

ID ranges are easy to reason about because they do not overlap. They also make reruns obvious: rerun the failed ranges.
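A minimal sketch of this step, assuming each job is a half-open `(start_id, end_id)` tuple. The helper names `fetch_id_bounds`, `build_ranges`, and `load_ranges` are illustrative and not part of Burla or psycopg2. The `orders` table name follows the dataset description above.

```python
def fetch_id_bounds(conn):
    """Return (min_id, max_id) for the orders table, or None if it is empty."""
    with conn.cursor() as cur:
        cur.execute("SELECT MIN(id), MAX(id) FROM orders")
        min_id, max_id = cur.fetchone()
    return None if min_id is None else (min_id, max_id)


def build_ranges(min_id, max_id, rows_per_range):
    """Split [min_id, max_id] into consecutive half-open (start, end) ranges."""
    if rows_per_range <= 0:
        raise ValueError("rows_per_range must be positive")
    ranges = []
    start = min_id
    while start <= max_id:
        # The end is exclusive, so the last range stops at max_id + 1.
        end = min(start + rows_per_range, max_id + 1)
        ranges.append((start, end))
        start = end
    return ranges


def load_ranges(database_url, rows_per_range):
    """Open one short-lived connection, read the bounds, and build the jobs."""
    import psycopg2  # imported lazily so build_ranges works without the driver

    conn = psycopg2.connect(database_url)
    try:
        bounds = fetch_id_bounds(conn)
    finally:
        conn.close()
    return [] if bounds is None else build_ranges(*bounds, rows_per_range)
```

With the constants defined earlier, `ranges = load_ranges(DATABASE_URL, ROWS_PER_RANGE)` would produce the job list. Because the split is on ID values, each range covers at most `ROWS_PER_RANGE` rows; gaps in the ID sequence simply make some ranges smaller.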

Step 2: Process one range

Each worker opens its own database connection, runs one bounded query, and returns a small aggregate.

If the worker needs to update rows, make the write idempotent. For read-only analytics, keep it read-only.
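One way the read-only worker could look, as a sketch under stated assumptions: the aggregate chosen here (row count and total amount per status) is only an example, `summarize_range` and `process_range` are hypothetical names, and the worker is assumed to see the same `DATABASE_URL` value as the client.

```python
import os
from decimal import Decimal

# Assumption: mirrors the DATABASE_URL constant from the setup code. .get() is
# used only so this sketch can be imported without the variable set.
DATABASE_URL = os.environ.get("DATABASE_URL", "")

# Bounded by the half-open ID range, so the indexed id column limits the scan.
SUMMARY_SQL = """
    SELECT status, COUNT(*), COALESCE(SUM(amount), 0)
    FROM orders
    WHERE id >= %s AND id < %s
    GROUP BY status
"""


def summarize_range(cur, start_id, end_id):
    """Run one bounded query and fold its rows into a small report dict."""
    cur.execute(SUMMARY_SQL, (start_id, end_id))
    by_status = {}
    row_count = 0
    total_amount = Decimal("0")
    for status, count, amount in cur.fetchall():
        by_status[status] = count
        row_count += count
        total_amount += amount
    return {
        "start_id": start_id,
        "end_id": end_id,
        "row_count": row_count,
        "total_amount": total_amount,
        "by_status": by_status,
    }


def process_range(id_range):
    """Worker entry point: one connection, one query, one small result."""
    import psycopg2  # imported here so the module loads without the driver

    start_id, end_id = id_range
    conn = psycopg2.connect(DATABASE_URL)
    try:
        conn.set_session(readonly=True)  # keep the analytics path read-only
        with conn.cursor() as cur:
            return summarize_range(cur, start_id, end_id)
    finally:
        conn.close()
```

Keeping the query logic in `summarize_range` separate from connection handling means the SQL path can be exercised with any cursor-like object, while `process_range` stays the only function that touches the network.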

Step 3: Smoke test one range

Test one small range before opening many database connections.

This catches network access, credentials, package installs, and SQL mistakes before the full backfill starts.
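A smoke test along these lines might look like the sketch below. `smoke_test`, `sample_size`, and the injectable `map_fn` argument are hypothetical; the sketch assumes `remote_parallel_map` takes the worker function and a list of inputs, and that jobs are half-open `(start_id, end_id)` tuples.

```python
def smoke_test(worker, id_range, map_fn=None, sample_size=100):
    """Run the worker once on a deliberately small slice of one range."""
    start_id, end_id = id_range
    small_range = (start_id, min(end_id, start_id + sample_size))
    if map_fn is None:
        # Assumption: Burla's entry point, as imported in the setup code.
        from burla import remote_parallel_map as map_fn
    results = list(map_fn(worker, [small_range]))
    if len(results) != 1:
        raise RuntimeError(f"expected 1 result, got {len(results)}")
    return results[0]
```

Passing a plain local function as `map_fn` exercises the worker on the client first; leaving it unset sends the same single input through the cluster, which is where network access, credentials, and package installs are actually checked.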

Step 4: Run the full range list

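max_parallelism is the important line. It caps the number of workers running at once, and because each worker holds one connection, it also caps concurrent database connections. Set it no higher than MAX_DB_CONNECTIONS.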
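The full run could be sketched as follows, assuming `remote_parallel_map` accepts `max_parallelism` as a keyword argument, as the text describes. `run_all_ranges` and its `map_fn` parameter are hypothetical helpers added so the call can be dry-run locally.

```python
def run_all_ranges(worker, ranges, max_db_connections, map_fn=None):
    """Map the worker over every range with a hard cap on live workers."""
    if max_db_connections < 1:
        raise ValueError("max_db_connections must be at least 1")
    if map_fn is None:
        # Assumption: Burla's entry point, as imported in the setup code.
        from burla import remote_parallel_map as map_fn
    # One worker holds one connection, so this cap is also the connection cap.
    return list(map_fn(worker, ranges, max_parallelism=max_db_connections))
```

With the names used earlier, the call would be `reports = run_all_ranges(process_range, ranges, MAX_DB_CONNECTIONS)`. If some ranges fail, the same function can be called again with only those ranges.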

Step 5: Reduce the results

The client combines the small per-range reports.
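A possible reducer, assuming each per-range report is a dict with `row_count`, `total_amount`, and `by_status` keys. That report shape is an assumption of this sketch, and `combine_reports` is a hypothetical name.

```python
from collections import Counter
from decimal import Decimal


def combine_reports(reports):
    """Merge small per-range reports into one summary on the client."""
    reports = list(reports)
    row_count = 0
    total_amount = Decimal("0")
    by_status = Counter()
    for report in reports:
        row_count += report["row_count"]
        total_amount += report["total_amount"]
        by_status.update(report["by_status"])  # adds counts key by key
    return {
        "ranges": len(reports),
        "row_count": row_count,
        "total_amount": total_amount,
        "by_status": dict(by_status),
    }
```

Because the ranges do not overlap, plain sums are safe: no row is counted twice, and the combined totals do not depend on the order in which results arrive.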

What's the point?

The database is usually the bottleneck, so the best version of this job is explicit about database pressure.

The cluster can run thousands of workers. That does not mean your database wants thousands of connections. Split by indexed ranges, keep the worker query bounded, return small results, and cap concurrency where the real constraint lives.
