
Limit parallelism for APIs or databases

Keep Burla jobs inside external service limits.

Use this when the slowest or most fragile part of your job is outside Burla. Do not use every available CPU when an API quota, website, database, or model provider is the real limit. The unit of work is usually a chunk of IDs, URLs, prompts, files, or database ranges. Each worker should reuse one client or connection inside that chunk. The output should include successes and failures so you can retry only the work that failed.

Parallelism is not always the target. Sometimes the target is finishing the whole job without breaking the contract with another system.

Start from the external limit

Write down the real limit first.

Examples:

  1. API: 1,000 requests per second

  2. website: 2 requests per second per worker, plus a global worker cap

  3. Postgres: 200 safe write connections

  4. LLM provider: 60,000 tokens per minute

  5. vector database: 100 concurrent upsert batches

Then choose:

  1. chunk size

  2. per-worker pacing

  3. max_parallelism

The rough formula is:

global throughput = live workers * per-worker throughput

If each worker makes one request per second and you set max_parallelism=500, your job tries to make about 500 requests per second.
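The formula can also be read backwards to turn a known limit into a worker cap. A minimal sketch, assuming a hypothetical helper named `workers_for_limit` and a `headroom` fraction; neither is part of Burla:

```python
import math


def workers_for_limit(limit_per_s: float, per_worker_per_s: float, headroom: float = 1.0) -> int:
    """Largest worker count whose combined rate stays within limit_per_s * headroom.

    Hypothetical helper: inverts
    global throughput = live workers * per-worker throughput.
    """
    if per_worker_per_s <= 0:
        raise ValueError("per_worker_per_s must be positive")
    return max(1, math.floor(limit_per_s * headroom / per_worker_per_s))


# API limit of 1,000 requests/s, workers paced at 1 request/s, half the limit as headroom.
cap = workers_for_limit(1000, 1.0, headroom=0.5)
```

The result is a candidate value for max_parallelism, not a guarantee. Bursts and retries can still push the real rate above the estimate.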

Chunk IDs for an API backfill

Plan chunks on the client.
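One way to plan chunks before any worker starts is a plain slicing helper. This is a sketch; `make_chunks` and the chunk size of 4 are illustrative choices, not names from Burla:

```python
from typing import Iterator, Sequence


def make_chunks(ids: Sequence[int], chunk_size: int) -> Iterator[list[int]]:
    """Yield consecutive slices of ids, each at most chunk_size long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(ids), chunk_size):
        yield list(ids[start:start + chunk_size])


# Ten IDs in chunks of four: the last chunk is shorter.
chunks = list(make_chunks(range(1, 11), 4))
```

Each chunk then becomes one input to the worker function, so the number of chunks, not the number of IDs, is what the worker cap applies to.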

Put pacing and provider behavior next to the HTTP call.
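A sketch of what that can look like inside the worker function. The names `process_chunk`, `call_api`, and `RateLimited` are hypothetical, and the doubling backoff is an assumed policy, not a provider requirement. The `sleep` parameter is injectable only so the pacing can be tested without waiting:

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by call_api when the provider asks the caller to slow down."""


def process_chunk(ids, call_api, min_interval_s=1.0, max_attempts=3, sleep=time.sleep):
    """Call call_api once per ID, pausing between IDs, and return one row per ID."""
    rows = []
    for item_id in ids:
        for attempt in range(max_attempts):
            try:
                rows.append({"id": item_id, "ok": True, "result": call_api(item_id)})
                break
            except RateLimited:
                # Back off longer on each rate-limit response, then retry the same ID.
                sleep(min_interval_s * 2 ** (attempt + 1))
            except Exception as exc:
                # Any other error is recorded as a failed row, not raised.
                rows.append({"id": item_id, "ok": False, "error": repr(exc)})
                break
        else:
            # Every attempt was rate limited.
            rows.append({"id": item_id, "ok": False, "error": "rate limited"})
        # Pause after each ID; the real rate is slightly below 1 / min_interval_s.
        sleep(min_interval_s)
    return rows
```

In a real job, `call_api` would wrap one HTTP client created once at the top of the chunk, so the connection is reused across every ID in it.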

Cap live workers with max_parallelism.
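A sketch of passing the cap through. It assumes Burla exposes `remote_parallel_map` and that it accepts `max_parallelism` as a keyword argument, as this page implies; check the signature in your installed version. The wrapper `run_capped` and its `map_fn` parameter are hypothetical and exist only so the call can be tried locally with a stand-in:

```python
def run_capped(process_chunk, chunks, cap, map_fn=None):
    """Run process_chunk over chunks with at most `cap` live workers."""
    if map_fn is None:
        # Assumption: Burla's entry point and keyword name; not verified here.
        from burla import remote_parallel_map as map_fn
    return list(map_fn(process_chunk, chunks, max_parallelism=cap))


def local_map(fn, inputs, max_parallelism=None):
    """Sequential stand-in for a dry run; it ignores the cap."""
    return [fn(item) for item in inputs]
```

A dry run such as `run_capped(len, [[1, 2], [3]], cap=2, map_fn=local_map)` exercises the worker function on a couple of chunks before any remote workers are started.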

Write one JSON line per input to a JSONL file. That file is both the output and the retry manifest, so failed rows stay visible.
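A sketch of using one JSONL file as both output and retry manifest. The row shape (`id`, `ok`, optional `error`) and the helper names `append_rows` and `failed_ids` are assumptions for illustration:

```python
import json
from pathlib import Path


def append_rows(path, rows):
    """Append one JSON object per line."""
    with Path(path).open("a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def failed_ids(path):
    """Return IDs whose most recent row is a failure, in first-seen order."""
    latest = {}
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                latest[row["id"]] = row["ok"]  # a later row overrides an earlier one
    return [item_id for item_id, ok in latest.items() if not ok]
```

Because later rows override earlier ones, a retry pass can append to the same file, and `failed_ids` shrinks as retries succeed.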

Keep one database connection per worker

For databases, count connections before CPUs.

If each worker opens one connection and Postgres can safely handle 80 write connections, start with max_parallelism=80.

The bottleneck here is not Python. It is the sink.
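A sketch of the one-connection-per-chunk shape. It uses the standard library's `sqlite3` as a stand-in so it runs anywhere; a Postgres job would use a Postgres driver instead. The table name `items` and the function `write_chunk` are hypothetical:

```python
import sqlite3


def write_chunk(db_path, rows):
    """Open one connection, write the whole chunk in one transaction, then close it."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an insert fails
            conn.execute(
                "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, value TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO items (id, value) VALUES (?, ?)", rows
            )
        return {"rows": len(rows), "ok": True}
    except sqlite3.Error as exc:
        return {"rows": len(rows), "ok": False, "error": repr(exc)}
    finally:
        conn.close()  # the connection never outlives the chunk
```

With this shape, the number of open connections equals the number of live workers, which is why the worker cap doubles as a connection cap.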

Be polite to websites

For static pages, one worker should keep one HTTP client open for a chunk of URLs.
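A sketch of a polite fetch loop. The client is passed in so one instance serves the whole chunk. `fetch_chunk` is a hypothetical name, and the default of 2 requests per second is an example rate:

```python
import time


def fetch_chunk(urls, client, requests_per_s=2.0, sleep=time.sleep):
    """Fetch each URL with the same client, pausing between requests."""
    delay = 1.0 / requests_per_s
    rows = []
    for url in urls:
        try:
            response = client.get(url)
            status = response.status_code
            rows.append({"url": url, "ok": status < 400, "status": status})
        except Exception as exc:
            rows.append({"url": url, "ok": False, "error": repr(exc)})
        sleep(delay)  # fixed pause keeps this worker at or below requests_per_s
    return rows
```

With httpx, a caller would look roughly like `with httpx.Client(timeout=10.0) as client: rows = fetch_chunk(urls, client)`, which opens the client once and closes it when the chunk is done.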

For pages that need JavaScript, use a browser image or a browser-specific tool. A plain httpx request does not execute JavaScript, so it does not test the same thing.

Model providers and token limits

For an LLM provider, the limit is often tokens per minute, not requests per second.

Estimate tokens per input, then choose a chunk size and worker count that stay below the limit.

For example, 25 live workers that each send one prompt per second try to send about 25 prompts per second across the job.

If the provider bills or limits by token, reduce worker count when prompts or outputs get longer.
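The token arithmetic can be sketched as a small helper. The function name `llm_worker_cap` and the example numbers (500 tokens per prompt, one prompt every two seconds per worker) are assumptions; only the 60,000 tokens-per-minute limit comes from the list of example limits on this page:

```python
import math


def llm_worker_cap(tokens_per_minute_limit, tokens_per_prompt, prompts_per_worker_per_s):
    """Largest worker count whose combined token rate stays under the limit."""
    tokens_per_worker_per_min = tokens_per_prompt * prompts_per_worker_per_s * 60
    if tokens_per_worker_per_min <= 0:
        raise ValueError("token rate per worker must be positive")
    # Never returns less than 1; if one worker alone exceeds the limit,
    # slow the per-worker pace as well.
    return max(1, math.floor(tokens_per_minute_limit / tokens_per_worker_per_min))


# 60,000 tokens/min; about 500 tokens per prompt (input plus output); 0.5 prompts/s per worker.
cap = llm_worker_cap(60_000, 500, 0.5)
# Prompts twice as long halve the cap.
cap_long = llm_worker_cap(60_000, 1_000, 0.5)
```

Counting input and output tokens together matters here, since a limit expressed in tokens per minute is usually consumed by both.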

Choose the first value for max_parallelism

Start lower than the theoretical limit.

Examples:

  1. API allows 1,000 requests per second. Start at 500.

  2. Postgres has 200 available connections. Start at 80.

  3. Website tolerated 100 workers in a test. Start at 50.

  4. GPU quota allows 16 workers. Start at 8.

  5. Vector database allows 100 upserts. Start at 40.

Raise the cap after you see clean logs, stable latency, and no growing error rate.
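One way to make the start-low, raise-later habit explicit is a pair of small functions. This is a sketch: the 0.5 starting fraction mirrors several examples above, but the 1% error threshold, the 1.25 growth step, and halving after a bad run are assumptions, not Burla behavior:

```python
def starting_cap(limit, fraction=0.5):
    """First value for max_parallelism: a fraction of the theoretical limit."""
    return max(1, int(limit * fraction))


def next_cap(current, limit, error_rate, latency_stable, max_error_rate=0.01, step=1.25):
    """Suggest the cap for the next run based on how the last run went."""
    if error_rate > max_error_rate or not latency_stable:
        # Assumed policy: back off sharply when the last run was not clean.
        return max(1, current // 2)
    # Clean run: grow by at least one worker, never past the limit.
    return min(limit, max(current + 1, int(current * step)))
```

For example, `starting_cap(1000)` gives 500, and a clean run at 500 suggests 625 for the next run.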

Examples that use this pattern