Page cover

ETL 10K S3 files to Postgres

In this example we:

  • List 10,000 gzipped JSON files from S3.

  • Transform each file in a Burla worker.

  • Load cleaned rows into Postgres while capping database concurrency.

  • Return per-file load reports so retries are obvious.

This is the kind of job that turns into an Airflow ticket when the actual problem is just: yesterday's files are too slow on one machine.

Dataset: daily S3 file drop

Assume each raw file is gzipped JSONL under a date prefix.

import gzip
import json
import os
from pathlib import Path

import boto3
import psycopg2
from burla import remote_parallel_map
from psycopg2.extras import execute_values

S3_BUCKET = "my-events-bucket"
DATE = "2025-04-19"
DATABASE_URL = os.environ["DATABASE_URL"]
MAX_DB_LOADERS = 25
REPORT_PATH = Path("/workspace/shared/file-drop-etl/load-report.jsonl")

Step 1: List the files

The client lists the daily prefix and builds one input per file.

Step 2: Transform and insert one file

The worker owns extract, transform, and load for one object. execute_values keeps each file as a small number of batched inserts.

The insert is idempotent because event_id is the conflict key. That makes retries safe.

Step 3: Smoke test one file

Run one file before opening many database connections.

Step 4: Protect Postgres

The database is the constraint, so max_parallelism is the important line.

What's the point?

Transforming 10,000 files in parallel is easy. Loading them without flattening Postgres is the part that matters.

That is why I like this shape. The Python stays boring, the insert stays idempotent, and the sink gets a real concurrency cap. You can put this behind cron or CI without adopting a workflow platform for one file drop.

Last updated