Resize an image corpus

In this example we:

  • List a large source image corpus from S3.

  • Resize each image into 256-, 512-, and 1024-pixel variants.

  • Write resized images back to S3.

  • Stream a manifest with successes, dimensions, and failures.

A preview folder always looks fine. The full corpus is where the EXIF rotations, corrupt PNGs, CMYK JPEGs, and odd aspect ratios live.

Dataset: source images in S3

Assume raw images live under s3://my-photos/originals/ and outputs should be written to s3://my-photos-resized/.

import io
import json
import os
from pathlib import Path

import boto3
from PIL import Image, ImageOps
from burla import remote_parallel_map

SRC_BUCKET = "my-photos"
DST_BUCKET = "my-photos-resized"
SRC_PREFIX = "originals/"
OUT_PREFIX = "resized/"
MANIFEST_PATH = Path("/workspace/shared/image-resize/manifest.jsonl")
CHUNK_SIZE = 1_000
SIZES = [256, 512, 1024]

Step 1: Chunk the image keys

The client lists every source key under SRC_PREFIX and batches the keys into chunks of CHUNK_SIZE (1,000 images), so each worker call handles one chunk.
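A minimal sketch of the listing and chunking, assuming standard boto3 pagination over list_objects_v2. The names list_image_keys, chunked, and IMAGE_EXTENSIONS are illustrative and not part of the original example; the extension filter in particular is an assumption about what counts as an image in this bucket.

```python
from itertools import islice

SRC_BUCKET = "my-photos"      # same values as the setup block
SRC_PREFIX = "originals/"
CHUNK_SIZE = 1_000

# Assumption: only these extensions are treated as images.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")


def list_image_keys(bucket, prefix):
    """Yield every image key under the prefix, one S3 page at a time."""
    import boto3  # imported here so chunked() can be used without boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.lower().endswith(IMAGE_EXTENSIONS):
                yield key


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

With these helpers, the chunks for the whole corpus would be list(chunked(list_image_keys(SRC_BUCKET, SRC_PREFIX), CHUNK_SIZE)). The last chunk is simply shorter when the corpus size is not a multiple of CHUNK_SIZE.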

Step 2: Resize inside the worker

The worker opens each image, fixes EXIF orientation, writes every target size, and returns a report row. Bad images become manifest rows instead of crashing the whole job.
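One way the worker could look, as a sketch under stated assumptions rather than the original implementation. The function names (output_key, resize_variants, resize_chunk), the output key layout, the choice to save every variant as an RGB JPEG, and the shape of the report row are all assumptions made for illustration.

```python
import io
import posixpath

SRC_BUCKET = "my-photos"          # same values as the setup block
DST_BUCKET = "my-photos-resized"
SRC_PREFIX = "originals/"
OUT_PREFIX = "resized/"
SIZES = [256, 512, 1024]


def output_key(src_key, size):
    """Map originals/a/b.png to resized/<size>/a/b.jpg (assumed layout).

    Note: a.png and a.jpg would map to the same output key.
    """
    rel = src_key[len(SRC_PREFIX):] if src_key.startswith(SRC_PREFIX) else src_key
    stem = posixpath.splitext(rel)[0]
    return f"{OUT_PREFIX}{size}/{stem}.jpg"


def resize_variants(data):
    """Return ((width, height), {size: jpeg_bytes}) for one image."""
    from PIL import Image, ImageOps

    with Image.open(io.BytesIO(data)) as opened:
        # Apply the EXIF orientation tag so the pixels are upright.
        img = ImageOps.exif_transpose(opened)
        # Normalize CMYK, palette, and alpha images so JPEG saving works.
        img = img.convert("RGB")
    width, height = img.size
    variants = {}
    for size in SIZES:
        copy = img.copy()
        # thumbnail() keeps the aspect ratio and never enlarges an image.
        copy.thumbnail((size, size))
        buf = io.BytesIO()
        copy.save(buf, format="JPEG", quality=90)
        variants[size] = buf.getvalue()
    return (width, height), variants


def resize_chunk(keys):
    """Resize one chunk of keys; return one report row per key."""
    import boto3

    s3 = boto3.client("s3")
    rows = []
    for key in keys:
        try:
            data = s3.get_object(Bucket=SRC_BUCKET, Key=key)["Body"].read()
            (width, height), variants = resize_variants(data)
            for size, payload in variants.items():
                s3.put_object(
                    Bucket=DST_BUCKET,
                    Key=output_key(key, size),
                    Body=payload,
                    ContentType="image/jpeg",
                )
            rows.append({"key": key, "ok": True, "width": width,
                         "height": height, "sizes": sorted(variants)})
        except Exception as exc:
            # A bad image becomes a manifest row instead of failing the chunk.
            rows.append({"key": key, "ok": False,
                         "error": f"{type(exc).__name__}: {exc}"})
    return rows
```

Catching Exception per key is deliberate here: a corrupt PNG or a permissions error on one object is recorded with its error type, and the rest of the chunk still runs.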

Step 3: Smoke test one chunk

The first chunk usually reveals missing dependencies, bad credentials, or PIL edge cases.
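A smoke test could be sketched as follows. It assumes remote_parallel_map returns a list with one result per input, and that the worker returns report rows with "ok" and "error" fields as in the sketch of Step 2; summarize and smoke_test are hypothetical helper names.

```python
from collections import Counter


def summarize(rows):
    """Count successes and failures, grouping failures by exception type."""
    failures = [row for row in rows if not row.get("ok")]
    error_types = Counter(
        # Rows are assumed to store errors as "ExceptionName: message".
        row.get("error", "unknown").split(":", 1)[0] for row in failures
    )
    return {
        "total": len(rows),
        "ok": len(rows) - len(failures),
        "failed": len(failures),
        "errors": dict(error_types),
    }


def smoke_test(worker, chunks):
    """Run only the first chunk remotely and summarize its report rows."""
    # Imported lazily so summarize() works without burla installed.
    from burla import remote_parallel_map

    first_chunk = next(iter(chunks))
    # One input in, so one list of rows is expected back (assumption).
    (rows,) = remote_parallel_map(worker, [first_chunk])
    return summarize(rows)
```

If every row fails with the same error type, the cause is more likely the environment (a missing package, bad credentials) than the images themselves, which is worth fixing before launching the full job.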

Step 4: Stream the manifest

Workers write images directly to S3. The client writes the report as chunks finish.
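The client side might be sketched like this. It assumes remote_parallel_map accepts generator=True and then yields each chunk's result as it finishes; check the signature in your installed burla version. stream_manifest and run_job are illustrative names, and the JSON Lines layout is an assumption.

```python
import json
from pathlib import Path


def stream_manifest(chunk_results, manifest_path):
    """Append report rows to a JSON Lines manifest as chunks arrive."""
    manifest_path = Path(manifest_path)
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    counts = {"ok": 0, "failed": 0}
    with manifest_path.open("a", encoding="utf-8") as fh:
        for rows in chunk_results:
            for row in rows:
                fh.write(json.dumps(row, sort_keys=True) + "\n")
                counts["ok" if row.get("ok") else "failed"] += 1
            # Flush per chunk so a crash loses at most the chunk in flight.
            fh.flush()
    return counts


def run_job(worker, chunks, manifest_path):
    """Fan the chunks out and write the manifest while results stream in."""
    # Imported lazily so stream_manifest() works without burla installed.
    from burla import remote_parallel_map

    # Assumption: generator=True yields results as they complete.
    results = remote_parallel_map(worker, chunks, generator=True)
    return stream_manifest(results, manifest_path)
```

Because stream_manifest takes any iterable of row lists, it can be exercised locally with a plain list before it is pointed at a remote run.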

What's the point?

The resized images are only half the result. The manifest tells you which files worked, what dimensions they had, and which ones need a retry.
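As an illustration of how the manifest could drive a retry, the sketch below collects the keys whose most recent row is a failure. It assumes a JSON Lines manifest with "key" and "ok" fields, as in the earlier sketches; keys_needing_retry is a hypothetical helper.

```python
import json


def keys_needing_retry(manifest_path):
    """Return the sorted keys whose latest manifest row is a failure."""
    latest = {}
    with open(manifest_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            # Later rows win, so a successful retry clears an old failure.
            latest[row["key"]] = row
    return sorted(key for key, row in latest.items() if not row.get("ok"))
```

Letting the last row per key win means the same manifest file can be appended to across retries without producing false positives.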

If I were about to train on this dataset, I would want that manifest before training starts. Otherwise the pipeline can silently drop the weird slice of the corpus, and you only find out later, when the training data turns out to be cleaner than reality.