
Run batch LLM inference

In this example we:

  • Read review text from Parquet.

  • Load a HuggingFace sentiment model once per worker.

  • Score the whole corpus in batches.

  • Stream JSONL results as batches finish.

I would not build an endpoint for this. There is no traffic to serve. There is just a pile of rows that need model scores.

Dataset: product reviews in Parquet

Assume the source dataset has review_id and text columns.

import json
from pathlib import Path

import pyarrow.dataset as ds
from burla import remote_parallel_map

DATASET = "s3://my-bucket/reviews/"
OUT_PATH = Path("/workspace/shared/batch-inference/review-sentiment.jsonl")
BATCH_SIZE = 10_000
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"

Step 1: Build batches

The client reads ids and text, then builds 10,000-row batches. Each batch is one worker input.
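Here is a minimal sketch of that step, assuming the dataset can be opened with pyarrow.dataset. The helper names iter_rows and build_batches are hypothetical, not part of Burla or PyArrow. PyArrow's record batches follow file and row-group boundaries, so the sketch re-chunks rows into fixed-size lists.

```python
from itertools import islice


def iter_rows(dataset_uri):
    """Yield {"review_id": ..., "text": ...} dicts from a Parquet dataset.

    Hypothetical helper. Assumes the review_id and text columns exist.
    """
    # Imported lazily so build_batches can be used without PyArrow installed.
    import pyarrow.dataset as ds

    dataset = ds.dataset(dataset_uri, format="parquet")
    # Only read the two columns the worker needs.
    for record_batch in dataset.to_batches(columns=["review_id", "text"]):
        yield from record_batch.to_pylist()


def build_batches(rows, batch_size=10_000):
    """Group an iterable of rows into lists of at most batch_size rows."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch
```

Something like batches = list(build_batches(iter_rows(DATASET), BATCH_SIZE)) would then hold every batch on the client. That is fine for ids and short review text, but worth rethinking if the corpus does not fit in client memory.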

Step 2: Write the worker function

Each worker process loads the model the first time it runs a batch and keeps it cached in that process. Later batches assigned to the same process reuse the cached model instead of reloading it.
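A worker along these lines might look like the sketch below. It assumes the transformers package is installed on the workers and that the worker process keeps module-level state between calls, as described above. MODEL_CACHE, get_classifier, and score_batch are hypothetical names, and the inner batch size of 32 is an arbitrary starting point, not a tuned value.

```python
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Process-level cache: filled on the first call, reused by later calls
# that land on the same worker process.
MODEL_CACHE = {}


def get_classifier():
    """Return the cached pipeline, loading it on first use."""
    if "classifier" not in MODEL_CACHE:
        # Heavy import deferred until a worker actually needs the model.
        from transformers import pipeline

        MODEL_CACHE["classifier"] = pipeline("sentiment-analysis", model=MODEL_NAME)
    return MODEL_CACHE["classifier"]


def score_batch(batch):
    """Score one batch of {"review_id", "text"} rows."""
    classifier = get_classifier()
    # Treat missing text as an empty string so one bad row cannot fail the batch.
    texts = [row["text"] or "" for row in batch]
    # truncation=True clips reviews longer than the model's maximum input length.
    predictions = classifier(texts, truncation=True, batch_size=32)
    return [
        {
            "review_id": row["review_id"],
            "label": pred["label"],
            "score": float(pred["score"]),
        }
        for row, pred in zip(batch, predictions)
    ]
```

Returning plain dicts with the review_id attached keeps each result self-describing, so the order in which batches finish does not matter later.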

Step 3: Smoke test one batch

Run one batch first so you can see model download time, memory, and output shape.
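A smoke test could be sketched like this. The helper smoke_test is hypothetical, and it takes the map function as an argument so the same check works with remote_parallel_map or with a local stand-in. It assumes the map function accepts (function, inputs) and returns one output per input. It times the call and checks the output shape. Memory still has to be watched on the worker itself.

```python
import time


def smoke_test(run_map, score_fn, batch):
    """Score a single batch through run_map and sanity-check the result."""
    started = time.perf_counter()
    # One input in, so exactly one output is expected back.
    (results,) = run_map(score_fn, [batch])
    elapsed = time.perf_counter() - started

    # Every input row should come back once, with the expected fields.
    assert len(results) == len(batch), "row count changed during scoring"
    for row, result in zip(batch, results):
        assert result["review_id"] == row["review_id"]
        assert {"review_id", "label", "score"} <= set(result)

    # The first run includes the model download, so expect it to be slow.
    return {"rows": len(results), "seconds": elapsed, "sample": results[:3]}
```

In this example the call would be roughly smoke_test(remote_parallel_map, score_batch, batches[0]), where batches and score_batch stand for whatever Steps 1 and 2 produced.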

Step 4: Run the full scoring job

The output streams back as each batch finishes, so we can write JSONL without holding everything in memory.
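The writing side can be sketched as a small helper that consumes result batches one at a time. The name write_jsonl is hypothetical. It assumes each item it receives is a list of JSON-serializable dicts, such as the per-batch output of the worker function.

```python
import json
from pathlib import Path


def write_jsonl(result_batches, out_path):
    """Write each result row as one JSON line, one batch at a time."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    rows_written = 0
    with out_path.open("w", encoding="utf-8") as f:
        # Only the current batch is held in memory while writing.
        for results in result_batches:
            for result in results:
                f.write(json.dumps(result) + "\n")
                rows_written += 1
    return rows_written
```

If remote_parallel_map is called with generator=True (my understanding of Burla's option for yielding outputs as they complete, so check it against your installed version), the full run might be write_jsonl(remote_parallel_map(score_batch, batches, generator=True), OUT_PATH). Batches would arrive in completion order, so the file is not sorted by review_id. The returned row count gives a quick check against the size of the input.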

What's the point?

The endpoint version is usually overbuilt. Health checks, autoscaling, request formats, and idle capacity are useful when users are sending traffic. They are annoying when I just need to score a dataset once.

The real question is whether the model, batch size, token length, memory, and output format survive the full corpus. A tiny sample mostly tells you the imports work. A batch job tells you whether the exact scoring code can finish every row and leave behind a file you can audit.
