
Embed 50K Wikipedia articles

In this example we:

  • Download 50,000 English Wikipedia articles from the Hugging Face wikimedia/wikipedia dataset.

  • Prepare small JSONL text shards on CPU workers.

  • Embed every shard with BAAI/bge-large-en-v1.5 on A100 workers.

  • Write vector shards to shared storage, then search across the combined matrix.

The goal is not to make a toy embedding script. The goal is to keep the real production shape visible: cheap CPU workers prepare text, expensive GPU workers run the model, and the only things the client combines at the end are compact vector artifacts.

Dataset: English Wikipedia

We use the 20231101.en configuration of wikimedia/wikipedia, which has a single train split. Each row has an article id, URL, title, and text body.

For this demo we only use the first 50,000 articles. That is small enough to inspect and rerun, but large enough that the pipeline has the same shape as a real backfill.

import json
import math
import os
from itertools import islice
from pathlib import Path

import numpy as np
from burla import remote_parallel_map
from datasets import load_dataset

MODEL_NAME = "BAAI/bge-large-en-v1.5"
GPU_IMAGE = "jakezuliani/burla-embedder:latest"
SHARED_ROOT = Path("/workspace/shared/vector_embeddings_demo")

ARTICLE_COUNT = 50_000
TEXT_SHARDS = 50
ARTICLES_PER_SHARD = math.ceil(ARTICLE_COUNT / TEXT_SHARDS)
MAX_GPU_PARALLELISM = int(os.environ.get("DEMO_MAX_GPU_PARALLELISM", 8))

/workspace/shared is backed by Burla shared storage. Anything written there by one worker can be read later by the client or by another worker.

The client environment needs burla, datasets, numpy, and sentence-transformers. The GPU worker environment comes from the image in the next step.

Step 1: Use a CUDA image

For the GPU stage we use a custom image with PyTorch, sentence-transformers, numpy, and the model weights already installed.

Build and push that image to a registry your Burla workers can pull from. In this example, the pushed image is jakezuliani/burla-embedder:latest, the value of GPU_IMAGE above.

Baking the model into the image is deliberate. Without it, every A100 worker starts by downloading the same model weights, and the demo turns into a network and cache test instead of an embedding job.

Step 2: Prepare text shards on CPU workers

The CPU stage downloads article text and writes 50 JSONL shards. Each shard contains 1,000 articles, and each article is trimmed to its first 2,000 characters so the GPU stage does predictable work.
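A minimal sketch of what one CPU worker could do, assuming streaming access to the dataset. The function names, the `text` subfolder, and the `text-00000.jsonl` file naming are hypothetical choices for illustration, not part of the original example. The `rows` argument exists only so the function can be exercised locally without touching the network.

```python
import json
from itertools import islice
from pathlib import Path

ARTICLES_PER_SHARD = 1_000
MAX_CHARS = 2_000  # trim length described in the text
TEXT_DIR = Path("/workspace/shared/vector_embeddings_demo/text")  # hypothetical subfolder


def stream_articles():
    """Stream the English Wikipedia rows instead of downloading the full dump."""
    from datasets import load_dataset  # lazy import: only needed on the worker

    return load_dataset(
        "wikimedia/wikipedia", "20231101.en", split="train", streaming=True
    )


def prepare_shard(shard_index, out_dir=TEXT_DIR, rows=None, per_shard=ARTICLES_PER_SHARD):
    """Write one JSONL shard of trimmed articles and return its path."""
    rows = stream_articles() if rows is None else rows
    start = shard_index * per_shard  # simple streaming offset

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"text-{shard_index:05d}.jsonl"  # hypothetical naming

    with out_path.open("w", encoding="utf-8") as f:
        # islice still iterates over the skipped rows; it just discards them.
        for row in islice(rows, start, start + per_shard):
            record = {
                "id": row["id"],
                "url": row["url"],
                "title": row["title"],
                "text": row["text"].strip()[:MAX_CHARS],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return str(out_path)
```

The worker returns only a path, so the text itself stays on shared storage.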

Run one shard first. This is the same smoke-test habit as the XGBoost example: prove the dataset path, packages, and output shape before launching the full job.

Then prepare all 50 shards.
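The smoke test and the full run can be sketched as one small driver. This is an illustration under assumptions: `run_cpu_stage` is a hypothetical helper, and `map_fn` stands for any callable with the shape `map_fn(function, inputs)`, which is how `remote_parallel_map` is called here. Passing it in makes the same driver runnable locally with a plain list comprehension.

```python
TEXT_SHARDS = 50


def run_cpu_stage(prepare_shard, map_fn, shard_count=TEXT_SHARDS):
    """Smoke-test shard 0, then prepare every shard. Returns sorted shard paths."""
    # One input first: a bad dataset path or a missing package fails here, cheaply.
    first = list(map_fn(prepare_shard, [0]))
    if len(first) != 1:
        raise RuntimeError("smoke test did not return exactly one shard path")
    print("smoke test wrote", first[0])

    # Same function, every input. Shard 0 is simply written again.
    paths = list(map_fn(prepare_shard, list(range(shard_count))))

    # Sort so the result does not depend on the order workers finish in.
    return sorted(paths)


# Local stand-in with the same call shape, useful before spending cluster time.
def local_map(function, inputs):
    return [function(item) for item in inputs]
```

On a cluster the call would look like `run_cpu_stage(prepare_shard, remote_parallel_map)`, with no GPU arguments at all.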

This stage does not ask for GPUs. It is just I/O and light string cleanup, so giving it A100s would raise the cost without speeding up anything that matters.

For 50,000 articles, simple streaming offsets keep the code easy to read. For a million-article run, I would shard by source Parquet file first so workers do not rescan earlier rows.
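To see why offsets stop scaling, here is a rough sketch that counts how many rows the workers iterate over in total when each one skips to its own offset, next to a hypothetical per-file plan. It assumes that skipping in a stream still reads the skipped rows, which holds for `islice`. The helper names are illustrative, and the file list would come from the dataset repository.

```python
def rows_scanned_with_offsets(shard_count, per_shard):
    """Total rows iterated when shard k reads (k + 1) * per_shard rows from the start."""
    return per_shard * shard_count * (shard_count + 1) // 2


def plan_file_shards(parquet_files, shard_count):
    """Give each worker whole Parquet files, so no worker rereads another's rows."""
    files = sorted(parquet_files)
    return [files[i::shard_count] for i in range(min(shard_count, len(files)))]


def open_files(files):
    """Stream only the assigned files on a worker."""
    from datasets import load_dataset  # lazy import: only needed on the worker

    return load_dataset("parquet", data_files=list(files), split="train", streaming=True)


print(rows_scanned_with_offsets(50, 1_000))     # 1275000 rows read for 50,000 kept
print(rows_scanned_with_offsets(1_000, 1_000))  # 500500000 rows read for 1,000,000 kept
```

At 50 shards the extra scanning is tolerable. At 1,000 shards the offset approach reads roughly 500 times more rows than it keeps, which is where file-level sharding earns its complexity.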

Step 3: Embed each shard on A100s

Now each GPU worker reads one JSONL shard, loads the embedding model once, and writes two files:

  • a .npy matrix containing normalized float32 vectors

  • a metadata JSONL file containing ids, URLs, and titles in the same order

The worker returns paths to those files. It does not send the vectors back to the client as return values.
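A minimal sketch of such a worker function, assuming the text shards are JSONL files with `id`, `url`, `title`, and `text` fields. The `vectors` subfolder, the `.meta.jsonl` suffix, the batch size, and the `encode` parameter are hypothetical. `encode` is there so the file handling can be checked locally with a stand-in encoder and no GPU.

```python
import json
from pathlib import Path

import numpy as np

MODEL_NAME = "BAAI/bge-large-en-v1.5"
VECTOR_DIR = Path("/workspace/shared/vector_embeddings_demo/vectors")  # hypothetical subfolder


def load_encoder():
    """Load the model once and return a texts -> vectors callable."""
    from sentence_transformers import SentenceTransformer  # only needed on GPU workers

    model = SentenceTransformer(MODEL_NAME, device="cuda")
    # normalize_embeddings=True makes a dot product equal to cosine similarity.
    return lambda texts: model.encode(texts, batch_size=64, normalize_embeddings=True)


def embed_shard(text_path, out_dir=VECTOR_DIR, encode=None):
    """Embed one JSONL shard; write a .npy matrix and an aligned metadata file."""
    encode = load_encoder() if encode is None else encode
    text_path, out_dir = Path(text_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with text_path.open(encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]

    vectors = np.asarray(encode([r["text"] for r in rows]), dtype=np.float32)

    vec_path = out_dir / f"{text_path.stem}.npy"
    meta_path = out_dir / f"{text_path.stem}.meta.jsonl"
    np.save(vec_path, vectors)
    with meta_path.open("w", encoding="utf-8") as f:
        for r in rows:  # same order as the matrix rows
            f.write(json.dumps({"id": r["id"], "url": r["url"], "title": r["title"]}) + "\n")

    # Only small strings and a count go back to the caller.
    return {"vectors": str(vec_path), "metadata": str(meta_path), "rows": len(rows)}
```

Row i of the matrix and line i of the metadata file describe the same article, which is the only contract the search step needs.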

Run one GPU shard first.

If that works, run the full embedding stage.
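Both GPU calls can be sketched with one driver that forwards keyword arguments to the map function. Treat this as a sketch under assumptions: `run_gpu_stage` is a hypothetical helper, and the `image` and `func_gpu` keyword names in the commented call are guesses that should be checked against the Burla API reference. Only `max_parallelism` is named in this walkthrough.

```python
import os

GPU_IMAGE = "jakezuliani/burla-embedder:latest"
MAX_GPU_PARALLELISM = int(os.environ.get("DEMO_MAX_GPU_PARALLELISM", 8))


def run_gpu_stage(embed_shard, text_paths, map_fn, **map_kwargs):
    """Embed one shard as a smoke test, then all shards. Returns sorted results."""
    text_paths = sorted(text_paths)

    # One shard first: image pull, CUDA, and model load problems surface here.
    smoke = list(map_fn(embed_shard, text_paths[:1], **map_kwargs))
    if len(smoke) != 1:
        raise RuntimeError("GPU smoke test did not return exactly one result")

    results = list(map_fn(embed_shard, text_paths, **map_kwargs))
    # Sort by output path so the result does not depend on completion order.
    return sorted(results, key=lambda r: r["vectors"])


# Hypothetical cluster call; keyword names other than max_parallelism are assumptions:
# from burla import remote_parallel_map
# results = run_gpu_stage(
#     embed_shard, text_paths, remote_parallel_map,
#     image=GPU_IMAGE, func_gpu="A100", max_parallelism=MAX_GPU_PARALLELISM,
# )
```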

max_parallelism is the GPU budget knob. If your account can run 8 A100 workers, use 8. If it can run 2, use 2. The code does not change.

Step 4: Search the vectors

The expensive part is finished. Search only needs the vector shards, the metadata shards, and one query vector.

For one query, running the model locally on CPU is fine. If you do not want the model on your client at all, write a tiny query-embedding function and run it with the same GPU image.
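A minimal sketch of the client-side search, assuming each vector shard `name.npy` sits next to a metadata file `name.meta.jsonl` in one folder. That layout and the function names are hypothetical. Because the stored vectors are normalized, the dot product below is cosine similarity as long as the query vector is normalized too.

```python
import json
from pathlib import Path

import numpy as np


def load_index(vector_dir):
    """Stack every .npy shard into one matrix with metadata aligned row by row."""
    matrices, metadata = [], []
    for vec_path in sorted(Path(vector_dir).glob("*.npy")):
        meta_path = vec_path.with_name(vec_path.stem + ".meta.jsonl")
        matrices.append(np.load(vec_path))
        with meta_path.open(encoding="utf-8") as f:
            metadata.extend(json.loads(line) for line in f if line.strip())

    matrix = np.concatenate(matrices, axis=0)
    if matrix.shape[0] != len(metadata):
        raise ValueError("vector rows and metadata rows are out of sync")
    return matrix, metadata


def search(matrix, metadata, query_vector, k=5):
    """Return the k best (score, title) pairs for one normalized query vector."""
    scores = matrix @ np.asarray(query_vector, dtype=np.float32)  # one matrix multiply
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), metadata[i]["title"]) for i in top]
```

A query vector could come from `SentenceTransformer("BAAI/bge-large-en-v1.5", device="cpu").encode([query], normalize_embeddings=True)[0]` on the client. At 50,000 rows a full sort of the scores is cheap, so the sketch does not bother with a partial sort or an approximate index.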

This is the moment the example is trying to make boring: vectors are just files, metadata is just JSONL, and search is just a matrix multiply. Burla helped with the part that should be parallel, then got out of the way.

What's the point?

Embedding examples often hide the hard part by quietly running a smaller model on CPU. That makes the notebook easy, but it dodges the things users usually need help with: CUDA images, model load time, GPU quota, and the handoff between cheap preparation work and expensive model work.

This example keeps the real split visible. CPU workers prepare article shards. A100 workers embed those shards and write vector artifacts. The client loads the finished files and asks the search question.

That is the pattern I would copy into a real backfill for support tickets, PDFs, product catalogs, legal documents, code snippets, or anything else where the model belongs on GPUs but the rest of the pipeline does not.
