
Scrape 1M web pages

In this example we:

  • Read 1,000,000 URLs from a file.

  • Scrape them in 500-URL chunks.

  • Stream JSONL rows with either parsed fields or errors.

  • Keep retry and politeness rules inside the worker.

The stale pages, redirects, 503s, and broken HTML are the dataset. A happy-path sample filters out the part you most need to measure.

Dataset: URL archive

The input is a text file with one URL per line.

import json
import random
import time
from pathlib import Path

import httpx
from burla import remote_parallel_map
from selectolax.parser import HTMLParser

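OUT_PATH = Path("/workspace/shared/web-scrape/pages.jsonl")  # JSONL output: one row per URL
CHUNK = 500  # URLs handed to each worker call
MAX_PARALLELISM = 1_000  # upper bound on chunks processed at the same time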

Step 1: Chunk URLs

The client only plans the work: it splits the URL list into chunks and does no fetching itself. Each worker receives one chunk.
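A minimal sketch of the planning step, assuming the one-URL-per-line input described above. The helper names `read_urls` and `chunked` are illustrative, not part of any library.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List

CHUNK = 500  # same chunk size as the constant defined earlier


def read_urls(path: Path) -> Iterator[str]:
    """Yield stripped, non-empty lines from a one-URL-per-line file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()
            if url:
                yield url


def chunked(urls: Iterable[str], size: int = CHUNK) -> Iterator[List[str]]:
    """Group an iterable of URLs into lists of at most `size` items."""
    iterator = iter(urls)
    while chunk := list(islice(iterator, size)):
        yield chunk
```

With 1,000,000 URLs and a chunk size of 500, `list(chunked(read_urls(path)))` produces 2,000 chunks. Both helpers are lazy, so the client never has to build more than one chunk at a time unless it chooses to.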

Step 2: Fetch and parse inside the worker

The worker keeps one HTTP client open, backs off on temporary failures, and returns an error row when a page fails.
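The backoff part can be sketched on its own. This is one possible shape, not the tutorial's exact worker: `fetch_with_backoff` is a hypothetical helper, `get` is any callable that returns an object with a `status_code` attribute, and the set of retryable status codes is an assumption.

```python
import random
import time
from typing import Callable

# Status codes treated as temporary here; this set is an assumption.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def fetch_with_backoff(get: Callable, url: str, attempts: int = 3,
                       base_delay: float = 1.0, sleep: Callable = time.sleep):
    """Call get(url), retrying temporary failures with exponential backoff.

    Returns the last response received. If the final attempt raises,
    that exception propagates to the caller.
    """
    for attempt in range(attempts):
        last_try = attempt == attempts - 1
        try:
            response = get(url)
        except Exception:
            # Broad on purpose for the sketch; narrow this in real code.
            if last_try:
                raise
        else:
            if response.status_code not in RETRYABLE_STATUS or last_try:
                return response
        # Wait base_delay * 2**attempt plus up to base_delay of random jitter.
        sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Inside a worker, `get` would typically be the `get` method of a single `httpx.Client` opened once per chunk, so connections are reused across the 500 URLs. The broad `except Exception` could be narrowed to `httpx.TransportError` there. Passing `sleep` as a parameter keeps the helper easy to test without real delays.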

Returning error rows is important. A failed page is still data about the archive.
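One way to make that concrete is to give success rows and error rows the same keys. The field names below (`url`, `ok`, `status`, `title`, `error`) are illustrative assumptions, and `fetch` and `parse` are hypothetical callables standing in for the shared HTTP client and the HTML parser.

```python
from typing import Callable, Optional


def ok_row(url: str, status: int, title: Optional[str]) -> dict:
    """Row for a page that was fetched and parsed."""
    return {"url": url, "ok": True, "status": status, "title": title, "error": None}


def error_row(url: str, error: str, status: Optional[int] = None) -> dict:
    """Row for a page that failed; same keys as ok_row."""
    return {"url": url, "ok": False, "status": status, "title": None, "error": error}


def scrape_one(url: str, fetch: Callable, parse: Callable) -> dict:
    """Fetch and parse one URL, always returning a row instead of raising."""
    try:
        response = fetch(url)
        if response.status_code >= 400:
            return error_row(url, f"HTTP {response.status_code}", response.status_code)
        return ok_row(url, response.status_code, parse(response.text))
    except Exception as exc:
        # Network and parser failures both become data, not crashes.
        return error_row(url, f"{type(exc).__name__}: {exc}")
```

Because every row has the same keys, downstream code can load the JSONL file without branching on row type. In the worker, `parse` might wrap selectolax, for example by reading the text of `HTMLParser(html).css_first("title")` when that node exists.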

Step 3: Smoke test a chunk

Run one chunk and inspect both successes and failures.
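A small summary helper makes the inspection quicker. This sketch assumes each row is a dict with an `ok` boolean and, for failures, an `error` string whose prefix before the first colon names the kind of failure. `summarize` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def summarize(rows: List[dict], samples: int = 3) -> Dict:
    """Count outcomes in one chunk and keep a few failures to read by eye."""
    failures = [row for row in rows if not row["ok"]]
    # Group failures by the text before the first colon, e.g. "TimeoutError".
    kinds = Counter(row["error"].split(":", 1)[0] for row in failures)
    return {
        "total": len(rows),
        "ok": len(rows) - len(failures),
        "failed": len(failures),
        "error_kinds": dict(kinds),
        "sample_failures": failures[:samples],
    }
```

If the failed count is zero on a real archive, that is worth a second look: it may mean errors are being swallowed rather than recorded.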

Step 4: Stream the crawl

Chunks stream back as they finish, so the scrape can run for hours without building one giant result list.
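The writing side of that loop can be sketched independently of the parallel map. Here `chunk_results` stands for whatever iterable yields one list of rows per finished chunk; how that iterable is obtained from `remote_parallel_map` is not shown, and `stream_to_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Iterable, List


def stream_to_jsonl(chunk_results: Iterable[List[dict]], out_path: Path) -> int:
    """Append each finished chunk to a JSONL file and return the rows written."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    # Append mode is an assumption: a restarted run adds to the same file.
    with out_path.open("a", encoding="utf-8") as handle:
        for rows in chunk_results:
            for row in rows:
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")
            handle.flush()  # hand each completed chunk to the OS before waiting on the next
            written += len(rows)
    return written
```

Only one chunk of rows is held in memory at a time, which is what lets the crawl run for hours. Since append mode can produce duplicate rows after a restart, deduplicating by URL afterward is one option.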

What's the point?

Scraping gets weird fast. DNS failures, TLS errors, parser misses, per-site politeness, and failure logging all matter.

This design is plain on purpose: chunk URLs, reuse a client, parse the fields you need, return an error row when something fails. Once the output exists, you can compute failure rates by host or retry only bad chunks.
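As an example of that follow-up analysis, here is a sketch that computes the failure rate per host from the JSONL output. It assumes each row has a `url` string and an `ok` boolean. `failure_rate_by_host` is a hypothetical name.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable
from urllib.parse import urlsplit


def failure_rate_by_host(lines: Iterable[str]) -> Dict[str, float]:
    """Return failed / total for each host found in JSONL rows."""
    totals = defaultdict(int)
    failed = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the file
        row = json.loads(line)
        host = urlsplit(row["url"]).hostname or "(no host)"
        totals[host] += 1
        if not row["ok"]:
            failed[host] += 1
    return {host: failed[host] / totals[host] for host in totals}
```

An open file handle can be passed directly as `lines`, so the output never has to be loaded in full.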
