
Ranking 572M Amazon reviews

In this example we:

  • Stream 275 GB of Amazon review JSONL from HuggingFace.

  • Split 34 category files into 545 byte-range chunks instead of downloading everything first.

  • Find the most absurd reviews with deterministic scoring, then build the Wall of Rants and Unhinged Mode from them.

The goal is not just a funny sample. The repo parses 571,544,386 reviews, keeps tiny top-K heaps per shard, merges them into category findings, then runs a second worst-of-worst pass for censored strong profanity and categorized slur hits.

Dataset: Amazon Reviews 2023

The raw dataset is a set of large JSONL files, one per category. We stream byte ranges so each worker owns a slice of a file.

import heapq
import json
import math
from pathlib import Path

import requests
from burla import remote_parallel_map
from huggingface_hub import HfApi

REPO_ID = "McAuley-Lab/Amazon-Reviews-2023"
HF_BASE = f"https://huggingface.co/datasets/{REPO_ID}/resolve/main/"
SHARD_DIR = Path("/workspace/shared/amazon-reviews/shards")
FINAL_DIR = Path("/workspace/shared/amazon-reviews/final")
TOP_K_PER_SHARD = 200

Step 1: Plan byte ranges

Each category file is huge, so we split it into jobs of roughly 500 MB, with one byte range per worker.
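A minimal sketch of the planning step, assuming the size of each category file is already known (for example from the hub's file metadata). The name `plan_byte_ranges`, the job dictionary keys, and the exact 500 MiB constant are illustrative and not taken from the repo.

```python
import math

CHUNK_BYTES = 500 * 1024 * 1024  # assumed target job size (~500 MB)


def plan_byte_ranges(file_sizes, chunk_bytes=CHUNK_BYTES):
    """Split each file into contiguous, inclusive [start, end] byte ranges.

    `file_sizes` maps a file path to its size in bytes.
    """
    jobs = []
    for path, size in sorted(file_sizes.items()):
        if size <= 0:
            continue  # nothing to read in an empty file
        n_chunks = math.ceil(size / chunk_bytes)
        for i in range(n_chunks):
            start = i * chunk_bytes
            end = min(size, start + chunk_bytes) - 1  # inclusive, like HTTP Range
            jobs.append({"path": path, "chunk": i, "start": start, "end": end})
    return jobs
```

Sorting the paths keeps the job list stable between runs, so a chunk index always refers to the same bytes.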

Step 2: Stream records safely

The worker asks HuggingFace for a byte range, discards the first partial line when needed, and parses JSON rows.

Byte ranges are what make the job restartable. A failed chunk is just one file path plus two byte offsets.
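One way to handle the chunk boundaries is sketched below. It assumes a convention where a line belongs to the chunk that contains its first byte: a worker starts reading one byte before its range, drops everything up to the first newline, and keeps reading past its end offset until the last owned line is complete. The helper names, the open-ended `Range` header, and the choice to skip malformed rows are assumptions for illustration, not the repo's actual code.

```python
import json


def iter_owned_lines(blocks, start, end):
    """Yield complete lines whose first byte lies in [start, end].

    `blocks` is an iterable of bytes that begins at offset max(start - 1, 0).
    """
    offset = max(start - 1, 0)  # absolute offset of the next unread line
    skip_first = start > 0      # the first (possibly partial) line is not ours
    buf = b""
    for block in blocks:
        buf += block
        *lines, buf = buf.split(b"\n")  # keep the unfinished tail in buf
        for line in lines:
            line_start = offset
            offset += len(line) + 1
            if skip_first:
                skip_first = False
                continue
            if line_start > end:
                return  # this line belongs to the next chunk
            if line.strip():
                yield line
    # Last line of the file when it has no trailing newline.
    if buf.strip() and not skip_first and offset <= end:
        yield buf


def iter_records(blocks, start, end):
    """Parse the owned lines as JSON, skipping rows that fail to decode."""
    for line in iter_owned_lines(blocks, start, end):
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def fetch_blocks(url, start, block_size=1 << 20):
    """Stream bytes from one byte before `start` (hypothetical HTTP wrapper)."""
    import requests  # imported lazily so the helpers above stay stdlib-only

    headers = {"Range": f"bytes={max(start - 1, 0)}-"}
    with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        yield from resp.iter_content(chunk_size=block_size)
```

A worker would then call something like `iter_records(fetch_blocks(url, start), start, end)`. Because every line has exactly one owner, adjacent chunks neither drop nor duplicate the row that straddles their boundary.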

Step 3: Score one chunk

The main pass keeps a small heap of the funniest/highest-signal reviews. The worker writes its heap to shared storage and returns a compact report.
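The bounded heap can be sketched as follows. This is an illustration under assumptions: `score_chunk`, the `score_fn` callback, the shard file layout, and the report keys are hypothetical names, and only the per-shard limit of 200 comes from `TOP_K_PER_SHARD` above.

```python
import heapq
import json
from pathlib import Path


def score_chunk(records, score_fn, out_path, k=200):
    """Keep the k highest-scoring reviews, write them out, return a small report."""
    heap = []  # min-heap, so heap[0] is always the weakest review still kept
    n_seen = 0
    for n_seen, review in enumerate(records, start=1):
        score = score_fn(review)
        if score <= 0:
            continue  # assumption: zero-signal reviews are never worth keeping
        # n_seen is unique, so ties on score never fall through to comparing dicts.
        entry = (score, n_seen, review)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        else:
            heapq.heappushpop(heap, entry)  # push, then drop the smallest

    top = [{"score": s, "review": r} for s, _, r in sorted(heap, reverse=True)]
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(top))
    return {"shard": out_path.name, "n_seen": n_seen, "n_kept": len(top)}
```

Memory stays proportional to `k` no matter how many reviews the chunk holds, and the returned report is small enough to send back from every worker.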

Step 4: Run both scoring passes

The main pass scores profanity, caps, rants, five-star mismatch, and punctuation storms. The worst pass hunts censored strong profanity and categorized slur hits for Unhinged Mode.

process_worst has the same input and output shape as process_main; only the scoring rules are stricter.
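To make "deterministic scoring" concrete, here is a rough sketch of what a main-pass scorer could look like. Everything specific in it is an assumption: the word lists are tiny placeholders, the weights and thresholds are invented, and the `text` and `rating` field names are assumed from the dataset's review schema. The repo's real rules are more extensive.

```python
import re

# Placeholder patterns, not the repo's actual lists.
MILD_WORDS = re.compile(r"\b(?:damn|crap|hell)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\b(?:worst|terrible|broke|refund|garbage)\b", re.IGNORECASE)
PUNCT_STORM = re.compile(r"[!?]{3,}")

# Invented weights, chosen only to make the example readable.
WEIGHTS = {"profanity": 3.0, "caps": 2.0, "rant": 1.0, "punct": 0.5, "mismatch": 4.0}


def score_main(review):
    """Return (score, signals) using only regexes, counters, and lengths."""
    text = review.get("text") or ""
    letters = [c for c in text if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    five_star_rant = review.get("rating") == 5 and NEGATIVE.search(text) is not None
    signals = {
        "profanity": len(MILD_WORDS.findall(text)),
        "caps": caps_ratio if len(letters) >= 20 else 0.0,  # ignore short shouts
        "rant": min(len(text) / 2000, 1.0),                 # saturating length bonus
        "punct": sum(len(m) for m in PUNCT_STORM.findall(text)),
        "mismatch": 1.0 if five_star_rant else 0.0,
    }
    score = sum(WEIGHTS[name] * value for name, value in signals.items())
    return score, signals
```

Since the function has no randomness and no model calls, the same review always gets the same score, which is what makes failed chunks safe to rerun. A stricter pass could keep this return shape and swap in harsher patterns and weights.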

Step 5: Reduce into site artifacts

One reducer merges the main shards into the Wall of Rants. Another merges the worst-of-worst shards. A final local analysis step handles rescoring, deduping, search pools, category findings, and the Unhinged Mode JSON.
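A reducer of this kind can be quite small. The sketch below assumes each shard file is a JSON list of `{"score": ..., "review": ...}` entries; that layout and the names `iter_shard_entries` and `merge_shards` are illustrative. Rescoring, deduping, and the other analysis steps are left out.

```python
import heapq
import json
from pathlib import Path


def iter_shard_entries(shard_dir):
    """Yield every saved entry from every shard file, in a stable file order."""
    for path in sorted(Path(shard_dir).glob("*.json")):
        yield from json.loads(path.read_text())


def merge_shards(shard_dir, k):
    """Return the k highest-scoring entries across all shards, best first."""
    return heapq.nlargest(k, iter_shard_entries(shard_dir), key=lambda e: e["score"])
```

Each shard already holds only its own top entries, so the reducer reads at most a few hundred rows per shard instead of the full dataset. The same function could serve both the main and the worst-of-worst shard directories.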

What's the point?

A sample can find funny reviews. It cannot tell you whether Video Games is actually more profane than Beauty, or whether one 10,594-exclamation review is rare or part of a pattern.

I also like that this version does not need an LLM. Regexes, counters, lengths, caps, punctuation, context classifiers, and heaps are enough to produce both the public Wall of Rants and the much harsher Unhinged Mode.
