Page cover

Ranking 1.7M Airbnbs by TV location

In this example we:

  • Process every public Inside Airbnb listing across 119 cities and 4 quarterly snapshots.

  • CLIP-score 1.7M photos, then use Claude Haiku Vision to validate the shortlists.

  • Analyze 50,686,612 reviews with a staged review funnel.

  • Reduce everything into reusable Parquet artifacts and site JSON.

This is the demo where the infrastructure really changes the experiment. A few cities can make a nice story. The whole corpus tells you whether the story survives contact with the rest of the world.

Dataset: Inside Airbnb snapshots

The input is public city-level Inside Airbnb data: listings, photos, and reviews across multiple quarterly snapshots.

import json
from dataclasses import dataclass
from pathlib import Path

from burla import remote_parallel_map

ROOT = Path("/workspace/shared/airbnb")
LISTING_DIR = ROOT / "listings"
PHOTO_DIR = ROOT / "photos"
REVIEW_DIR = ROOT / "reviews"
FINAL_DIR = ROOT / "final"
N_WORKERS = 1000

@dataclass(frozen=True)
class CitySnapshot:
    city_slug: str
    snapshot: str
    listings_url: str
    reviews_url: str

city_snapshots = [
    CitySnapshot(**row)
    for row in json.loads(Path("inside-airbnb-snapshots.json").read_text())
]

Step 1: Write artifacts, not vibes

Every stage writes Parquet under /workspace/shared/airbnb. That way a failed GPU stage does not force a listing re-download, and a new correlation test can reuse the same scored files.

That wrapper is not magic. It just keeps the resource choices visible at each stage.

Step 2: Normalize listings first

The listing stage downloads each city snapshot, keeps the columns the later stages need, and writes one Parquet file per city snapshot.

Run one city before the full matrix.

Then run every city snapshot.

Step 3: Score every photo first

The photo stage downloads image URLs, runs CLIP on CPU workers, and writes Parquet shards for the next stages. The broad pass stays cheap and parallel.

The source notebook builds photo_batches from the listing artifacts, with each batch containing image URLs and listing ids.

The broad pass should be cheap enough to run over every photo. It only needs to produce shortlist scores.

Step 4: Validate the photo shortlists

The original YOLOv8 GPU stage is still in the source repo for completeness, but it is not the live photo pipeline. The published demo uses Haiku Vision to validate the CLIP shortlists for pets, TV placement, hectic kitchens, and drug-den vibes.

This is a useful split: cheap model over everything, expensive model only where it can change the answer.

Step 5: Funnel the reviews

Reviews use the same shape: cheap heuristic scoring on every review, SBERT embeddings for the top 200,000, then Claude Haiku scoring for the top 12,000.

The review batches are built from the downloaded review artifacts. Each stage narrows the candidate set before the next, more expensive stage.

Each stage writes an artifact. The next stage reads artifacts, not notebook variables.

Step 6: Reduce into analysis outputs

The final reducer joins listings, photo labels, review scores, and bootstrap confidence intervals into the files used by the public site.

What's the point?

I would not start with the website. I would start with artifact contracts: listings, photo manifest, CPU image scores, Haiku-validated photo categories, review scores, correlations, site JSON.

That sounds boring, but it is what keeps the project honest. You can rerun one stage, inspect the bad shard, or try a new hypothesis without paying for everything again. I think this is the difference between a fun notebook and an analysis you can actually believe.

Last updated