
Clustering 2.7M arXiv abstracts

In this example we:

  • Embed 2,710,783 arXiv abstracts.

  • Cluster the whole corpus with MiniBatchKMeans.

  • Use FAISS to find lonely papers and topic clusters that faded over time.

If the question is historical, a sample of recent papers is almost useless: it starts after half the history has already happened.

Dataset: arXiv metadata JSONL

The arXiv snapshot is one large JSONL file. We first turn it into Parquet shards so the embedding stage has clean inputs.

import json
from pathlib import Path

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
from burla import remote_parallel_map
from sentence_transformers import SentenceTransformer

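# Shared storage layout: the raw snapshot, then one directory per stage.
RAW_JSONL = Path("/workspace/shared/arxiv/arxiv-metadata-oai-snapshot.json")
RAW_DIR = Path("/workspace/shared/arxiv/raw")  # metadata shards (Step 1)
VEC_DIR = Path("/workspace/shared/arxiv/vectors")  # embedding shards (Step 2)
FINAL_DIR = Path("/workspace/shared/arxiv/final")  # reduce-stage outputs (Step 3)
PAPERS_PER_SHARD = 10_000  # rows per Parquet shard
EMBED_BATCH = 128  # texts per batch passed to the embedding model
MODEL_NAME = "BAAI/bge-small-en-v1.5"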

Step 1: Shard the metadata

The client streams the JSONL once and writes 10,000-paper Parquet shards into shared storage.
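A minimal sketch of that sharding pass, assuming every line of the snapshot is one JSON object. The list of fields to keep, the helper names `iter_batches`, `write_shard`, and `shard_all`, and the shard file naming are illustrative choices, not taken from the original pipeline.

```python
import json
from pathlib import Path

PAPERS_PER_SHARD = 10_000
# Assumed subset of fields to keep; adjust to what the later stages need.
KEEP = ("id", "title", "abstract", "categories", "update_date")


def iter_batches(jsonl_path, batch_size=PAPERS_PER_SHARD):
    """Stream the JSONL once, yielding lists of at most batch_size records."""
    batch = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            paper = json.loads(line)
            # Missing fields become None instead of raising.
            batch.append({k: paper.get(k) for k in KEEP})
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly short, shard


def write_shard(batch, out_dir, shard_idx):
    """Write one batch as a Parquet file and return its path."""
    # Imported here so the batching logic above has no pyarrow dependency.
    import pyarrow as pa
    import pyarrow.parquet as pq

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"shard-{shard_idx:05d}.parquet"
    pq.write_table(pa.Table.from_pylist(batch), str(path))
    return path


def shard_all(jsonl_path, out_dir):
    """Single pass over the snapshot: one Parquet file per batch."""
    return [write_shard(b, out_dir, i)
            for i, b in enumerate(iter_batches(jsonl_path))]
```

At 10,000 papers per shard, 2,710,783 abstracts come to 272 shards, the last one holding 783 papers. Because only one batch is held in memory at a time, the client never needs to load the full snapshot.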

Step 2: Embed each shard

Each worker reads one raw shard, embeds title plus abstract, normalizes the vectors, and writes another Parquet shard.

Run one shard first, then launch the full embedding pass.
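The worker could look roughly like the sketch below. It assumes the raw shards carry `id`, `title`, and `abstract` columns and that `remote_parallel_map(function, inputs)` returns the list of function results; `paper_text`, `embed_shard`, `run_embedding`, and the output naming are hypothetical names introduced for illustration.

```python
from pathlib import Path

MODEL_NAME = "BAAI/bge-small-en-v1.5"
EMBED_BATCH = 128
VEC_DIR = Path("/workspace/shared/arxiv/vectors")


def paper_text(title, abstract):
    """Join title and abstract, collapsing hard line breaks and extra spaces."""
    parts = [" ".join((s or "").split()) for s in (title, abstract)]
    return "\n".join(p for p in parts if p)


def embed_shard(raw_path):
    """Embed one raw shard and write a vector shard; returns the output path."""
    # Heavy imports live inside the function so they load on the worker.
    import pyarrow as pa
    import pyarrow.parquet as pq
    from sentence_transformers import SentenceTransformer

    rows = pq.read_table(
        str(raw_path), columns=["id", "title", "abstract"]
    ).to_pylist()
    texts = [paper_text(r["title"], r["abstract"]) for r in rows]

    model = SentenceTransformer(MODEL_NAME)
    # normalize_embeddings=True returns unit-length vectors, so an inner
    # product between two of them equals their cosine similarity.
    vecs = model.encode(texts, batch_size=EMBED_BATCH, normalize_embeddings=True)

    VEC_DIR.mkdir(parents=True, exist_ok=True)
    out = VEC_DIR / Path(raw_path).name  # same file name as the raw shard
    table = pa.table({
        "id": [r["id"] for r in rows],
        "vector": [v.tolist() for v in vecs],
    })
    pq.write_table(table, str(out))
    return str(out)


def run_embedding(raw_dir):
    """Smoke-test one shard, then fan out over all of them."""
    from burla import remote_parallel_map

    shards = sorted(str(p) for p in Path(raw_dir).glob("*.parquet"))
    remote_parallel_map(embed_shard, shards[:1])  # fail fast on a single shard
    return remote_parallel_map(embed_shard, shards)  # one call per shard
```

In this sketch each call loads the model again, and the full pass re-embeds the shard used for the smoke test. Both costs are small next to the work of embedding the whole corpus, and they keep the worker stateless.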

Step 3: Reduce the whole corpus

The reduce worker loads the vector shards, fits MiniBatchKMeans on a sample, predicts cluster labels for every paper, and builds a FAISS nearest-neighbor index.
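One way the reduce step might be written, as a sketch under stated assumptions: the vector shards have `id` and `vector` columns, and the values `n_clusters=500` and `sample_size=200_000` are placeholders, not settings from the original run. The exact `IndexFlatIP` index is also a simplification; at this scale an approximate FAISS index may be the more practical choice. The "lonely" score here is one possible definition: the mean similarity to a paper's k nearest neighbors, where a low value means an isolated paper.

```python
from pathlib import Path

import numpy as np


def load_vectors(vec_dir):
    """Stack every vector shard into one float32 matrix plus an id list."""
    import pyarrow.parquet as pq

    ids, chunks = [], []
    for path in sorted(Path(vec_dir).glob("*.parquet")):
        table = pq.read_table(str(path))
        ids.extend(table.column("id").to_pylist())
        chunks.append(
            np.asarray(table.column("vector").to_pylist(), dtype=np.float32)
        )
    return ids, np.vstack(chunks)


def cluster_and_index(vecs, n_clusters=500, sample_size=200_000, seed=0):
    """Fit k-means on a random sample, label every row, build an IP index."""
    import faiss
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.default_rng(seed)
    take = min(sample_size, len(vecs))
    sample = vecs[rng.choice(len(vecs), size=take, replace=False)]

    km = MiniBatchKMeans(
        n_clusters=n_clusters, batch_size=4096, n_init=3, random_state=seed
    )
    km.fit(sample)
    labels = km.predict(vecs)  # labels for all papers, not just the sample

    # Vectors are unit length, so inner product is cosine similarity.
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.ascontiguousarray(vecs, dtype=np.float32))
    return labels, index


def lonely_scores(index, vecs, k=10, chunk=100_000):
    """Mean similarity to the k nearest neighbors; low values mean lonely."""
    scores = np.empty(len(vecs), dtype=np.float32)
    for start in range(0, len(vecs), chunk):
        # Ask for k + 1 hits and drop the first, assumed to be the query itself.
        sims, _ = index.search(vecs[start:start + chunk], k + 1)
        scores[start:start + chunk] = sims[:, 1:].mean(axis=1)
    return scores
```

With 384-dimensional vectors from bge-small-en-v1.5, the full matrix for 2,710,783 papers is roughly 4.2 GB as float32, which is why this stage runs on one larger worker rather than many small ones.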

What's the point?

The worker cannot know whether a topic is extinct. It only sees one shard. The label comes later, once the whole archive is visible.
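To make that concrete, here is a hedged sketch of a corpus-level "faded" test that can only run once every paper has a cluster label. The definition used (a cluster has faded when its best yearly share since `recent_from` is below `drop` times its all-time peak share) and the names `yearly_share`, `faded_clusters`, `recent_from`, and `drop` are assumptions for illustration; the source does not specify how fading is measured. It also assumes a publication year is available for each paper.

```python
from collections import Counter, defaultdict


def yearly_share(labels, years):
    """share[cluster][year] = fraction of that year's papers in the cluster."""
    per_year = Counter(years)
    pair_counts = Counter(zip(labels, years))
    share = defaultdict(dict)
    for (cluster, year), n in pair_counts.items():
        share[cluster][year] = n / per_year[year]
    return share


def faded_clusters(labels, years, recent_from, drop=0.2):
    """Clusters whose best recent share is below drop * their peak share."""
    faded = []
    for cluster, by_year in yearly_share(labels, years).items():
        peak = max(by_year.values())
        # Years with no papers in the cluster are absent, so default to 0.
        recent = max(
            (s for y, s in by_year.items() if y >= recent_from), default=0.0
        )
        if recent < drop * peak:
            faded.append(cluster)
    return sorted(faded)
```

Shares are used instead of raw counts because the archive grows every year; a cluster can gain papers in absolute terms while still shrinking relative to everything else. If the labels come from scikit-learn as a NumPy array, passing `labels.tolist()` keeps the output as plain Python integers.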

That is the useful shape here: many workers produce vectors, then one bigger worker makes the global decision. If I were doing this for patents, PubMed abstracts, legal opinions, or internal docs, I would keep the same split. The map stage is embarrassingly parallel. The reduce stage is where the corpus-level question becomes possible.
