
Summarize 1M GitHub READMEs

In this example we:

  • Export 1,200,000 GitHub READMEs from BigQuery.

  • Upload the Parquet to Burla shared storage.

  • Run deterministic summarizers over 600 stable shards.

  • Reduce category counts, examples, and keyword statistics into frontend JSON.

I like this one because the first instinct is to ask an LLM to summarize every README. That would make individual rows prettier and the aggregate harder to trust.

Dataset: README Parquet export

The BigQuery export includes repository metadata, README text, language, stars, and a deterministic shard_id.

import heapq
import json
from collections import Counter, defaultdict
from pathlib import Path

import pandas as pd
import pyarrow.dataset as ds
from burla import remote_parallel_map

PARQUET_PATH = "/workspace/shared/grs/readmes.parquet"
SHARD_DIR = Path("/workspace/shared/grs/shards")
FINAL_DIR = Path("/workspace/shared/grs/final")
N_SHARDS = 600

CATEGORIES = {
    "ml": {"tensorflow": 4, "pytorch": 4, "embedding": 2, "llm": 4},
    "web": {"react": 3, "django": 2, "graphql": 3, "frontend": 2},
    "devops": {"docker": 3, "kubernetes": 4, "terraform": 4},
}

The shard_id is what keeps workers from all reading and filtering the same giant file by accident.
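The export already carries shard_id, so nothing below is needed to run the example. As a rough sketch of what "deterministic" means here, this is one way a stable shard could be derived from a repo name. The function name `stable_shard_id`, the choice of SHA-256, and hashing on the repo name are all assumptions for illustration; the actual export may compute shard_id differently inside BigQuery.

```python
import hashlib

N_SHARDS = 600


def stable_shard_id(repo_name: str, n_shards: int = N_SHARDS) -> int:
    """Map a repo name to a shard in [0, n_shards), identically on every run."""
    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and would give different shards on different workers.
    digest = hashlib.sha256(repo_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards
```

Because the mapping depends only on the input string, rerunning the export or a single shard should land every repo in the same place.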

Step 1: Score one README

The scoring function is deliberately inspectable. If a category looks wrong later, the weights are right here.

This is not trying to write beautiful prose. It is trying to make aggregate README patterns measurable.
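A minimal sketch of what such a scorer could look like, assuming lowercase word tokens and a plain weighted count. The names `score_readme` and `best_category`, the tokenizer regex, and the "other" fallback label are my assumptions, not necessarily what the original code uses. The weights are the CATEGORIES table from above.

```python
import re
from collections import Counter

CATEGORIES = {
    "ml": {"tensorflow": 4, "pytorch": 4, "embedding": 2, "llm": 4},
    "web": {"react": 3, "django": 2, "graphql": 3, "frontend": 2},
    "devops": {"docker": 3, "kubernetes": 4, "terraform": 4},
}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def score_readme(text: str) -> dict:
    """Return a score per category: keyword weight times keyword frequency."""
    tokens = Counter(TOKEN_RE.findall(text.lower()))
    return {
        category: sum(weight * tokens[word] for word, weight in weights.items())
        for category, weights in CATEGORIES.items()
    }


def best_category(scores: dict) -> str:
    """Pick the highest-scoring category, or 'other' if no keyword matched."""
    category, score = max(scores.items(), key=lambda kv: kv[1])
    return category if score > 0 else "other"
```

For example, `score_readme("PyTorch embedding demo with Docker")` gives 6 for ml (4 + 2), 0 for web, and 3 for devops, so `best_category` returns "ml". Ties go to whichever category is listed first in CATEGORIES, which keeps the result reproducible.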

Step 2: Summarize one shard

Each worker reads one shard_id, scores the READMEs, and writes a JSON shard.
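Here is a hedged sketch of that worker. The column names `repo_name`, `readme`, and `stars`, the output file naming, and the split into `summarize_rows` plus `summarize_shard` are assumptions on my part. The `score` argument is any callable that turns README text into a category label, so the aggregation can be checked without Parquet at all.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_rows(rows, score):
    """Aggregate one shard's rows into counts plus one example per category.

    rows:  iterable of dicts with (assumed) keys repo_name, readme, stars.
    score: callable mapping README text to a category label.
    """
    counts = Counter()
    examples = {}
    for row in rows:
        category = score(row.get("readme") or "")
        counts[category] += 1
        stars = row.get("stars") or 0
        best = examples.get(category)
        # Keep the most-starred repo seen so far as the category's example.
        if best is None or stars > best["stars"]:
            examples[category] = {"repo_name": row["repo_name"], "stars": stars}
    return {"n_rows": sum(counts.values()), "counts": dict(counts), "examples": examples}


def summarize_shard(shard_id, score, parquet_path, shard_dir):
    """Read only the rows for shard_id, summarize them, and write a JSON shard."""
    # Imported lazily so summarize_rows stays usable without pyarrow installed.
    import pyarrow.dataset as ds

    table = ds.dataset(parquet_path, format="parquet").to_table(
        columns=["repo_name", "readme", "stars"],
        filter=ds.field("shard_id") == shard_id,
    )
    summary = summarize_rows(table.to_pylist(), score)
    summary["shard_id"] = shard_id
    out = Path(shard_dir) / f"shard-{shard_id:04d}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(summary, sort_keys=True))
    return str(out)
```

Pushing the shard_id filter into the read means a worker only materializes its own rows instead of loading the whole file into memory first.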

Step 3: Run the shards

Smoke test one shard, then run the full shard list.
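The smoke-test-then-fan-out pattern can be sketched like this. I pass the mapper in as an argument so the same function runs locally and on the cluster. `run_shards` and `local_map` are hypothetical names, and I am assuming the cluster mapper takes a function and a list of inputs and returns one result per input.

```python
def local_map(fn, inputs):
    """Stand-in mapper for local runs: apply fn to each input in order."""
    return [fn(x) for x in inputs]


def run_shards(worker, n_shards, mapper):
    """Run one shard as a smoke test, then run every shard through mapper."""
    # Smoke test: a bad path or missing column fails here, on one shard,
    # instead of failing on every worker at once.
    smoke = list(mapper(worker, [0]))
    if len(smoke) != 1:
        raise RuntimeError(f"smoke test returned {len(smoke)} results, expected 1")

    # Full run. Shard 0 is processed again, which is harmless as long as the
    # worker is deterministic and overwrites its own output.
    results = list(mapper(worker, list(range(n_shards))))
    if len(results) != n_shards:
        raise RuntimeError(f"got {len(results)} results for {n_shards} shards")
    return results


# On the cluster, the mapper would be Burla's remote_parallel_map, for example
# run_shards(worker, 600, remote_parallel_map), where worker takes a shard_id.
```

Checking the result count at the end is a cheap guard against a silently missing shard before the reduce step trusts the output directory.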

Step 4: Reduce counters and examples

The reducer keeps counts plus small heaps of representative repos.
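A minimal sketch of that reduce, assuming each shard summary is a dict with `counts` (category to row count) and `examples` (category to a repo name and star count). That shape, the name `reduce_shards`, and ranking examples by stars are assumptions. Keyword statistics are left out to keep the sketch short.

```python
import heapq
from collections import Counter, defaultdict


def reduce_shards(shard_summaries, k=3):
    """Merge per-shard counts and keep the k most-starred examples per category."""
    counts = Counter()
    heaps = defaultdict(list)  # category -> min-heap of (stars, repo_name)

    for summary in shard_summaries:
        counts.update(summary["counts"])
        for category, example in summary["examples"].items():
            item = (example["stars"], example["repo_name"])
            heap = heaps[category]
            if len(heap) < k:
                heapq.heappush(heap, item)
            else:
                # Push the new item, then drop the smallest, so the heap
                # always holds the k largest items seen so far.
                heapq.heappushpop(heap, item)

    examples = {
        category: [
            {"repo_name": name, "stars": stars}
            for stars, name in sorted(heap, reverse=True)
        ]
        for category, heap in heaps.items()
    }
    return {"counts": dict(counts), "examples": examples}
```

The heaps never grow past k entries per category, so memory stays flat no matter how many shards are merged. The returned dict can be written out with `json.dumps` as the frontend JSON.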

What's the point?

Pretty summaries of famous repos are the boring version. I care about README culture at scale: install instructions, badges, code fences, category words, cloned templates, and empty placeholders.

A model would make the rows sound smoother. I do not want smoother here. I want counts I can debug. If a category looks wrong, I can inspect the keyword weights and rerun the reduce.
