Map geotagged Flickr photos

In this example we:

  • Process 9,487,758 public Flickr photos from the YFCC100M OpenAI subset.

  • Reverse-geocode every photo that has a latitude and longitude.

  • Build country-level signatures from user-written tags.

  • Reduce shard outputs into browsable country and region summaries.

No captions are generated here. The text comes from people, which is part of what makes the result interesting.

Dataset: YFCC100M metadata shards

The input is a set of compressed JSONL metadata shards on Hugging Face. Each row may contain a title, user tags, and optional latitude/longitude.

import gzip
import json
import os
from collections import Counter, defaultdict
from pathlib import Path

import requests
from huggingface_hub import hf_hub_url
import reverse_geocoder as rg
from burla import remote_parallel_map

REPO_ID = "dalle-mini/YFCC100M_OpenAI_subset"
SHARD_DIR = Path("/workspace/shared/wpi/shards")
FINAL_DIR = Path("/workspace/shared/wpi/final")
SHARD_IDS = [f"{i:05d}" for i in range(96)]
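Reading one shard amounts to decompressing gzip bytes and parsing one JSON object per line. The sketch below assumes the shard has already been fetched into memory (for example with `requests.get(url).content`). The field names `title`, `usertags`, `latitude`, and `longitude` are assumptions about the row schema, and `iter_rows` is a hypothetical helper, not part of any library.

```python
import gzip
import io
import json


def iter_rows(gz_bytes):
    """Yield one dict per non-empty line of a gzip-compressed JSONL payload."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


# Tiny in-memory shard standing in for a real download (made-up rows,
# assumed field names).
sample = [
    {"title": "Harbour at dusk", "usertags": "harbour,boats",
     "latitude": 60.39, "longitude": 5.32},
    {"title": "Untitled", "usertags": "", "latitude": None, "longitude": None},
]
payload = gzip.compress("\n".join(json.dumps(r) for r in sample).encode("utf-8"))
rows = list(iter_rows(payload))
```

Yielding rows one at a time keeps memory flat even when a shard holds many thousands of records.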

Step 1: Process one metadata shard per worker

Each worker downloads one metadata shard, keeps geotagged photos, reverse-geocodes them, and writes compact JSONL.

The worker writes only the fields needed for the later token counts. It does not keep image bytes or generate captions.
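The core of such a worker might look like the sketch below. It is not the original worker code: `compact_records` and `write_jsonl` are hypothetical helpers, the input field names are assumed, and the geocoder is passed in as a parameter so the logic can be tried without downloading anything. In a real run, `geocode` could be `rg.search`, which takes a list of `(lat, lon)` tuples and returns one dict per coordinate with keys including `cc` and `admin1`.

```python
import io
import json


def compact_records(rows, geocode):
    """Keep geotagged rows, reverse-geocode them in one batch, return small dicts."""
    geotagged = []
    for row in rows:
        try:
            lat, lon = float(row["latitude"]), float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or malformed coordinates: treat as not geotagged
        geotagged.append((row, lat, lon))
    if not geotagged:
        return []
    # One batched call; results are expected in the same order as the inputs.
    places = geocode([(lat, lon) for _, lat, lon in geotagged])
    return [
        {
            "cc": place["cc"],
            "admin1": place["admin1"],
            "title": row.get("title") or "",
            "tags": row.get("usertags") or "",
        }
        for (row, _, _), place in zip(geotagged, places)
    ]


def write_jsonl(records, fh):
    """Write one JSON object per line to an open text file handle."""
    for rec in records:
        fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def fake_geocode(coords):
    # Stand-in for rg.search: one result dict per coordinate, same order.
    return [{"cc": "NO", "admin1": "Vestland"} for _ in coords]


rows = [
    {"title": "Harbour", "usertags": "harbour,boats",
     "latitude": "60.39", "longitude": "5.32"},
    {"title": "No location", "usertags": "cat", "latitude": None, "longitude": None},
]
records = compact_records(rows, fake_geocode)
buffer = io.StringIO()
write_jsonl(records, buffer)
```

Batching the geocoder call once per shard, rather than once per photo, avoids repeating the lookup setup for every row.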

Step 2: Run the shard workers

Run the worker over every shard ID in SHARD_IDS to cover the full metadata set.
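The fan-out can be sketched as a helper that accepts any map-like callable taking a function and a list of inputs. That is the calling shape of `remote_parallel_map`, and also of the builtin `map`, so the same code can be tried locally on a few shards first. `run_shards`, `demo_worker`, and the `kept` field in the worker's return value are hypothetical names used only for illustration.

```python
def run_shards(worker, shard_ids, parallel_map=map):
    """Apply worker to every shard ID and summarize the per-shard results.

    parallel_map must accept (function, inputs) and return an iterable of
    the worker's return values.
    """
    results = list(parallel_map(worker, shard_ids))
    return {
        "shards": len(results),
        "kept": sum(r["kept"] for r in results),
    }


def demo_worker(shard_id):
    # Stand-in for the real shard worker; pretends each shard kept 10 photos.
    return {"shard_id": shard_id, "kept": 10}


shard_ids = [f"{i:05d}" for i in range(96)]
# Local dry run on the first three shards using the builtin map.
summary = run_shards(demo_worker, shard_ids[:3])
```

On a cluster, the same helper would presumably be called as `run_shards(process_shard, SHARD_IDS, parallel_map=remote_parallel_map)`, where `process_shard` is a placeholder name for the Step 1 worker.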

Step 3: Reduce counters

The reduce stage reads shard JSONL files and counts user-written words by geography.
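A minimal sketch of that reduce step follows. It assumes each shard line carries `cc`, `admin1`, `title`, and `tags` fields (assumed names for the compact records) and uses a deliberately simple tokenizer that lowercases the text and splits on non-word characters. The real pipeline may tokenize differently.

```python
import io
import json
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"\w+")


def tokens(record):
    """Lowercased word tokens from a record's title and tags."""
    text = f'{record.get("title") or ""} {record.get("tags") or ""}'.lower()
    return TOKEN_RE.findall(text)


def reduce_counts(jsonl_files):
    """Count tokens per country code and per (country, region) pair."""
    by_country = defaultdict(Counter)
    by_region = defaultdict(Counter)
    for fh in jsonl_files:
        for line in fh:
            if not line.strip():
                continue
            rec = json.loads(line)
            words = tokens(rec)
            by_country[rec["cc"]].update(words)
            by_region[(rec["cc"], rec["admin1"])].update(words)
    return by_country, by_region


# Two in-memory "shard files" with made-up records.
shard_a = io.StringIO(
    '{"cc": "NO", "admin1": "Vestland", "title": "Fjord", "tags": "fjord,boat"}\n'
)
shard_b = io.StringIO(
    '{"cc": "NO", "admin1": "Oslo", "title": "", "tags": "fjord"}\n'
    '{"cc": "JP", "admin1": "Tokyo", "title": "Sakura", "tags": "sakura"}\n'
)
by_country, by_region = reduce_counts([shard_a, shard_b])
```

Because `Counter.update` is additive, shards can be folded in any order, and partial counters from separate processes can be merged the same way.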

What's the point?

A tag map gets better when it gets bigger. Small samples overstate tourist centers and erase regional vocabulary. The full run lets weird country signatures compete because every geotagged photo gets a vote.

My favorite part is that this is mostly not ML. Reverse geocoding and token counting answer the question directly. If the metadata already contains the signal, spend the compute on coverage instead of inventing a model step.
