For the nerds — how the data is gathered
The short version
Bloomberg publishes a public ranking of the 500 wealthiest people at
bloomberg.com/billionaires/, plus a profile page per person.
A scheduler hits the list page on a cron and writes the current numbers
to a local SQLite database. A separate one-shot backfill walks every
profile page and pulls each person's full daily net-worth history
(some go back to 2012). After backfill, the daily scrape continues to
extend that history by one row per person per day. A second
enrichment pass cross-references each person against
Wikidata to map family ties, shared employers/schools/boards,
overlapping public-equity holdings, and to pull authoritative
metadata (gender, photo, biography, signature). A third enrichment
pass mines each person's Wikipedia article citations
for dated news references — IPOs, lawsuits, deaths, acquisitions
cited in their bio — so each profile carries a multi-decade news
timeline without needing a paid news API. A fourth pass imports a
public Kaggle dataset of Forbes annual rankings 2001-2024
so the time-travel UI can reconstruct any year's ranking even before
Bloomberg started tracking. Two SQLite files are
the source of truth: bloomberg.db for the scrape data,
the news, and the historical Forbes rankings;
network.db for everything Wikidata-derived.
The technical version
Sources.
Both Bloomberg pages embed JSON inside <script>
tags as JavaScript globals — window.top500 on the list
page (an array of 500 person records with rank, net worth, asset
breakdowns, biography, milestones, schools, etc.) and
window.profileData on each profile page (the same record
plus a stats array of [date, net_worth_usd]
pairs). Wikidata is queried via its public SPARQL endpoint at
query.wikidata.org plus the wbsearchentities
Action API module for QID resolution; Wikipedia's REST page summary endpoint
supplies thumbnail fallbacks when a person has no
P18 image. News comes from two sources: the
Wikipedia action=parse API (raw wikitext for citation
extraction, rate limit ~1 req/sec) and the
GDELT 2.0 Doc API at
api.gdeltproject.org/api/v2/doc/doc for recent news
(free, keyless, ≥5s between requests, English filter applied).
Historical Forbes rankings 2001-2024 come from the
guillemservera/forbes-billionaires-1997-2023 Kaggle dataset
(CC0-1.0, 34k rows of monthly snapshots that we dedupe to
one row per (year, rank) by max net worth). A small Wikipedia
scraper handles years past the Kaggle freeze.
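The dedupe step for the Kaggle rows can be sketched in a few lines. This is a minimal illustration, assuming each row is a dict with year, rank, and net_worth_usd keys; the real import may read the CSV differently, and the sample rows are made up.

```python
def dedupe_by_year_rank(rows):
    """Keep one row per (year, rank): the one with the largest net worth."""
    best = {}
    for row in rows:
        key = (row["year"], row["rank"])
        if key not in best or row["net_worth_usd"] > best[key]["net_worth_usd"]:
            best[key] = row
    return sorted(best.values(), key=lambda r: (r["year"], r["rank"]))


# Made-up sample: two monthly snapshots for the same (year, rank).
rows = [
    {"year": 2020, "rank": 1, "name": "A", "net_worth_usd": 110e9},
    {"year": 2020, "rank": 1, "name": "A", "net_worth_usd": 113e9},
    {"year": 2020, "rank": 2, "name": "B", "net_worth_usd": 98e9},
]
deduped = dedupe_by_year_rank(rows)
```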
Fetching.
Bloomberg fingerprints non-browser TLS clients and serves a captcha
page. curl_cffi impersonates Chrome's TLS handshake and
HTTP/2 settings, so the responses come back as real HTML. A regex
pulls the JSON literal out of the HTML and json.loads
parses it. Wikidata calls are batched (typically 80 QIDs per
SPARQL request) with retry-on-429 falling back to smaller chunks.
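As a rough sketch of the extraction step: a regex locates the window.<name> assignment, then the JSON literal is decoded from that offset. This variant uses json.JSONDecoder.raw_decode instead of a capture group plus json.loads, so the regex does not have to guess where the literal ends. The helper name and the sample HTML are hypothetical; the real page would be the response fetched with curl_cffi.

```python
import json
import re


def extract_js_global(html, name):
    """Return the JSON value assigned to window.<name> inside a <script> tag."""
    match = re.search(r"window\." + re.escape(name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"window.{name} not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing semicolon, more script).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Made-up page fragment standing in for the real list page.
sample_html = """
<html><script>
window.top500 = [{"rank": 1, "name": "Example Person", "tags": ["a];b"]}];
</script></html>
"""
people = extract_js_global(sample_html, "top500")
```

Decoding from an offset also copes with strings that happen to contain "];", which a non-greedy capture group would trip over.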
Storage.
Two SQLite databases, several tables each. From
bloomberg.db:
- persons: static-ish fields (name, citizenship, biography, slug, gender). Upserted on person_id.
- snapshots: append-only time series of rich data (rank, net worth, last/ytd change, public/private/cash asset totals + JSON breakdowns, liabilities), one row per scrape per person.
- wealth_history: narrow daily series, primary key (person_id, date). Idempotent: backfilling and the daily scrape both INSERT OR REPLACE into it.
- news_articles: one row per (person_id, url) citation. Carries article_date, date_precision ('day' / 'month' / 'year'), title, source, and an importance integer (see below). Sister table news_fetched tracks when each person was last hit and whether the historical backfill has run; news_co_occurrence stores per-URL bookkeeping so cross-person rescoring is idempotent.
- historical_rankings: annual Forbes lists, primary key (source, year, rank). Stores person_id (NULLable for names not in our Bloomberg set), name, net_worth_usd, citizenship, age, industry, notes. Sources include forbes_kaggle (Kaggle dataset 2001-2024) and forbes_world (Wikipedia scrape, sparser, kept as fallback for years past the Kaggle freeze).
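To make the wealth_history idempotency concrete, here is a minimal sketch using an in-memory database. The column types and the upsert_history helper are assumptions; only the table name, the (person_id, date) primary key, and the INSERT OR REPLACE behavior come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE wealth_history (
        person_id INTEGER NOT NULL,   -- assumed type
        date      TEXT    NOT NULL,   -- ISO date, e.g. '2024-01-02'
        net_worth REAL    NOT NULL,
        PRIMARY KEY (person_id, date)
    )
    """
)


def upsert_history(conn, person_id, pairs):
    """Write (date, net_worth) pairs; re-running overwrites instead of duplicating."""
    conn.executemany(
        "INSERT OR REPLACE INTO wealth_history (person_id, date, net_worth) "
        "VALUES (?, ?, ?)",
        [(person_id, date, net_worth) for date, net_worth in pairs],
    )
    conn.commit()


# A backfill and a later daily scrape overlap on 2024-01-02.
upsert_history(conn, 1, [("2024-01-01", 100.0), ("2024-01-02", 101.0)])
upsert_history(conn, 1, [("2024-01-02", 102.0), ("2024-01-03", 103.0)])
rows = conn.execute(
    "SELECT date, net_worth FROM wealth_history ORDER BY date"
).fetchall()
```

The overlapping day ends up with a single row holding the most recently written value.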
From network.db (Wikidata-derived, regenerated by the
refresh job):
- persons_index: mirror of persons with the resolved wikidata_qid, image_url, signature_filename, and a JSON wikidata_metadata blob (description, Wikipedia URL, birth/death dates and places, residence, occupations, languages, children count).
- family_edges: undirected person↔person edges (spouse, parent, child, sibling, relative) derived from Wikidata properties P22/P25/P26/P40/P3373/P1038.
- entities + entity_links: bridging organizations (employers, schools, boards, awards, positions held) from P108/P69/P463/P3320/P800/P166/P39, plus inverse properties. Only entities connecting ≥2 of our billionaires are kept, so every node is by definition a connector. Year-by-year event editions (e.g. "WEF Annual Meeting 2018") are collapsed to their parent series via P179.
- entity_edges: entity↔entity relations (parent company, subsidiary, owned-by) so the graph can show structures like "Berkshire owns BNSF + GEICO + Apple" as a single spine. A second-tier pass adds up to 100 one-hop neighbors that bridge ≥2 first-tier entities.
- holdings_bridges: public-equity tickers that appear in the asset breakdowns of ≥2 billionaires, linking them through their shared positions.
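The "keep only what connects ≥2 billionaires" rule is the same for entities and for holdings. A minimal sketch for the holdings case, assuming the asset breakdowns have already been reduced to a person-to-tickers mapping (the function name, ids, and tickers are placeholders):

```python
from collections import defaultdict


def shared_holdings(holdings_by_person, min_holders=2):
    """Map each ticker to its holders, keeping tickers held by >= min_holders people."""
    holders = defaultdict(set)
    for person_id, tickers in holdings_by_person.items():
        for ticker in tickers:
            holders[ticker].add(person_id)
    return {
        ticker: sorted(people)
        for ticker, people in holders.items()
        if len(people) >= min_holders
    }


# Placeholder ids and tickers.
bridges = shared_holdings(
    {"p1": ["AAA", "BBB"], "p2": ["BBB"], "p3": ["BBB", "CCC"]}
)
```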
Time travel.
The "as of [date]" slider on the Table tab unions
Bloomberg's daily wealth_history with the
annual historical_rankings snapshots. For
each person we take the most recent observation
at-or-before the target date across both sources;
Bloomberg wins when both have data for that year because
its dates are exact while Forbes years collapse to
YYYY-12-31. Within Forbes, forbes_kaggle is
preferred over the sparser forbes_world
Wikipedia scrape. Rank is computed in Python by sorting
the resulting per-person rows by net worth — that lets
the slider reach back to 2001 even though Bloomberg
history only goes to 2012. The diff mode runs two
as-of queries side by side and computes entered/exited/
rank-delta sets in a single pass.
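One way to read the as-of rules above, sketched in plain Python. The observation tuple shape and the PRIORITY table are assumptions; Forbes rows are expected to arrive already dated YYYY-12-31. Within a calendar year the higher-priority source wins, otherwise the most recent observation does.

```python
# Higher number wins when two sources cover the same year (assumed encoding).
PRIORITY = {"bloomberg": 2, "forbes_kaggle": 1, "forbes_world": 0}


def as_of(observations, target):
    """observations: iterable of (person_id, source, iso_date, net_worth).

    Returns [(rank, person_id, net_worth, source)], ranked by net worth.
    """
    best = {}
    for person_id, source, date, net_worth in observations:
        if date > target:  # ISO dates compare correctly as strings
            continue
        key = (date[:4], PRIORITY[source], date)
        if person_id not in best or key > best[person_id][0]:
            best[person_id] = (key, net_worth, source)
    ranked = sorted(best.items(), key=lambda kv: kv[1][1], reverse=True)
    return [
        (rank, pid, net_worth, source)
        for rank, (pid, (_key, net_worth, source)) in enumerate(ranked, 1)
    ]


# Made-up observations (net worth in arbitrary units).
observations = [
    ("a", "bloomberg", "2015-06-01", 50.0),
    ("a", "forbes_kaggle", "2015-12-31", 48.0),
    ("b", "forbes_kaggle", "2015-12-31", 60.0),
    ("b", "forbes_world", "2015-12-31", 59.0),
    ("c", "bloomberg", "2016-03-01", 70.0),
]
table = as_of(observations, "2016-01-15")
```

Person "c" is absent because their only observation is after the target date; "a" keeps the Bloomberg value even though the Forbes row carries a later (collapsed) date in the same year.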
News pipeline.
News on each profile comes from two layers. Historical
timeline: a one-time backfill fetches each person's
Wikipedia article wikitext via the action=parse
API and extracts every {{cite news|...}} /
{{cite web|...}} template. URL, title, publisher
domain, and date are pulled out of the citation parameters; date
formats like "2018-03-12", "12 March 2018", "March 2018", and
"2018" are all accepted but tagged with their precision so the
chart can drop year/month-only fallbacks (those would land on a
wealth value that wasn't real for that date). Recent
coverage: a daily refresh hits GDELT for the last 30 days
of each person's news, ~6s between calls to stay under their
throttle. Articles dedupe on
(person_id, url); the news card mixes both sources
transparently.
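The precision tagging can be sketched with datetime.strptime and an ordered list of formats. This is an illustration only: the function name is made up, the real extractor may accept more formats, and %B assumes English month names (the default C locale).

```python
from datetime import datetime

# Most specific formats first; the second element is the precision tag.
FORMATS = [
    ("%Y-%m-%d", "day"),
    ("%d %B %Y", "day"),
    ("%B %Y", "month"),
    ("%Y", "year"),
]


def parse_citation_date(text):
    """Return (iso_date, precision), or (None, None) if no format matches."""
    text = text.strip()
    for fmt, precision in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return parsed.date().isoformat(), precision
    return None, None


results = [
    parse_citation_date(s)
    for s in ("2018-03-12", "12 March 2018", "March 2018", "2018")
]
```

Month- and year-only dates fall back to the first day of the period, which is exactly why the chart needs the precision tag to leave them out.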
News importance scoring. Each article gets a heuristic score so the chart can surface the few genuinely interesting events instead of every passing mention. Components: a baseline of 4 for any Wikipedia citation (editorial inclusion is itself a signal); per-keyword bonuses on the title (dies, lawsuit, indicted, acquires, resigns, divorce, etc.); a +3 trusted-source bonus (Reuters, Bloomberg, WSJ, FT, NYT, BBC, AP, Forbes, …); a citation-density bonus capped at +6 for URLs cited multiple times within the same Wikipedia article; and a cross-person co-occurrence bonus of +2 per additional billionaire whose page cites the same URL, capped at +12. "Rich list" articles ("Forbes 400", "Australia's Richest 200") are detected by title pattern and excluded from keyword scoring, since they're rankings, not events. The chart pins markers at day-precision dates only and renders the top 3 per year visible in the current range. The news card shows everything ordered chronologically with year-tab filters.
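A sketch of how those components could add up. The baseline of 4, the +3 source bonus, and the +6 / +12 caps come from the description above; the individual keyword weights, the +2 per repeat citation, the domain list, and the rich-list pattern are all placeholders.

```python
import re

# Hypothetical weights: the text names the keywords but not their values.
KEYWORD_BONUS = {"dies": 5, "indicted": 4, "lawsuit": 3, "acquires": 3,
                 "resigns": 3, "divorce": 2}
# Hypothetical domain spellings for a few of the trusted sources.
TRUSTED_SOURCES = {"reuters.com", "bloomberg.com", "wsj.com", "ft.com",
                   "nytimes.com", "bbc.com", "apnews.com", "forbes.com"}
# Hypothetical pattern for ranking-style titles.
RICH_LIST_RE = re.compile(r"\b(forbes 400|richest \d+|rich list)\b", re.IGNORECASE)


def importance(title, source, from_wikipedia, times_cited=1, other_people=0):
    score = 4 if from_wikipedia else 0
    if not RICH_LIST_RE.search(title):  # rankings get no keyword bonus
        words = set(re.findall(r"[a-z]+", title.lower()))
        score += sum(bonus for kw, bonus in KEYWORD_BONUS.items() if kw in words)
    if source in TRUSTED_SOURCES:
        score += 3
    score += min(6, 2 * max(0, times_cited - 1))  # assumed +2 per repeat citation
    score += min(12, 2 * other_people)            # +2 per additional billionaire
    return score


event_score = importance("Founder dies at 90", "reuters.com", True,
                         times_cited=2, other_people=1)
list_score = importance("Australia's Richest 200 revealed", "example.com", True)
```

With these placeholder weights the event scores 4 + 5 + 3 + 2 + 2 = 16, while the rich-list article stays at the baseline of 4.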
Schedule.
APScheduler runs cron-style jobs in-process. Each daily scrape pulls
the list page, writes to persons and snapshots,
and appends today's row to wealth_history. Three
follow-up jobs fire automatically a few seconds later: a
wealth-history backfill for any newcomer profile, a Wikidata
catch-up that resolves QIDs for new persons and pulls their
authoritative gender + photo + metadata, and a news refresh
that pulls the last 30 days from GDELT for everyone in the
latest snapshot (skipping anyone fetched within the past 20h
so the queue rotates through all 500 across days). Visiting
a profile that's never had news fetched also kicks off a
one-shot fetch in the background; the UI polls and renders
articles as soon as they arrive.
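The 20-hour skip rule is simple to sketch without the scheduler. This assumes the news_fetched bookkeeping can be read as a person-to-timestamp mapping; the function name and the "never fetched first, then oldest first" ordering are assumptions.

```python
from datetime import datetime, timedelta


def due_for_refresh(last_fetched, now, min_age=timedelta(hours=20)):
    """Return person ids whose last news fetch is missing or at least min_age old."""
    due = [
        pid for pid, fetched_at in last_fetched.items()
        if fetched_at is None or now - fetched_at >= min_age
    ]
    # Never-fetched people sort first, then the stalest fetches.
    return sorted(due, key=lambda pid: last_fetched[pid] or datetime.min)


now = datetime(2024, 1, 2, 12, 0)
queue = due_for_refresh(
    {
        "a": datetime(2024, 1, 2, 6, 0),   # 6h ago: skipped
        "b": datetime(2024, 1, 1, 10, 0),  # 26h ago: due
        "c": None,                         # never fetched: due
    },
    now,
)
```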
Network refresh. A separate manual job rebuilds the entire Wikidata graph end-to-end: QID resolution → family relations → person metadata + photos → bridging entities → entity↔entity edges → second-tier bridges → shared holdings. It's idempotent and safe to re-run.
Backfill.
A background thread iterates every person, fetches their profile,
parses stats, bulk-inserts to wealth_history.
A 1.5s sleep between requests keeps it polite: about 13 minutes for all 500 (500 × 1.5s is 12.5 minutes of sleeping, plus request time).
A "Sync from Snapshots" button copies any
(person_id, date, net_worth) pair that exists in
snapshots but not yet in wealth_history,
catching the small gap between deploy and first backfill.
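The "Sync from Snapshots" copy can be expressed as a single INSERT ... SELECT. A minimal sketch, assuming snapshots exposes a date column directly (the real table may store a scrape timestamp instead) and using INSERT OR IGNORE so pairs already present are left alone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE snapshots (person_id INTEGER, date TEXT, net_worth REAL);
    CREATE TABLE wealth_history (
        person_id INTEGER, date TEXT, net_worth REAL,
        PRIMARY KEY (person_id, date)
    );
    INSERT INTO snapshots VALUES (1, '2024-01-01', 10.0), (1, '2024-01-02', 11.0);
    INSERT INTO wealth_history VALUES (1, '2024-01-01', 10.0);
    """
)


def sync_from_snapshots(conn):
    """Copy snapshot pairs missing from wealth_history; return how many were added."""
    before = conn.execute("SELECT COUNT(*) FROM wealth_history").fetchone()[0]
    conn.execute(
        """
        INSERT OR IGNORE INTO wealth_history (person_id, date, net_worth)
        SELECT person_id, date, net_worth FROM snapshots
        """
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM wealth_history").fetchone()[0]
    return after - before


added_first = sync_from_snapshots(conn)
added_again = sync_from_snapshots(conn)
```

Running it twice adds nothing the second time, so the button is safe to press repeatedly.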
Derived fields and overrides.
Gender starts as a heuristic (counting gendered pronouns and role
words in the biography — he/his/chairman
vs she/her/chairwoman) with a confidence score. When
Wikidata returns a P21 value during a refresh or newcomer
catch-up, it overwrites the heuristic and bumps confidence to 1.0.
Birth year is extracted from milestones[0] if it
mentions "born", otherwise computed from age. Wikipedia thumbnails
are filtered to drop coats-of-arms, company logos, and SVG-only
images that show up as P18-less fallbacks.
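A minimal sketch of the gender heuristic and the Wikidata override. The word lists use only the examples named above; the confidence formula (share of gendered words on the winning side) and both function names are assumptions.

```python
import re

MALE_WORDS = {"he", "his", "chairman"}
FEMALE_WORDS = {"she", "her", "chairwoman"}


def guess_gender(biography):
    """Return (gender, confidence) from gendered-word counts, or (None, 0.0) on a tie."""
    words = re.findall(r"[a-z]+", biography.lower())
    male = sum(word in MALE_WORDS for word in words)
    female = sum(word in FEMALE_WORDS for word in words)
    if male == female:
        return None, 0.0
    gender = "male" if male > female else "female"
    return gender, max(male, female) / (male + female)


def apply_wikidata(heuristic, p21_value):
    """A P21 value from Wikidata replaces the heuristic with confidence 1.0."""
    if p21_value is not None:
        return p21_value, 1.0
    return heuristic


guess = guess_gender("She founded the company; her brother is chairman.")
final = apply_wikidata(guess, "female")
```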
Six degrees + path-finder. The combined graph (family + entity + entity↔entity + holdings) is walked with BFS for the path-finder; six-degrees stats sample random pairs and report the average/median/max hops between any two billionaires.
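The path-finder's BFS can be sketched on a plain adjacency dict. The graph below is a made-up stand-in for the combined family/entity/holdings graph, and shortest_path is a hypothetical name.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parent:
                continue
            parent[neighbor] = node
            if neighbor == goal:
                # Walk parent links back to the start, then reverse.
                path = [neighbor]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Made-up undirected graph: people linked through a shared entity.
graph = {
    "alice": ["acme"],
    "acme": ["alice", "bob"],
    "bob": ["acme", "carol"],
    "carol": ["bob"],
    "dave": [],
}
path = shortest_path(graph, "alice", "carol")
unreachable = shortest_path(graph, "alice", "dave")
```

Because BFS explores in order of hop count, the first time the goal is reached is guaranteed to be along a shortest path; the six-degrees stats would call the same routine on sampled pairs and aggregate the path lengths.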
Stack. FastAPI + Uvicorn (single port serves API and static frontend), SQLite × 2, APScheduler for cron + post-scrape one-shot jobs, curl_cffi for fetching, Wikidata SPARQL + Wikipedia REST + GDELT 2.0 Doc API for enrichment, Alpine.js + Chart.js + a custom force-directed canvas renderer for the UI. Pytest for the test suite. The whole thing fits in one Docker container with a mounted volume for the databases.