Bloomberg Scraper

Total Wealth

Billionaires

Snapshots

Daily History

people · since

Run "Backfill History" in the Scraper tab to populate

Top countries

Top industries

Cohort movement

Aggregate wealth by age (sum across all billionaires)

Smoothing

Wealth concentration (snapshot history)

For the nerds — how the data is gathered

The short version

Bloomberg publishes a public ranking of the 500 wealthiest people at bloomberg.com/billionaires/, plus a profile page per person. A scheduler hits the list page on a cron and writes the current numbers to a local SQLite database. A separate one-shot backfill walks every profile page and pulls each person's full daily net-worth history (some go back to 2012). After backfill, the daily scrape continues to extend that history by one row per person per day. A second enrichment pass cross-references each person against Wikidata to map family ties, shared employers/schools/boards, overlapping public-equity holdings, and to pull authoritative metadata (gender, photo, biography, signature). A third enrichment pass mines each person's Wikipedia article citations for dated news references (IPOs, lawsuits, deaths, acquisitions cited in their bio), so each profile carries a multi-decade news timeline without needing a paid news API. A fourth pass imports a public Kaggle dataset of Forbes annual rankings 2001-2024 so the time-travel UI can reconstruct any year's ranking even before Bloomberg started tracking. Two SQLite files are the source of truth: bloomberg.db for the scrape data, the news, and the historical Forbes rankings; network.db for everything Wikidata-derived.

The technical version

Sources. Both Bloomberg pages embed JSON inside <script> tags as JavaScript globals — window.top500 on the list page (an array of 500 person records with rank, net worth, asset breakdowns, biography, milestones, schools, etc.) and window.profileData on each profile page (the same record plus a stats array of [date, net_worth_usd] pairs). Wikidata is queried via its public SPARQL endpoint at query.wikidata.org plus the wbsearchentities REST API for QID resolution; Wikipedia's REST page summary endpoint supplies thumbnail fallbacks when a person has no P18 image. News comes from two sources: the Wikipedia action=parse API (raw wikitext for citation extraction, rate limit ~1 req/sec) and the GDELT 2.0 Doc API at api.gdeltproject.org/api/v2/doc/doc for recent news (free, keyless, ≥5s between requests, English filter applied). Historical Forbes rankings 2001-2024 come from the guillemservera/forbes-billionaires-1997-2023 Kaggle dataset (CC0-1.0, 34k rows of monthly snapshots that we dedupe to one row per (year, rank) by max net worth). A small Wikipedia scraper handles years past the Kaggle freeze.

Fetching. Bloomberg fingerprints non-browser TLS clients and serves a captcha page. curl_cffi impersonates Chrome's TLS handshake and HTTP/2 settings, so the responses come back as real HTML. A regex pulls the JSON literal out of the HTML and json.loads parses it. Wikidata calls are batched (typically 80 QIDs per SPARQL request) with retry-on-429 falling back to smaller chunks.
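The extraction step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name extract_js_global is hypothetical, and instead of capturing the whole literal with one regex it uses a regex only to find the assignment and then lets json.JSONDecoder.raw_decode parse a single JSON value from that offset, which avoids guessing where the literal ends.

```python
import json
import re


def extract_js_global(html, name):
    """Return the JSON value assigned to window.<name> in html, or None."""
    # Find the start of the assignment, e.g. `window.top500 = `.
    match = re.search(r"window\." + re.escape(name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows (the trailing `;` and the rest of the page).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Tiny stand-in for a real page; the record fields are made up.
sample = '<script>window.top500 = [{"rank": 1, "name": "A; B"}];</script>'
people = extract_js_global(sample, "top500")
```

Because the decoder stops at the end of the value, semicolons or brackets inside string fields do not confuse it.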

Storage. Two SQLite databases, several tables each. From bloomberg.db:

persons — static-ish fields (name, citizenship, biography, slug, gender). Upserted on person_id.
snapshots — append-only time-series of rich data (rank, net worth, last/ytd change, public/private/cash asset totals + JSON breakdowns, liabilities) — one row per scrape per person.
wealth_history — narrow daily series, primary key (person_id, date). Idempotent: backfilling and the daily scrape both INSERT OR REPLACE into it.
news_articles — one row per (person_id, url) citation. Carries article_date, date_precision ('day' / 'month' / 'year'), title, source, and an importance integer (see below). Sister table news_fetched tracks when each person was last hit and whether the historical backfill has run; news_co_occurrence stores per-URL bookkeeping so cross-person rescoring is idempotent.
historical_rankings — annual Forbes lists, primary key (source, year, rank). Stores person_id (NULLable for names not in our Bloomberg set), name, net_worth_usd, citizenship, age, industry, notes. Sources include forbes_kaggle (Kaggle dataset 2001-2024) and forbes_world (Wikipedia scrape, sparser, kept as fallback for years past the Kaggle freeze).
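To make the idempotency of wealth_history concrete, here is a small sketch using Python's built-in sqlite3 module. The composite primary key and INSERT OR REPLACE come from the description above; the column types, the net_worth column name, and the upsert_history helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE wealth_history (
        person_id TEXT NOT NULL,
        date      TEXT NOT NULL,   -- ISO date such as 2024-05-01 (assumed)
        net_worth REAL,            -- column name assumed
        PRIMARY KEY (person_id, date)
    )
    """
)


def upsert_history(conn, rows):
    """Write (person_id, date, net_worth) rows; repeating a write is harmless."""
    conn.executemany(
        "INSERT OR REPLACE INTO wealth_history (person_id, date, net_worth) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


backfill = [("p1", "2024-05-01", 100e9), ("p1", "2024-05-02", 101e9)]
daily = [("p1", "2024-05-02", 101.5e9)]  # same key as a backfilled row
upsert_history(conn, backfill)
upsert_history(conn, daily)
upsert_history(conn, backfill[:1])       # repeated write, no duplicate row

row_count = conn.execute("SELECT COUNT(*) FROM wealth_history").fetchone()[0]
latest = conn.execute(
    "SELECT net_worth FROM wealth_history WHERE person_id = 'p1' AND date = '2024-05-02'"
).fetchone()[0]
```

With the primary key on (person_id, date), whichever job writes last for a given day wins, and the table never holds two rows for the same person and date.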

From network.db (Wikidata-derived, regenerated by the refresh job):

persons_index — mirror of persons with the resolved wikidata_qid, image_url, signature_filename, and a JSON wikidata_metadata blob (description, Wikipedia URL, birth/death dates and places, residence, occupations, languages, children count).
family_edges — undirected person↔person edges (spouse, parent, child, sibling, relative) derived from Wikidata properties P22/P25/P26/P40/P3373/P1038.
entities + entity_links — bridging organizations (employers, schools, boards, awards, positions held) from P108/P69/P463/P3320/P800/P166/P39, plus inverse properties. Only entities connecting ≥2 of our billionaires are kept, so every node is by definition a connector. Year-by-year event editions (e.g. "WEF Annual Meeting 2018") are collapsed to their parent series via P179.
entity_edges — entity↔entity relations (parent company, subsidiary, owned-by) so the graph can show structures like "Berkshire owns BNSF + GEICO + Apple" as a single spine. A second-tier pass adds up to 100 one-hop neighbors that bridge ≥2 first-tier entities.
holdings_bridges — public-equity tickers that appear in the asset breakdowns of ≥2 billionaires, linking them through their shared positions.
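The "≥2 billionaires" rule used for holdings_bridges (and, in the same spirit, for bridging entities) amounts to a group-and-filter. A minimal sketch, assuming the asset breakdowns have already been reduced to a list of tickers per person; the shared_holdings helper and the sample tickers are hypothetical.

```python
from collections import defaultdict


def shared_holdings(holdings_by_person):
    """Map ticker -> sorted holders, keeping tickers held by at least 2 people."""
    holders = defaultdict(set)
    for person_id, tickers in holdings_by_person.items():
        for ticker in tickers:
            holders[ticker].add(person_id)
    return {t: sorted(p) for t, p in holders.items() if len(p) >= 2}


sample = {"p1": ["AAA", "BBB"], "p2": ["AAA"], "p3": ["CCC", "AAA", "BBB"]}
bridges = shared_holdings(sample)
```

Here CCC is dropped because only one person holds it, so every surviving ticker links at least two people.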

Time travel. The "as of [date]" slider on the Table tab unions Bloomberg's daily wealth_history with the annual historical_rankings snapshots. For each person we take the most recent observation at-or-before the target date across both sources; Bloomberg wins when both have data for that year because its dates are exact while Forbes years collapse to YYYY-12-31. Within Forbes, forbes_kaggle is preferred over the sparser forbes_world Wikipedia scrape. Rank is computed in Python by sorting the resulting per-person rows by net worth, which lets the slider reach back to 2001 even though Bloomberg history only goes to 2012. The diff mode runs two as-of queries side by side and computes the entered, exited, and rank-delta sets in a single pass.
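One way to read the as-of rule is as a per-person maximum over the key (year, source priority, date). The sketch below follows that reading; the tuple layout, the as_of function, and the "bloomberg" source label are assumptions, while forbes_kaggle and forbes_world are the source names used in historical_rankings.

```python
from datetime import date

# Higher number wins when two sources cover the same year.
PRIORITY = {"bloomberg": 2, "forbes_kaggle": 1, "forbes_world": 0}


def as_of(observations, target):
    """observations: iterable of (person_id, source, obs_date, net_worth)."""
    best = {}
    for person_id, source, obs_date, net_worth in observations:
        if obs_date > target:
            continue  # only observations at or before the target date count
        key = (obs_date.year, PRIORITY[source], obs_date)
        if person_id not in best or key > best[person_id][0]:
            best[person_id] = (key, net_worth, source)
    # Rank is derived afterwards by sorting on net worth, largest first.
    ranked = sorted(best.items(), key=lambda kv: kv[1][1], reverse=True)
    return [
        (rank, pid, worth, source)
        for rank, (pid, (_key, worth, source)) in enumerate(ranked, start=1)
    ]


obs = [
    ("p1", "forbes_kaggle", date(2010, 12, 31), 50e9),
    ("p1", "forbes_world", date(2010, 12, 31), 49e9),
    ("p2", "bloomberg", date(2015, 3, 1), 70e9),
    ("p2", "forbes_kaggle", date(2015, 12, 31), 60e9),
    ("p2", "bloomberg", date(2016, 6, 1), 80e9),
]
table = as_of(obs, date(2016, 1, 15))
```

In this sample p2 resolves to the Bloomberg value from March 2015 rather than the Forbes row stamped 2015-12-31, and p1 falls back to the Kaggle row from 2010.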

News pipeline. News on each profile comes from two layers. Historical timeline: a one-time backfill fetches each person's Wikipedia article wikitext via the action=parse API and extracts every {{cite news|...}} / {{cite web|...}} template. URL, title, publisher domain, and date are pulled out of the citation parameters; date formats like "2018-03-12", "12 March 2018", "March 2018", and "2018" are all accepted but tagged with their precision so the chart can drop year/month-only fallbacks (those would land on a wealth value that wasn't real for that date). Recent coverage: a daily refresh hits GDELT for the last 30 days of each person's news, ~6s between calls to stay under their throttle. Articles dedupe on (person_id, url); the news card mixes both sources transparently.
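The precision tagging can be illustrated with a short parser for the four date shapes listed above. This is a sketch under assumptions: the parse_citation_date name is hypothetical, month-only and year-only values are padded to the first day of the period purely so they can be stored as a date, and month names are parsed with the default English locale.

```python
from datetime import datetime

# (strptime format, precision) pairs, tried in order from most to least precise.
_FORMATS = [
    ("%Y-%m-%d", "day"),
    ("%d %B %Y", "day"),
    ("%B %Y", "month"),
    ("%Y", "year"),
]


def parse_citation_date(raw):
    """Return (ISO date string, precision), or None if no format matches."""
    text = raw.strip()
    for fmt, precision in _FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return parsed.date().isoformat(), precision
    return None


examples = [parse_citation_date(s) for s in ("2018-03-12", "12 March 2018", "March 2018", "2018")]
```

Keeping the precision next to the padded date is what lets a chart skip anything that is not tagged 'day'.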

News importance scoring. Each article gets a heuristic score so the chart can surface the few genuinely interesting events instead of every passing mention. Components: a baseline of 4 for any Wikipedia citation (editorial inclusion is itself a signal); per-keyword bonuses on the title (dies, lawsuit, indicted, acquires, resigns, divorce, etc.); a +3 trusted-source bonus (Reuters, Bloomberg, WSJ, FT, NYT, BBC, AP, Forbes, …); a citation-density bonus capped at +6 for URLs cited multiple times within the same Wikipedia article; and a cross-person co-occurrence bonus of +2 per additional billionaire whose page cites the same URL, capped at +12. "Rich list" articles ("Forbes 400", "Australia's Richest 200") are detected by title pattern and excluded from keyword scoring, since they are rankings, not events. The chart pins markers at day-precision dates only and renders the top 3 per year visible in the current range. The news card shows everything ordered chronologically with year-tab filters.
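A rough sketch of how these components could combine. The baseline of 4, the +3 trusted-source bonus, the +6 density cap, and the +2 per person with a +12 cap come from the description above. Everything else is assumed for illustration: the per-keyword weight of 2, the +2 per repeat citation, the domain list, and the rich-list pattern.

```python
import re

BASELINE_WIKIPEDIA = 4
TRUSTED_BONUS = 3
# Assumed domains for the trusted sources named in the text.
TRUSTED_DOMAINS = {
    "reuters.com", "bloomberg.com", "wsj.com", "ft.com",
    "nytimes.com", "bbc.com", "apnews.com", "forbes.com",
}
# Assumed weights: the text lists the keywords but not their values.
KEYWORD_BONUS = {
    "dies": 2, "lawsuit": 2, "indicted": 2,
    "acquires": 2, "resigns": 2, "divorce": 2,
}
# Assumed title pattern for ranking-style articles.
RICH_LIST = re.compile(r"\bforbes 400\b|\brichest\b", re.IGNORECASE)


def importance(title, domain, from_wikipedia, times_cited=1, other_people=0):
    """Heuristic importance score for one article."""
    score = BASELINE_WIKIPEDIA if from_wikipedia else 0
    if not RICH_LIST.search(title):  # rich lists get no keyword bonus
        words = set(re.findall(r"[a-z]+", title.lower()))
        score += sum(bonus for kw, bonus in KEYWORD_BONUS.items() if kw in words)
    if domain in TRUSTED_DOMAINS:
        score += TRUSTED_BONUS
    score += min(6, 2 * max(0, times_cited - 1))  # citation density, capped
    score += min(12, 2 * max(0, other_people))    # cross-person co-occurrence, capped
    return score


obituary = importance("Tycoon dies at 90", "reuters.com", from_wikipedia=True)
rich_list = importance(
    "Forbes 400: the richest Americans", "forbes.com",
    from_wikipedia=True, times_cited=5, other_people=10,
)
```

The caps matter for rich-list pages in particular: they are cited on many billionaires' articles, so the co-occurrence bonus would otherwise dominate.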

Schedule. APScheduler runs cron-style jobs in-process. Each daily scrape pulls the list page, writes to persons and snapshots, and appends today's row to wealth_history. Three follow-up jobs fire automatically a few seconds later: a wealth-history backfill for any newcomer profile, a Wikidata catch-up that resolves QIDs for new persons and pulls their authoritative gender + photo + metadata, and a news refresh that pulls the last 30 days from GDELT for everyone in the latest snapshot (skipping anyone fetched within the past 20h so the queue rotates through all 500 across days). Visiting a profile that's never had news fetched also kicks off a one-shot fetch in the background; the UI polls and renders articles as soon as they arrive.
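The 20-hour skip rule for the news refresh is simple enough to show directly. A minimal sketch, assuming the last-fetch times from news_fetched have been loaded into a dict; the due_for_refresh helper and the None convention for "never fetched" are hypothetical.

```python
from datetime import datetime, timedelta


def due_for_refresh(last_fetched, now, min_age=timedelta(hours=20)):
    """Return person_ids never fetched or last fetched at least min_age ago."""
    return [
        person_id
        for person_id, fetched_at in last_fetched.items()
        if fetched_at is None or now - fetched_at >= min_age
    ]


now = datetime(2024, 5, 2, 12, 0)
last_fetched = {
    "p1": datetime(2024, 5, 2, 3, 0),   # 9h ago: skipped
    "p2": datetime(2024, 5, 1, 15, 0),  # 21h ago: due
    "p3": None,                         # never fetched: due
}
queue = due_for_refresh(last_fetched, now)
```

A threshold a little under 24h means a person fetched late in yesterday's run is still picked up today.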

Network refresh. A separate manual job rebuilds the entire Wikidata graph end-to-end: QID resolution → family relations → person metadata + photos → bridging entities → entity↔entity edges → second-tier bridges → shared holdings. It's idempotent and safe to re-run.

Backfill. A background thread iterates over every person, fetches their profile, parses stats, and bulk-inserts the rows into wealth_history. A 1.5s sleep between requests keeps it polite: ~13 minutes for all 500. A "Sync from Snapshots" button copies any (person_id, date, net_worth) row that exists in snapshots but not yet in wealth_history, catching the small gap between deploy and first backfill.
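The "Sync from Snapshots" step can be expressed as a single INSERT OR IGNORE ... SELECT, which copies only keys that wealth_history does not have yet. This is a sketch with simplified, assumed schemas (the real snapshots table carries many more columns), and sync_from_snapshots is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE snapshots (person_id TEXT, date TEXT, net_worth REAL);
    CREATE TABLE wealth_history (
        person_id TEXT, date TEXT, net_worth REAL,
        PRIMARY KEY (person_id, date)
    );
    INSERT INTO snapshots VALUES
        ('p1', '2024-05-01', 100.0),
        ('p1', '2024-05-02', 101.0),
        ('p2', '2024-05-02', 50.0);
    INSERT INTO wealth_history VALUES ('p1', '2024-05-02', 101.0);
    """
)


def sync_from_snapshots(conn):
    """Copy missing (person_id, date) rows; return how many were added."""
    before = conn.execute("SELECT COUNT(*) FROM wealth_history").fetchone()[0]
    # OR IGNORE leaves rows whose primary key already exists untouched.
    conn.execute(
        "INSERT OR IGNORE INTO wealth_history (person_id, date, net_worth) "
        "SELECT person_id, date, net_worth FROM snapshots"
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM wealth_history").fetchone()[0]
    return after - before


first_run = sync_from_snapshots(conn)
second_run = sync_from_snapshots(conn)
```

Using OR IGNORE rather than OR REPLACE here keeps any value the backfill already wrote for that day.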

Derived fields and overrides. Gender starts as a heuristic (counting gendered pronouns and role words in the biography — he/his/chairman vs she/her/chairwoman) with a confidence score. When Wikidata returns a P21 value during a refresh or newcomer catch-up, it overwrites the heuristic and bumps confidence to 1.0. Birth year is extracted from milestones[0] if it mentions "born", otherwise computed from age. Wikipedia thumbnails are filtered to drop coats-of-arms, company logos, and SVG-only images that show up as P18-less fallbacks.
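The gender heuristic and its Wikidata override might look something like this. The word lists are the ones named above; the confidence formula (share of the winning side among all gendered words), the "unknown" label, and both function names are assumptions.

```python
import re

MALE_WORDS = {"he", "his", "chairman"}
FEMALE_WORDS = {"she", "her", "chairwoman"}


def guess_gender(biography):
    """Return (label, confidence) from counts of gendered words."""
    words = re.findall(r"[a-z]+", biography.lower())
    male = sum(w in MALE_WORDS for w in words)
    female = sum(w in FEMALE_WORDS for w in words)
    total = male + female
    if total == 0 or male == female:
        return "unknown", 0.0
    label = "male" if male > female else "female"
    return label, max(male, female) / total


def resolve_gender(biography, wikidata_gender=None):
    """A Wikidata P21 value, when present, replaces the heuristic outright."""
    if wikidata_gender is not None:
        return wikidata_gender, 1.0
    return guess_gender(biography)


label, confidence = guess_gender("She founded the firm with her brother; he left in 2010.")
```

In the sample sentence two female words outweigh one male word, so the guess is "female" with a confidence of two thirds.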

Six degrees + path-finder. The combined graph (family + entity + entity↔entity + holdings) is walked with BFS for the path-finder; six-degrees stats sample random pairs and report the average/median/max hops between any two billionaires.
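For reference, a breadth-first path search over an undirected adjacency map looks like the sketch below. The graph shape (a dict from node to neighbor list, with people and entities as plain strings) and the shortest_path name are assumptions, not the project's actual data model.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return the shortest node list from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                # Walk the parent links back to the start.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


graph = {
    "alice": ["acme"],
    "acme": ["alice", "bob"],
    "bob": ["acme", "carol"],
    "carol": ["bob"],
    "dave": [],
}
path = shortest_path(graph, "alice", "carol")
hops = len(path) - 1
```

Because BFS explores in order of distance, the first time the goal is reached the path is a shortest one; the hop count is the path length minus one, and unreachable pairs return None.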

Stack. FastAPI + Uvicorn (single port serves API and static frontend), SQLite × 2, APScheduler for cron + post-scrape one-shot jobs, curl_cffi for fetching, Wikidata SPARQL + Wikipedia REST + GDELT 2.0 Doc API for enrichment, Alpine.js + Chart.js + a custom force-directed canvas renderer for the UI. Pytest for the test suite. The whole thing fits in one Docker container with a mounted volume for the databases.

Headline change:

Top:

Quick jump:

From

Compare span:

entered top 100

exited top 100

total gained (top 10)

total lost (top 10)

Entered the top 100

No new entrants.

Exited the top 100

No drop-offs.

Biggest gainers

Biggest losers

Top gainers today

Top losers today

Insights

Year range: →

Top 10 over time?

Color = industry (click to open market view):

Number of billionaires over time?

Click a country/industry line to filter every chart on this tab.

Inequality within the list?

Gini coefficient (purple) and top-10 share (orange). A higher Gini means wealth is spread less evenly across the list. Top-10 share falls as the list grows.

Class of ?

Pairwise wealth correlation (top , days)?

Quick presets:

Strongest pairs

High correlation often signals shared underlying assets — co-founders, family heirs, same stock.

Geographic migration?

Top cross-border flows

Top residence countries

Compare wealth over time?

Quick:

Wealth concentration over time?

Wealth by industry?

Click a bar → market deep-dive

Wealth by country?

Click a bar → market deep-dive

Gender distribution?

Age distribution?

Network graph

Family ties plus shared employers, schools, founded companies, and boards among Bloomberg's 500 — sourced from Wikidata. Entities only appear when they bridge ≥ 2 billionaires.

Hide isolated

Find a path

→

No edges yet. Click "Refresh from Wikidata" — first run takes ~5 minutes (one wbsearchentities call per person, then batched SPARQL queries for family relations, entity relations, and entity labels).

spouse parent / child sibling relative company school board / org shared holding private co. political party

Live scrape

Bloomberg scrape

Pulls bloomberg.com/billionaires now. Auto-runs after the scrape: history backfill for newcomers, Wikidata QID resolution, and recent-news refresh.

Bootstrap (first-time / catch-up)

Run everything once

Sequentially runs every data-load job needed to bring an empty deployment fully online: Forbes Kaggle → Forbes Wikipedia → Wikidata + family network → GDELT news → Wikipedia news backfill → history sync. Bloomberg live scrape is excluded (it's on the schedule). Each step is idempotent — safe to re-run, fills any gaps.

Step / :

Finished .

Historical backfills

Bloomberg wealth history

Pulls each profile's full daily net-worth time series. ~13 minutes total at the polite 1.5s/profile rate.

— coverage: people, rows

Wikipedia news backfill

Mines each person's Wikipedia article for `{{cite news}}` references. Builds a multi-decade news timeline (2000s–today).

Recent news refresh

Pulls the past 30 days of news for every person via GDELT. Auto-runs after each scrape; this button forces an extra pass now. ~5–7s/person from a cold cache, so a full 500-person pass takes 40–60 minutes.

Forbes 1997–2024 (Kaggle)

Downloads the guillemservera/forbes-billionaires-1997-2023 dataset (CC0) and imports the annual rankings — adds 30k+ historical rows that power the time-travel slider back to 2001.

Forbes via Wikipedia (fallback)

Scrapes the per-year Wikipedia "World's Billionaires" tables. Sparse and brittle — kept for years past the Kaggle 2024 freeze.

Wikidata network refresh

Rebuilds the family + entity + holdings graph from Wikidata SPARQL. Heavy: takes 10–30 minutes; only run when the daily newcomer-catchup isn't enough.

Schedule

Daily scrape times

Schedule is paused. Enable to resume automatic scraping.

Timezone:

Recent Runs

Time	Status	Records	Duration

Cohort movement

New entrants

Dropped out

Network graph

↔

Most connected people

Top bridging entities

Top schools

Top dynasties by wealth

Largest family cluster

Scraper access

Download Data

Identity

Demographics

Financial

Assets & Liabilities

Personal Info

Metadata

Wealth History

Download Databases

Net worth history

About

Biography

Overview

Family

Connections

Education

Milestones

News coverage

Assets

Public Holdings

Private Holdings

Facts