A Model Context Protocol (MCP) server that exposes the IPUMS API as tools for LLM clients. Supports IPUMS microdata collections (USA, CPS, IPUMSI, etc.) and NHGIS aggregate/GIS data. Includes tools for browsing metadata, submitting and downloading extracts, generating reproducible R/Python code, and producing formatted citations.
- Node.js 18+
- An IPUMS API key
- Any MCP-compatible client (Claude Desktop, Claude Code, Cursor, Windsurf, etc.)
1. Clone and build

```bash
git clone https://github.com/hesscl/ipums-mcp
cd ipums-mcp
npm install
npm run build
```

2. Configure your MCP client
All MCP clients use the same server entry point — only the config file location differs. Set `IPUMS_API_KEY` as an environment variable; never hard-code it.

The server command is always:

```
node /absolute/path/to/ipums-mcp/dist/index.js
```
Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows):

```json
{
  "mcpServers": {
    "ipums": {
      "command": "node",
      "args": ["/absolute/path/to/ipums-mcp/dist/index.js"],
      "env": { "IPUMS_API_KEY": "your-key-here" }
    }
  }
}
```

Restart Claude Desktop after saving.
```bash
claude mcp add ipums node /absolute/path/to/ipums-mcp/dist/index.js \
  -e IPUMS_API_KEY=your-key-here
```

Or add manually to `.claude/settings.json`:

```json
{
  "mcpServers": {
    "ipums": {
      "command": "node",
      "args": ["/absolute/path/to/ipums-mcp/dist/index.js"],
      "env": { "IPUMS_API_KEY": "your-key-here" }
    }
  }
}
```

Add to your editor's MCP settings (typically under Settings → MCP or a `mcp.json` file):
```json
{
  "mcpServers": {
    "ipums": {
      "command": "node",
      "args": ["/absolute/path/to/ipums-mcp/dist/index.js"],
      "env": { "IPUMS_API_KEY": "your-key-here" }
    }
  }
}
```

Any client that supports the MCP stdio transport works. Provide:
- Command: `node`
- Args: `["/absolute/path/to/ipums-mcp/dist/index.js"]`
- Env: `IPUMS_API_KEY=your-key-here`
| Tool | Description |
|---|---|
| `microdata_search_variables` | Search the offline variable database by name, label, group, or sample (no API key needed) |
| `microdata_list_samples` | List available samples for a collection (use to find sample IDs by year/survey) |
| `microdata_list_extracts` | List recent extracts for a collection |
| `microdata_get_extract` | Get status and download links for an extract |
| `microdata_create_extract` | Submit a new extract request |
| `microdata_wait_for_extract` | Poll until an extract completes, with automatic bail-out for large extracts |
| `microdata_download_extract` | Download completed extract files to disk, with SHA-256 verification |
| `microdata_extract_to_code` | Generate reproducible R (ipumsr) or Python (ipumspy) code for an extract |
`microdata_search_variables` queries a local database of 827 harmonized IPUMS USA variables scraped from usa.ipums.org — no API key or network request needed.
Each result includes:
- `name` — variable mnemonic (e.g. `INCTOT`)
- `label` — short description (e.g. `"Total personal income"`)
- `type` — `"H"` (household-level) or `"P"` (person-level)
- `groups` — thematic category (e.g. `"Person: Income"`)
- `samples` — representative IPUMS USA sample IDs where the variable is available (e.g. `["us2024a", "us2023a", ..., "us1850a"]`)
Filter by any combination of query, type, group, and sample:
```js
microdata_search_variables({ query: "income", type: "P", sample: "us2024a" })
microdata_search_variables({ group: "Education" })
microdata_search_variables({ query: "veteran", sample: "us1990a" })
```
The `samples` field lists the representative display columns from usa.ipums.org — a subset of all available samples. Cross-reference with `microdata_list_samples` for the complete list.
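To illustrate the matching behavior described above, here is a minimal Python sketch of client-side filtering over records with the same field shape. The records and the `search_variables` helper are hypothetical; the server's actual implementation lives in TypeScript.

```python
# Tiny stand-in for the offline variable database (real DB has 827 records).
VARIABLES = [
    {"name": "INCTOT", "label": "Total personal income", "type": "P",
     "groups": ["Person: Income"], "samples": ["us2024a", "us2023a"]},
    {"name": "VETSTAT", "label": "Veteran status", "type": "P",
     "groups": ["Person: Work"], "samples": ["us2024a", "us1990a"]},
]

def search_variables(query=None, type=None, group=None, sample=None):
    """Match on any combination of filters, case-insensitively."""
    results = []
    for var in VARIABLES:
        if query and query.lower() not in (var["name"] + " " + var["label"]).lower():
            continue
        if type and var["type"] != type:
            continue
        if group and not any(group.lower() in g.lower() for g in var["groups"]):
            continue
        if sample and sample not in var["samples"]:
            continue
        results.append(var)
    return results

print([v["name"] for v in search_variables(query="income", type="P", sample="us2024a")])
# → ['INCTOT']
```

Because everything is in memory, combining filters narrows rather than re-queries — which is why no API key or network round trip is needed.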
Samples and variables use array syntax:

```json
{
  "collection": "usa",
  "dataFormat": "csv",
  "dataStructure": { "rectangular": { "on": "P" } },
  "description": "2022 ACS veterans",
  "samples": [{ "id": "us2022a" }],
  "variables": [
    { "name": "VETSTAT" },
    { "name": "AGE" },
    { "name": "SEX" },
    { "name": "STATEFIP" }
  ]
}
```

The server converts these to the keyed-object format the IPUMS API v2 requires.
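The conversion is simple to sketch. Here is a hypothetical Python version; the real server does this in TypeScript, and per-item options can carry more fields than shown:

```python
def to_keyed(items):
    """Convert [{"id": ...}] / [{"name": ...}] arrays to API-v2-style keyed objects."""
    keyed = {}
    for item in items:
        item = dict(item)  # don't mutate the caller's dicts
        key = item.pop("id", None) or item.pop("name")
        keyed[key] = item  # any remaining fields become per-item options
    return keyed

body = {
    "samples": to_keyed([{"id": "us2022a"}]),
    "variables": to_keyed([{"name": "VETSTAT"}, {"name": "AGE"}]),
}
print(body)
# → {'samples': {'us2022a': {}}, 'variables': {'VETSTAT': {}, 'AGE': {}}}
```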
The response includes a size advisory based on samples × variables to help decide whether to poll immediately or check back later:
| Complexity | Advisory |
|---|---|
| ≤ 200 | Small — `microdata_wait_for_extract` is fine (1–3 min) |
| ≤ 2,000 | Medium — consider asking the user (3–15 min) |
| > 2,000 | Large — recommend checking back with `microdata_get_extract` (20–60+ min) |
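The tiers above can be sketched as a small function. The thresholds follow the table; the server's exact advisory wording may differ:

```python
def size_advisory(n_samples: int, n_variables: int) -> str:
    """Classify an extract by samples × variables complexity."""
    complexity = n_samples * n_variables
    if complexity <= 200:
        return "small"    # wait_for_extract is fine (1–3 min)
    if complexity <= 2000:
        return "medium"   # consider asking the user (3–15 min)
    return "large"        # check back later with get_extract (20–60+ min)

print(size_advisory(1, 4))     # 1 sample × 4 variables → "small"
print(size_advisory(30, 100))  # 3,000 → "large"
```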
`microdata_wait_for_extract` defaults to a 90-second timeout. If the extract is still processing when the timeout is reached, the tool returns early with:
```json
{
  "status": "still_processing",
  "message": "Extract 42 is still queued after 90s (6 check(s)). Ask the user if they would like to wait longer, or tell them to check back later using: microdata_get_extract({ collection: \"usa\", extractNumber: 42 })",
  "extract": { ... }
}
```

This keeps the conversation responsive — the user gets the extract number and can ask to resume polling or check back manually at their convenience. Pass a higher `timeoutSeconds` only if the user explicitly agrees to wait.
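The poll-then-bail pattern can be sketched as follows. `check_status` here is a hypothetical stand-in for the API status call; the real tool returns the full extract object alongside the status:

```python
import time

def wait_for_extract(check_status, timeout_seconds=90, interval=15):
    """Poll until done/failed, or bail out gracefully at the deadline."""
    deadline = time.monotonic() + timeout_seconds
    checks = 0
    while time.monotonic() < deadline:
        checks += 1
        status = check_status()
        if status in ("completed", "failed"):
            return {"status": status, "checks": checks}
        # sleep, but never past the deadline
        time.sleep(min(interval, max(0, deadline - time.monotonic())))
    return {"status": "still_processing", "checks": checks}

statuses = iter(["queued", "started", "completed"])
print(wait_for_extract(lambda: next(statuses), timeout_seconds=5, interval=0))
# → {'status': 'completed', 'checks': 3}
```

Returning a structured "still_processing" result instead of raising keeps the conversation moving: the caller decides whether to wait longer.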
```json
{
  "collection": "usa",
  "extractNumber": 42,
  "outputDir": "/data/ipums",
  "fileTypes": ["data", "ddiCodebook"]
}
```

Available `fileTypes`: `data`, `ddiCodebook`, `basicCodebook`, `rCommandFile`, `spssCommandFile`, `stataCommandFile`, `stsCommandFile`, `sasCommandFile`.
Note: The IPUMS API returns `stsCommandFile` (not `stataCommandFile`) for Stata syntax files in some collections. Include both if unsure.
Returns `{ downloaded: [...], errors: [...] }`. Each entry includes `localPath`, `bytes`, and `sha256Verified`.
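The verification step is standard streaming SHA-256 hashing. A Python sketch, where the expected digest would come from the download metadata the API returns:

```python
import hashlib

def verify_sha256(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks (avoids loading large extracts into memory)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex
```

A `False` result maps to a reported mismatch in `errors`; the file itself is kept for inspection, as noted in the security section below.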
| Tool | Description |
|---|---|
| `nhgis_search_datasets` | Search all datasets by keyword (name, census group, description) |
| `nhgis_search_data_tables` | Search data tables by keyword, optionally scoped to a dataset |
| `nhgis_search_time_series_tables` | Search all time series tables by keyword |
| `nhgis_list_datasets` | Browse available datasets (paginated) |
| `nhgis_get_dataset` | Get tables and geographic levels for a dataset |
| `nhgis_list_data_tables` | List all data tables (paginated) |
| `nhgis_get_data_table` | Get variables in a data table |
| `nhgis_list_time_series_tables` | List time series tables (paginated) |
| `nhgis_get_time_series_table` | Get detail for a time series table |
| `nhgis_list_shapefiles` | List available shapefiles |
| `nhgis_list_extracts` | List recent NHGIS extracts |
| `nhgis_get_extract` | Get NHGIS extract status and download links |
| `nhgis_create_extract` | Submit a new NHGIS extract |
| `nhgis_extract_to_code` | Generate reproducible R (ipumsr) or Python (ipumspy) code for an NHGIS extract |
Three tools enable natural-language discovery of NHGIS data, complementing the `nhgis_list_*` and `nhgis_get_*` tools:

- `nhgis_search_datasets` — fetches all ~253 datasets in one call and returns those matching the keyword in their name, group (e.g. "2020 Census", "2019 ACS"), or description. Best first step when you know a topic or census year.
- `nhgis_search_time_series_tables` — fetches all ~400 time series tables (which span multiple census years with consistent geographic definitions) and filters by keyword in name or description. Ideal for longitudinal analysis.
- `nhgis_search_data_tables` — searches source data tables by keyword against description and universe:
  - With the `dataset` parameter: fetches all tables for that dataset comprehensively.
  - Without `dataset`: scans the first 2,500 tables across all datasets (limited coverage — see the note in the response).
Recommended workflow:

```
nhgis_search_datasets("poverty")
  → pick a dataset (e.g. "2019_ACS5a")
nhgis_get_dataset("2019_ACS5a")
  → confirm available tables and geographic levels
nhgis_search_data_tables("poverty", dataset="2019_ACS5a")
  → pick a table (e.g. "B17001")
nhgis_create_extract(...)
nhgis_extract_to_code(...)
  → generate R/Python code to reproduce the extract
```
The NHGIS public API does not support server-side filtering. The search tools fetch full metadata pages and filter client-side.
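A minimal Python sketch of this fetch-all-then-filter approach, using hypothetical metadata pages in place of real API responses:

```python
def search_metadata(pages, keyword):
    """Filter records whose name/group/description mention the keyword."""
    kw = keyword.lower()
    hits = []
    for page in pages:  # each page is one paginated metadata response
        for record in page:
            haystack = " ".join(str(record.get(f, "")) for f in ("name", "group", "description"))
            if kw in haystack.lower():
                hits.append(record)
    return hits

pages = [
    [{"name": "2019_ACS5a", "group": "2019 ACS", "description": "ACS 5-year data"}],
    [{"name": "2020_SF1", "group": "2020 Census", "description": "Summary File 1"}],
]
print([r["name"] for r in search_metadata(pages, "acs")])
# → ['2019_ACS5a']
```

The trade-off is bandwidth for flexibility: every search pays the cost of fetching the metadata pages, but any substring of any field becomes searchable.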
Three tools close the loop between "I pulled this data via AI" and "here's the exact code to pull it again."
| Tool | Description |
|---|---|
| `microdata_extract_to_code` | Generate R or Python code to reproduce a microdata extract |
| `nhgis_extract_to_code` | Generate R or Python code to reproduce an NHGIS extract |
| `generate_ipums_citation` | Return a formatted citation template with authors, DOI, and official citation link |
After submitting or retrieving an extract, call `microdata_extract_to_code` (or `nhgis_extract_to_code`) to get ready-to-run code:

```js
microdata_extract_to_code({ collection: "usa", extractNumber: 42, language: "both" })
```
R output (ipumsr):

```r
library(ipumsr)

extract <- define_extract_usa(
  samples = c("us2022a"),
  variables = c("VETSTAT", "AGE", "SEX", "STATEFIP"),
  data_format = "csv",
  data_structure = "rectangular"
)
extract <- submit_extract(extract)
extract <- wait_for_extract(extract)
path <- download_extract(extract)
ddi <- read_ipums_ddi(path$ddi)
data <- read_ipums_micro(ddi)
```

Python output (ipumspy):

```python
from ipumspy import IpumsApiClient, UsaExtract
import os

client = IpumsApiClient(api_key=os.environ["IPUMS_API_KEY"])
extract = UsaExtract(
    samples=["us2022a"],
    variables=["VETSTAT", "AGE", "SEX", "STATEFIP"],
    data_format="csv"
)
extract = client.submit_extract(extract)
client.wait_for_extract(extract)
client.download_extract(extract, download_dir="./downloads")
```

The `language` parameter accepts `"r"`, `"python"`, or `"both"` (default). For variables with case selections or data quality flags, R output uses `var_spec()` automatically. NHGIS output uses `ds_spec()`/`tst_spec()` for datasets and time series tables, with camelCase API keys converted to snake_case for both languages.
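The camelCase-to-snake_case conversion mentioned above can be sketched in Python. `snake_case` and `convert_keys` are hypothetical helper names; the server applies the equivalent transformation when emitting generated code:

```python
import re

def snake_case(name: str) -> str:
    """Insert an underscore before each interior capital, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

def convert_keys(obj):
    """Recursively convert dict keys in a JSON-like structure."""
    if isinstance(obj, dict):
        return {snake_case(k): convert_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [convert_keys(v) for v in obj]
    return obj

print(convert_keys({"dataFormat": "csv", "dataStructure": {"rectangular": {"on": "P"}}}))
# → {'data_format': 'csv', 'data_structure': {'rectangular': {'on': 'P'}}}
```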
`generate_ipums_citation` returns the standard citation for any IPUMS collection — no API call needed:

```js
generate_ipums_citation({ collection: "nhgis" })
```

Returns the full author list, DOI, and a direct link to the official citation page, where the current version number is always listed. Replace `[YEAR]` and `VX` with the current year and version before publishing.
This server pairs naturally with a Jupyter MCP server to form a complete data pipeline — from browsing IPUMS metadata to live analysis in a notebook — without leaving your LLM client.
Step 1 — Browse and submit (via MCP tools in Claude) 🤖
1. `nhgis_search_datasets` → find datasets by keyword (topic, year, census program)
2. `nhgis_search_data_tables` → narrow to specific tables (scope to a dataset for full coverage)
3. `nhgis_get_dataset` → confirm tables and geo levels
4. `nhgis_create_extract` → submit the extract
5. `nhgis_get_extract` → poll until `status = "completed"`
6. `nhgis_extract_to_code` → generate R/Python code to reproduce the extract
Or for microdata:
1. `microdata_search_variables` → find variable names by keyword or group
2. `microdata_list_samples` → confirm the right sample ID ("2022 ACS" → `us2022a`)
3. `microdata_create_extract` → submit (the size advisory tells you whether to wait or check back)
4. `microdata_wait_for_extract` → poll up to 90 s; bail out gracefully for large extracts
5. `microdata_download_extract` → download data + DDI codebook
6. `microdata_extract_to_code` → generate R/Python code to reproduce the extract
💡 Variable discovery: Use `microdata_search_variables` to look up mnemonics from natural language — search `"income"`, `"veteran"`, `"race"` — rather than relying on memorized variable names. Filter by `sample` to confirm a variable is available in your target year. For NHGIS, use `nhgis_search_datasets` and `nhgis_search_data_tables` to find tables by keyword, then `nhgis_get_data_table` to see all variables within a table.
Step 2 — Analyze in Jupyter with ipumsr (R kernel) 📓
Once `microdata_download_extract` has saved files locally, use the Jupyter MCP server to run R in a notebook:
```r
library(ipumsr)

# Point at the DDI codebook downloaded by microdata_download_extract
ddi <- read_ipums_ddi("/data/ipums/usa_00042.xml")
data <- read_ipums_micro(ddi)

# Variable labels and value labels are attached automatically
head(data)
attr(data$VETSTAT, "labels")
```

For Python notebooks 🐍:

```python
import pandas as pd

# CSV data file downloaded by microdata_download_extract
df = pd.read_csv("/data/ipums/usa_00042.csv.gz")
df.head()
```

For NHGIS data (already CSV, no DDI needed):

```r
library(ipumsr)

# read_nhgis reads the CSV and attaches codebook metadata
tbl <- read_nhgis("/data/ipums/nhgis0007_csv.zip", data_layer = 1)
```

Add both servers to your MCP client config:
```json
{
  "mcpServers": {
    "ipums": {
      "command": "node",
      "args": ["/absolute/path/to/ipums-mcp/dist/index.js"],
      "env": { "IPUMS_API_KEY": "your-key-here" }
    },
    "jupyter": {
      "command": "uvx",
      "args": ["jupyter-mcp-server"],
      "env": { "JUPYTER_TOKEN": "your-jupyter-token" }
    }
  }
}
```

With both servers running you can prompt your LLM client: "Submit a 2022 ACS extract for VETSTAT, AGE, SEX, STATEFIP, download it when ready, then open a Jupyter notebook and plot veteran counts by state with ggplot2." 🎯
- API key: Pass via the `IPUMS_API_KEY` env var only. Never hard-code it or commit it to source control. For local development, store it in a `.env` file (gitignored).
- Download URL validation: `microdata_download_extract` validates that each download URL uses HTTPS and resolves to an `*.ipums.org` host before sending your API key. URLs that fail this check are skipped and reported in `errors`.
- Path traversal protection: Downloaded filenames are resolved against `outputDir` and rejected if the resolved path would escape that directory.
- SHA-256 verification: Every downloaded file is verified against the checksum provided by the API. A mismatch is reported in `errors` but the file is kept so you can inspect it.
- Error message sanitization: API error responses from 4xx errors are HTML-stripped and truncated before being surfaced; 5xx server error bodies are suppressed entirely.
- Generated code safety: R code emitted by `microdata_extract_to_code` and `nhgis_extract_to_code` escapes all string values to prevent code injection from API-returned identifiers.
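The path-containment check can be sketched in Python; the server performs the equivalent in TypeScript, and `safe_join` is a hypothetical name:

```python
import os

def safe_join(output_dir: str, filename: str) -> str:
    """Resolve filename under output_dir; reject anything that escapes it."""
    base = os.path.realpath(output_dir)
    target = os.path.realpath(os.path.join(base, filename))
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"refusing to write outside {output_dir}: {filename}")
    return target

print(safe_join("/data/ipums", "usa_00042.csv.gz"))
# safe_join("/data/ipums", "../../etc/passwd") raises ValueError
```

Resolving with `realpath` before comparing defeats both `..` segments and symlink tricks, which a plain string-prefix check would miss.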
Create a `.env` file in the project root (already gitignored):

```
IPUMS_API_KEY=your-key-here
```

Then run:

```bash
export $(cat .env | xargs) && npm run dev   # load key + run via tsx
npm run build                               # compile TypeScript → dist/
```

Source layout:
```
src/
  index.ts         entry point
  client.ts        shared HTTP client (ipumsRequest, getApiKey)
  types.ts         Zod schemas
  data/
    usa-variables.ts   offline DB — 827 harmonized IPUMS USA variables
  tools/
    microdata.ts   microdata tools (search vars, list, get, create, wait, download)
    nhgis.ts       NHGIS tools (search datasets/tables/TSTs + list/get/create/extract)
    codegen.ts     reproducibility tools (extract→code for R/Python, citation lookup)
```
MIT — © 2026 hesscl