TCGA-Tools is a Python package that provides a clean, modular interface for downloading and organizing datasets from the NCI Genomic Data Commons (GDC) portal. It allows you to fetch raw data (e.g., .svs diagnostic slides, sequencing data) together with directly usable annotations (clinical, molecular, diagnostic reports, etc.), and automatically groups data at the patient level for easy analysis.
- Simple one-liner to fetch project files (e.g., whole-slide images / .svs).
- Write analysis-ready CSVs with file metadata and patient grouping.
- Emit optional annotation CSVs: clinical (survival/outcomes/treatments), molecular (DNA/RNA/CNV/methylation), free-text reports, and diagnosis/subtype.
- Be resilient to missing or sparse fields across projects.
- Clean, modular architecture with explicit ports/adapters for GDC and TCIA.
- GDC and TCIA downloads built on top of the gdc-api-wrapper.
- Multi-dataset support (download one or multiple TCGA projects at once).
- Annotation options:
  - clinical: survival, treatment outcomes, patient metadata
  - molecular: genomic, transcriptomic, and methylation data
  - report: free-text pathology or clinical reports
  - diagnosis: tumor subtype and diagnostic information
  - all: fetch everything available
- Progress bars for downloads.
- Logging of all transformations for reproducibility.
- raw=True option for "dry runs" (inspect available data without downloading).
- Optional statistics and visualizations: class distributions, survival curves, annotation summaries.
pip install tcga-tools
uv pip install tcga-tools
pip install gdc-api-wrapper

git clone https://github.com/LUMCPathAI/TCGA-Tools.git
cd TCGA-Tools
pip install -e .

import tcga_tools as tt
tt.Download(
    dataset_name="TCGA-LUSC",
    filetypes=[".svs"],
    datatype=["WSI"],
    annotations=["clinical", "molecular", "report"],
    output_dir="./TCGA-LUSC",
    statistics=True,
    visualizations=True,
)

# Download multiple datasets
tt.Download(
    dataset_name=["TCGA-LUSC", "TCGA-LUAD", "TCGA-BRCA"],  # list of datasets
    filetypes=[".svs", ".maf"],  # multiple file types
    annotations="all",  # fetch everything
    output_dir="./TCGA",
)

Use the high-level portal to query pathology metadata and download slides from TCGA (GDC) and TCIA using clean, modular services.
from tcga_tools.pathology import PathologyDataPortal
from tcga_tools.services.tcia_pathology import TciaSeriesQuery
portal = PathologyDataPortal()
# --- TCIA: SOP Instance lookup and downloads ---
query = TciaSeriesQuery(series_instance_uid="uid.series.instance", format_="JSON")
sop_result = portal.list_tcia_sop_instance_uids(query)
portal.download_tcia_series(series_instance_uid="uid.series.instance", output_dir="./TCIA")
# --- TCGA: download pathology files via GDC wrapper ---
tcga_files = portal.download_tcga_project(
    project_id="TCGA-LUSC",
    filetypes=[".svs"],
    output_dir="./TCGA-LUSC",
)

TCIA endpoints supported via the wrapper:
- SOPInstanceUID lookup for a SeriesInstanceUID (sop_instance_uids)
- Single-image download for a SeriesInstanceUID + SOPInstanceUID
- Series download as a zip file
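To make the three endpoint types above concrete, here is a minimal sketch that only builds request URLs and performs no network calls. It assumes the wrapper ultimately talks to TCIA's public NBIA REST API; the base URL, the endpoint names (getSOPInstanceUIDs, getSingleImage, getImage), and the helper functions are assumptions for illustration and are not taken from TCGA-Tools or gdc-api-wrapper.

```python
from urllib.parse import urlencode

# Assumption: base URL of TCIA's public NBIA REST API. Check the TCIA
# documentation before relying on it; the wrapper may use a different one.
TCIA_BASE = "https://services.cancerimagingarchive.net/nbia-api/services/v1"


def tcia_url(endpoint, **params):
    """Build a query URL, skipping parameters that are None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{TCIA_BASE}/{endpoint}?{query}"


def sop_instance_uids_url(series_uid, format_="JSON"):
    """SOPInstanceUID lookup for one series (hypothetical helper)."""
    return tcia_url("getSOPInstanceUIDs", SeriesInstanceUID=series_uid, format=format_)


def single_image_url(series_uid, sop_uid):
    """Single-image download for a series + SOP instance (hypothetical helper)."""
    return tcia_url("getSingleImage", SeriesInstanceUID=series_uid, SOPInstanceUID=sop_uid)


def series_zip_url(series_uid):
    """Whole-series download as a zip file (hypothetical helper)."""
    return tcia_url("getImage", SeriesInstanceUID=series_uid)
```

In practice the portal methods shown earlier hide these details; the sketch is only meant to show which identifiers each endpoint type needs.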
import tcga_tools as tt
tt.Download(
    dataset_name=["TCGA-LUAD", "TCGA-LUSC"],
    filetypes=[".svs"],
    annotations=["clinical", "diagnosis"],
    output_dir="./TCGA-LUNG",
)

import tcga_tools as tt
tt.Download(
    dataset_name="TCGA-SKCM",
    filetypes=[".svs"],
    annotations=["molecular"],
    output_dir="./TCGA-SKCM",
)

from tcga_tools.pathology import PathologyDataPortal
from tcga_tools.services.tcia_pathology import TciaSeriesQuery
portal = PathologyDataPortal()
# Suppose you already have SeriesInstanceUIDs for a TCIA collection
series_uids = [
    "uid.series.instance.1",
    "uid.series.instance.2",
]
for series_uid in series_uids:
    sop_payload = portal.list_tcia_sop_instance_uids(
        TciaSeriesQuery(series_instance_uid=series_uid, format_="JSON")
    )
    portal.download_tcia_series(series_instance_uid=series_uid, output_dir="./TCIA-DATASET")

- Summary log of transformations and queries
- Distributions of diagnosis categories
- Survival curves based on clinical annotations
- Counts per file type and annotation
- data/ (downloads)
- files_metadata.csv (flattened file + case/sample fields)
- groups.csv (per-case: paired / tumor_only / normal_only)
- clinical.csv, molecular_index.csv, reports_index.csv, diagnosis.csv (if requested)
- gdc_manifest.tsv (for the GDC Data Transfer Tool)
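The per-case labels in groups.csv can be pictured with a small sketch. This is not the package's implementation: the column names case_id and sample_type, the rule that any sample type containing "normal" counts as normal tissue, and the extra "unknown" label for cases with no usable sample type are all assumptions made for illustration.

```python
from collections import defaultdict


def is_normal(sample_type):
    """Assumed rule: a sample type mentioning 'normal' is normal tissue."""
    return "normal" in sample_type.lower()


def group_cases(rows):
    """Map case_id -> 'paired', 'tumor_only', 'normal_only', or 'unknown'."""
    seen = defaultdict(lambda: {"tumor": False, "normal": False})
    for row in rows:
        flags = seen[row["case_id"]]  # registers the case even without a sample type
        sample_type = (row.get("sample_type") or "").strip()
        if not sample_type:
            continue  # missing sample type: nothing to learn from this row
        flags["normal" if is_normal(sample_type) else "tumor"] = True

    groups = {}
    for case_id, flags in seen.items():
        if flags["tumor"] and flags["normal"]:
            groups[case_id] = "paired"
        elif flags["tumor"]:
            groups[case_id] = "tumor_only"
        elif flags["normal"]:
            groups[case_id] = "normal_only"
        else:
            groups[case_id] = "unknown"  # hypothetical fallback label
    return groups
```

Registering a case before inspecting its sample type keeps cases with missing values in the output instead of silently dropping them.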
If you need controlled-access files, set an environment variable with your token:
export GDC_TOKEN="<your token>"

import tcga_tools as tt
tt.list_datasets()

Pass any subset of:
- "clinical": survival/clinical outcome/treatment effect (diagnoses, treatments, follow-ups, exposures)
- "molecular": DNA/RNA/CNV/Methylation file index
- "report": free-text/clinical/pathology reports (XML/PDF)
- "diagnosis": diagnostic subtype, morphology, stage/grade
- "all": everything above
GDC projects vary in completeness. TCGA-Tools is defensive:
- Field requests are broad; if the API rejects a field list (HTTP 400), TCGA-Tools retries without it to maximize returned content.
- JSON is flattened into wide CSVs; absent fields simply do not appear, or appear with empty values.
- Grouping logic remains robust even if sample types are missing.
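The first two defensive behaviours above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the package's code: FieldsRejected stands in for an HTTP 400 response, fetch is any caller-supplied function that performs the request, and the dotted column names produced by flatten are an assumed naming scheme.

```python
import csv
import io


class FieldsRejected(Exception):
    """Stand-in for an HTTP 400 caused by an unsupported field list."""


def fetch_with_fallback(fetch, fields):
    """Try a broad field list first; if rejected, retry with no field list."""
    try:
        return fetch(fields)
    except FieldsRejected:
        return fetch(None)


def flatten(record, prefix=""):
    """Flatten nested dicts/lists into dotted keys, e.g. 'cases.0.case_id'."""
    flat = {}
    items = record.items() if isinstance(record, dict) else enumerate(record)
    for key, value in items:
        name = f"{prefix}{key}"
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_csv(records):
    """Write records as one wide CSV; fields absent from a record stay empty."""
    rows = [flatten(r) for r in records]
    columns = sorted({col for row in rows for col in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Taking the union of all keys as the header means a sparse project simply yields fewer columns, while a record missing a field gets an empty cell rather than raising an error.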
python -m tcga_tools --dataset TCGA-LUSC --filetypes .svs \
  --annotations clinical molecular report diagnosis --out ./TCGA-LUSC

- Python ≥ 3.9
- Tested on Linux, macOS, Windows
- Dependencies are listed in pyproject.toml and installed automatically.
All downloads and transformations are logged to download.log in your output directory for reproducibility.
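A per-run log file of this kind can be set up with the standard logging module. The sketch below is one way to do it, not the package's own logger: the function name make_run_logger, the logger name, and the message format are hypothetical; only the file name download.log comes from the text above.

```python
import logging
import os
from pathlib import Path


def make_run_logger(output_dir, name="tcga_tools.sketch"):
    """Return a logger that appends to <output_dir>/download.log."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    log_path = os.path.abspath(out / "download.log")

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep run logs out of the root logger

    # Avoid stacking duplicate handlers when called twice for the same file.
    already_attached = any(
        isinstance(h, logging.FileHandler) and h.baseFilename == log_path
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `make_run_logger("./TCGA-LUSC").info("queried %d files", 42)` would append one timestamped line to ./TCGA-LUSC/download.log.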
Preview available data without downloading:
tt.Download(dataset_name="TCGA-LUSC", raw=True)

Run unit tests:
pytest tests/

- For very large downloads, prefer the emitted gdc_manifest.tsv with the GDC Data Transfer Tool.
- Extend config.py to add/modify field lists or filetype preferences as needed.
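For readers curious what a manifest for the GDC Data Transfer Tool looks like, the sketch below writes one from a list of file records. It assumes the usual tab-separated manifest columns (id, filename, md5, size, state); the function write_manifest and the input record shape are hypothetical, and the file TCGA-Tools emits may differ in detail.

```python
import csv

# Assumed column layout of a GDC manifest file.
MANIFEST_COLUMNS = ["id", "filename", "md5", "size", "state"]


def write_manifest(files, path):
    """Write file records to a tab-separated manifest, ignoring extra keys."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=MANIFEST_COLUMNS,
            delimiter="\t",
            extrasaction="ignore",  # drop keys that are not manifest columns
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(files)
```

A manifest like this is typically passed to the transfer tool with `gdc-client download -m gdc_manifest.tsv`, which handles resuming and parallel transfers better than a single long-running Python process.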
Apache 2.0: free for research and commercial use.
Contributions are welcome! Please open an issue or PR on GitHub.
