TCGA-Tools

TCGA-Tools is a Python package that provides a clean, modular interface for downloading and organizing datasets from the NCI Genomic Data Commons (GDC) portal. It allows you to fetch raw data (e.g., .svs diagnostic slides, sequencing data) together with directly usable annotations (clinical, molecular, diagnostic reports, etc.), and automatically groups data at the patient level for easy analysis.

Goals

- Simple one-liner to fetch project files (e.g., whole-slide images / .svs).
- Write analysis-ready CSVs with file metadata and patient grouping.
- Emit optional annotation CSVs: clinical (survival/outcomes/treatments), molecular (DNA/RNA/CNV/methylation), free-text reports, and diagnosis/subtype.
- Be resilient to missing or sparse fields across projects.

🚀 Features

- Clean, modular architecture with explicit ports/adapters for GDC and TCIA.
- GDC and TCIA downloads built on top of the gdc-api-wrapper.
- Multi-dataset support (download one or multiple TCGA projects at once).
- Annotation options:
  - clinical: survival, treatment outcomes, patient metadata
  - molecular: genomic, transcriptomic, and methylation data
  - report: free-text pathology or clinical reports
  - diagnosis: tumor subtype and diagnostic information
  - all: fetch everything available
- Progress bars for downloads.
- Logging of all transformations for reproducibility.
- raw=True option for “dry runs” (inspect available data without downloading).
- Optional statistics and visualizations: class distributions, survival curves, annotation summaries.

📦 Installation

From PyPI (pip)

pip install tcga-tools

From PyPI (uv)

uv pip install tcga-tools

Optional Pathology Dependencies (TCIA + GDC wrapper)

pip install gdc-api-wrapper

From Source

git clone https://github.com/LUMCPathAI/TCGA-Tools.git
cd TCGA-Tools
pip install -e .

Quickstart

import tcga_tools as tt

tt.Download(
    dataset_name="TCGA-LUSC",
    filetypes=[".svs"],
    datatype=["WSI"],
    annotations=["clinical", "molecular", "report"],
    output_dir="./TCGA-LUSC",
    statistics=True,
    visualizations=True
)

# Download multiple datasets
tt.Download(
    dataset_name=["TCGA-LUSC", "TCGA-LUAD", "TCGA-BRCA"],  # list of datasets
    filetypes=[".svs", ".maf"],                            # multiple file types
    annotations="all",                                     # fetch everything
    output_dir="./TCGA",
)

🧬 Pathology Portal (TCGA + TCIA)

Use the high-level portal to query pathology metadata and download slides from TCGA (GDC) and TCIA using clean, modular services.

from tcga_tools.pathology import PathologyDataPortal
from tcga_tools.services.tcia_pathology import TciaSeriesQuery

portal = PathologyDataPortal()

# --- TCIA: SOP Instance lookup and downloads ---
query = TciaSeriesQuery(series_instance_uid="uid.series.instance", format_="JSON")
sop_result = portal.list_tcia_sop_instance_uids(query)
portal.download_tcia_series(series_instance_uid="uid.series.instance", output_dir="./TCIA")

# --- TCGA: download pathology files via GDC wrapper ---
tcga_files = portal.download_tcga_project(
    project_id="TCGA-LUSC",
    filetypes=[".svs"],
    output_dir="./TCGA-LUSC",
)

📚 TCIA Pathology Workflows

TCIA endpoints supported via the wrapper:

- SOPInstanceUID lookup for a SeriesInstanceUID (sop_instance_uids)
- Single-image download for a SeriesInstanceUID + SOPInstanceUID
- Series download as a zip file
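
For orientation, the three operations above correspond to routes in TCIA's public NBIA REST API. The sketch below only builds the request URLs. The base URL and the helper names (`tcia_url`, `sop_instance_uids_url`, `single_image_url`, `series_zip_url`) are assumptions for illustration and are not part of TCGA-Tools; the wrapper may use different routes or API versions internally.

```python
from urllib.parse import urlencode

# Assumption: public NBIA REST API (v1) base URL; the wrapper may differ.
TCIA_BASE = "https://services.cancerimagingarchive.net/nbia-api/services/v1"


def tcia_url(endpoint, **params):
    """Build a TCIA REST URL from an endpoint name and query parameters."""
    return f"{TCIA_BASE}/{endpoint}?{urlencode(params)}"


def sop_instance_uids_url(series_uid):
    """URL that lists the SOPInstanceUIDs belonging to one series."""
    return tcia_url("getSOPInstanceUIDs", SeriesInstanceUID=series_uid)


def single_image_url(series_uid, sop_uid):
    """URL that downloads a single image from a series."""
    return tcia_url(
        "getSingleImage", SeriesInstanceUID=series_uid, SOPInstanceUID=sop_uid
    )


def series_zip_url(series_uid):
    """URL that downloads a whole series as a zip archive."""
    return tcia_url("getImage", SeriesInstanceUID=series_uid)


# Placeholder UIDs, matching the placeholders used elsewhere in this README.
print(sop_instance_uids_url("uid.series.instance"))
```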

✅ Usage Examples

1) Download all TCGA lung slides with subtype labels + clinical endpoints

import tcga_tools as tt

tt.Download(
    dataset_name=["TCGA-LUAD", "TCGA-LUSC"],
    filetypes=[".svs"],
    annotations=["clinical", "diagnosis"],
    output_dir="./TCGA-LUNG",
)

2) Download all TCGA-SKCM slides with genetic information

import tcga_tools as tt

tt.Download(
    dataset_name="TCGA-SKCM",
    filetypes=[".svs"],
    annotations=["molecular"],
    output_dir="./TCGA-SKCM",
)

3) Download all TCIA radiology and pathology images from a dataset

from tcga_tools.pathology import PathologyDataPortal
from tcga_tools.services.tcia_pathology import TciaSeriesQuery

portal = PathologyDataPortal()

# Suppose you already have SeriesInstanceUIDs for a TCIA collection
series_uids = [
    "uid.series.instance.1",
    "uid.series.instance.2",
]

for series_uid in series_uids:
    sop_payload = portal.list_tcia_sop_instance_uids(
        TciaSeriesQuery(series_instance_uid=series_uid, format_="JSON")
    )
    portal.download_tcia_series(series_instance_uid=series_uid, output_dir="./TCIA-DATASET")

📊 Example Outputs (with statistics=True, visualizations=True)

- Summary log of transformations and queries
- Distributions of diagnosis categories
- Survival curves based on clinical annotations
- Counts per file type and annotation

Outputs

- data/ (downloads)
- files_metadata.csv (flattened file + case/sample fields)
- groups.csv (per-case: paired / tumor_only / normal_only)
- clinical.csv, molecular_index.csv, reports_index.csv, diagnosis.csv (if requested)
- gdc_manifest.tsv (for the GDC Transfer Tool)
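
To make the per-case labels in groups.csv concrete, here is a minimal sketch of how such a grouping could be derived from sample types. It is not the package's actual rule: the function names, the "anything not normal counts as tumor" heuristic, and the `unknown` fallback label are assumptions for illustration. The sample type strings follow common GDC values such as "Primary Tumor" and "Solid Tissue Normal".

```python
from collections import defaultdict


def classify_case(sample_types):
    """Label one case from the sample types of its files.

    Assumption: a sample type containing "normal" is a normal sample and
    every other non-empty value is treated as tumor.
    """
    labels = [s.lower() for s in sample_types if s]  # drop None / empty values
    has_normal = any("normal" in s for s in labels)
    has_tumor = any("normal" not in s for s in labels)
    if has_tumor and has_normal:
        return "paired"
    if has_tumor:
        return "tumor_only"
    if has_normal:
        return "normal_only"
    return "unknown"  # hypothetical label for cases without sample types


def group_cases(rows):
    """Group (case_id, sample_type) pairs and label each case."""
    by_case = defaultdict(list)
    for case_id, sample_type in rows:
        by_case[case_id].append(sample_type)
    return {case_id: classify_case(types) for case_id, types in by_case.items()}


rows = [
    ("case-1", "Primary Tumor"),
    ("case-1", "Solid Tissue Normal"),
    ("case-2", "Primary Tumor"),
    ("case-3", None),  # sample type missing in the metadata
]
print(group_cases(rows))
```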

Authentication

If you need controlled-access files, set an environment variable with your token:

export GDC_TOKEN="<your token>"
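
If you also call the GDC API yourself, the same variable can be reused. The GDC API accepts the token in the `X-Auth-Token` request header; the helper below is a minimal sketch with a hypothetical name (`gdc_auth_headers`) and does not describe how TCGA-Tools reads the token internally.

```python
import os


def gdc_auth_headers(env=os.environ):
    """Return GDC auth headers if GDC_TOKEN is set, else an empty dict."""
    token = env.get("GDC_TOKEN", "").strip()  # tolerate a trailing newline
    return {"X-Auth-Token": token} if token else {}


# Open-access requests simply get no extra headers.
headers = gdc_auth_headers()
print("token configured" if headers else "no token configured")
```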

Checking available datasets

import tcga_tools as tt
tt.list_datasets()

Annotations argument

Pass any subset of:

"clinical" — survival/clinical outcome/treatment effect (diagnoses, treatments, follow-ups, exposures)
"molecular" — DNA/RNA/CNV/Methylation file index
"report" — free-text/clinical/pathology reports (XML/PDF)
"diagnosis" — diagnostic subtype, morphology, stage/grade
"all" — everything above

Handling missing data

GDC projects vary in completeness. TCGA-Tools is defensive:

- It requests a broad set of fields; if the API rejects the field list (HTTP 400), it retries without a field list to maximize returned content.
- JSON is flattened into wide CSVs; absent fields either do not appear as columns or appear with empty values.
- Grouping logic remains robust even if sample types are missing.
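
The flattening step can be pictured with the following sketch. It is an illustration under assumptions, not the package's implementation: the function names, the dotted column naming (`cases.0.case_id`), and the sorted column order are all hypothetical. It shows how nested records with different fields end up in one wide table where missing fields are left empty.

```python
import csv
import io


def flatten(record, parent="", sep="."):
    """Flatten nested dicts/lists into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, f"{name}{sep}{i}", sep))
                else:
                    flat[f"{name}{sep}{i}"] = item
        else:
            flat[name] = value
    return flat


def to_wide_csv(records):
    """Write records as CSV text using the union of all flattened columns."""
    rows = [flatten(r) for r in records]
    columns = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)  # fields missing from a row are written as ""
    return buffer.getvalue()


records = [
    {"file_id": "f1", "cases": [{"case_id": "c1", "samples": [{"sample_type": "Primary Tumor"}]}]},
    {"file_id": "f2", "cases": [{"case_id": "c2"}]},  # no sample information
]
print(to_wide_csv(records))
```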

CLI

python -m tcga_tools --dataset TCGA-LUSC --filetypes .svs \
  --annotations clinical molecular report diagnosis --out ./TCGA-LUSC
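
As a reading aid, the flags in the command above can be modeled with a small argparse parser. This is a hypothetical sketch that mirrors only the flags shown here; the real CLI's defaults, additional options, and validation are not documented in this README and may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the CLI example."""
    parser = argparse.ArgumentParser(prog="tcga_tools")
    parser.add_argument("--dataset", nargs="+", required=True,
                        help="one or more project IDs, e.g. TCGA-LUSC")
    parser.add_argument("--filetypes", nargs="+", default=None,
                        help="file extensions to fetch, e.g. .svs")
    parser.add_argument("--annotations", nargs="+", default=None,
                        help="clinical, molecular, report, diagnosis, or all")
    parser.add_argument("--out", default=".", help="output directory")
    return parser


args = build_parser().parse_args([
    "--dataset", "TCGA-LUSC", "--filetypes", ".svs",
    "--annotations", "clinical", "molecular", "report", "diagnosis",
    "--out", "./TCGA-LUSC",
])
print(args.dataset, args.filetypes, args.annotations, args.out)
```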

Requirements

Python ≥ 3.9
Tested on Linux, macOS, Windows
Dependencies are listed in pyproject.toml and installed automatically.

Logging

All downloads and transformations are logged to download.log in your output directory for reproducibility.

Raw mode

Preview available data without downloading:

tt.Download(dataset_name="TCGA-LUSC", raw=True)

Testing

Run unit tests:

pytest tests/

Notes

For very large downloads, prefer the emitted gdc_manifest.tsv with the GDC Data Transfer Tool.
Extend config.py to add/modify field lists or filetype preferences as needed.
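
For the manifest route mentioned above, the sketch below assembles a GDC Data Transfer Tool command. It assumes that gdc_manifest.tsv sits at the top level of the output directory and that files should land in its data/ subfolder; the helper name is hypothetical. Note that `gdc-client` takes the token as a file path via `-t`, not from the `GDC_TOKEN` environment variable.

```python
import shlex
from pathlib import Path


def gdc_client_command(output_dir, token_file=None, executable="gdc-client"):
    """Build a `gdc-client download` command for an emitted manifest."""
    out = Path(output_dir)
    cmd = [
        executable, "download",
        "-m", str(out / "gdc_manifest.tsv"),  # assumed manifest location
        "-d", str(out / "data"),              # assumed download directory
    ]
    if token_file:
        cmd += ["-t", str(token_file)]  # only needed for controlled-access files
    return cmd


cmd = gdc_client_command("./TCGA-LUSC")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```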

License

Apache 2.0 — free for research and commercial use.

Contributing

Contributions are welcome! Please open an issue or PR on GitHub.
