Common Fund Database

A Python utility for parsing and normalizing various DCC datapackages.

CFDB is a Python package for querying and serving C2M2 (Crosscut Metadata Model) file metadata from Common Fund Data Coordinating Centers (DCCs).

Installation

pip install git+https://github.com/abdenlab/cfdb.git

Requires Python 3.10 or later.

Setup

Prerequisites

  • Docker - For running MongoDB and the API

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| `SYNC_API_KEY` | API key for the sync endpoint (required; the API won't start without it) | - |
| `SYNC_DATA_DIR` | Directory for downloaded sync data files | - |
| `CFDB_API_URL` | Base URL for the cfdb API | `http://localhost:8000` |
| `DATABASE_URL` | MongoDB connection string | `mongodb://localhost:27017` |
| `MONGODB_TLS_ENABLED` | Enable X.509 certificate authentication (production) | `false` |
| `MONGODB_CERT_PATH` | Path to client certificate bundle | `/etc/cfdb/certs/client-bundle.pem` |
| `MONGODB_CA_PATH` | Path to CA certificate | `/etc/cfdb/certs/ca.pem` |
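As an illustration of how these variables and their defaults combine, a minimal sketch (the `load_config` helper is hypothetical, not part of the cfdb package):

```python
def load_config(env: dict[str, str]) -> dict:
    """Resolve settings from an environment mapping, applying the defaults
    from the table above. Illustrative sketch only, not cfdb's actual code."""
    if "SYNC_API_KEY" not in env:
        # The API refuses to start without a sync key.
        raise RuntimeError("SYNC_API_KEY is required")
    return {
        "sync_api_key": env["SYNC_API_KEY"],
        "sync_data_dir": env.get("SYNC_DATA_DIR"),  # no default
        "api_url": env.get("CFDB_API_URL", "http://localhost:8000"),
        "database_url": env.get("DATABASE_URL", "mongodb://localhost:27017"),
        "tls_enabled": env.get("MONGODB_TLS_ENABLED", "false").lower() == "true",
        "cert_path": env.get("MONGODB_CERT_PATH", "/etc/cfdb/certs/client-bundle.pem"),
        "ca_path": env.get("MONGODB_CA_PATH", "/etc/cfdb/certs/ca.pem"),
    }
```

In practice you would pass `os.environ` (or a subset of it) as the mapping.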

Quick Start

# 1. Start MongoDB (restores sample data and creates indexes)
make mongodb

# 2. Start the API server
make api

# 3. (Optional) Sync latest DCC metadata
curl -X POST -H "X-API-Key: dev-sync-key" http://localhost:8000/sync

This starts:

  • MongoDB on port 27017 (with indexes)
  • GraphQL/REST API on port 8000

Production Deployment (TLS/X.509)

For production, MongoDB uses TLS encryption with X.509 certificate authentication:

# 1. Generate certificates (customize hostname/IP as needed)
./certs/generate-certs.sh mongodb.example.com 10.0.1.50

# Or use environment variables
MONGODB_HOSTNAME=mongodb.example.com MONGODB_IP=10.0.1.50 ./certs/generate-certs.sh

# 2. Start MongoDB with TLS
make mongodb-prod

# 3. Start API with client certificate
make api-prod

The certificate script generates:

  • certs/ca/ca.pem - CA certificate (deploy to all containers)
  • certs/server/mongodb-server-bundle.pem - MongoDB server certificate
  • certs/clients/cfdb-api-bundle.pem - API client certificate
  • certs/clients/cfdb-materializer-bundle.pem - Materializer client certificate

Run ./certs/generate-certs.sh --help for full usage information.

Makefile Targets

| Target | Description |
| --- | --- |
| `make mongodb` | Build and start MongoDB with sample data and indexes |
| `make api` | Build and start the API container |
| `make materialize-files` | Manually materialize all file metadata (usually done via sync) |
| `make materialize-dcc DCC=hubmap` | Materialize a single DCC |
| `make certs` | Generate TLS certificates for production |
| `make mongodb-prod` | Start MongoDB with TLS/X.509 authentication |
| `make api-prod` | Start API with X.509 client certificate |

Sync Workflow

The sync endpoint (POST /sync) handles the full data refresh:

  1. Downloads C2M2 datapackages from DCCs
  2. Loads data into underlying MongoDB collections
  3. Runs the Rust materializer to create the fully-joined files collection

The materializer is included in the API Docker image and runs automatically after each DCC sync.

API Usage

GraphQL Endpoint

URL: POST /metadata

Query file metadata using GraphQL. The API exposes two queries:

files Query

Returns a paginated list of files matching the input criteria. The signature is files(input: [FileMetadataInput], page: Int = 0, pageSize: Int = 100); all arguments are optional.

query {
  files(page: 0, pageSize: 100) {
    idNamespace
    localId
    filename
    sizeInBytes
    dcc {
      dccAbbreviation
      dccName
    }
    fileFormat {
      name
    }
    collections {
      name
      biosamples {
        anatomy {
          name
        }
      }
    }
  }
}

# Query all files (first page)
curl -X POST http://localhost:8000/metadata \
  -H "Content-Type: application/json" \
  -d '{"query": "{ files { filename sizeInBytes dcc { dccAbbreviation } } }"}'

# Query files with pagination
curl -X POST http://localhost:8000/metadata \
  -H "Content-Type: application/json" \
  -d '{"query": "{ files(page: 0, pageSize: 10) { filename } }"}'

# Query files from a specific DCC
curl -X POST http://localhost:8000/metadata \
  -H "Content-Type: application/json" \
  -d '{"query": "{ files(input: [{ dcc: [{ dccAbbreviation: [\"4DN\"] }] }]) { filename dcc { dccAbbreviation } } }"}'
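The same calls can be made from Python with only the standard library. A sketch, assuming the API is running at its default address (the helper names here are illustrative, not part of cfdb):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/metadata"  # or the CFDB_API_URL value

def graphql_payload(query: str) -> bytes:
    """Encode a GraphQL query as the JSON body the /metadata endpoint expects."""
    return json.dumps({"query": query}).encode("utf-8")

def run_query(query: str) -> dict:
    """POST a GraphQL query to the metadata endpoint and decode the response."""
    req = request.Request(
        API_URL,
        data=graphql_payload(query),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running API):
# run_query('{ files(page: 0, pageSize: 10) { filename } }')
```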

file Query

Returns a single file by its MongoDB ObjectId.

query {
  file(id: "507f1f77bcf86cd799439011") {
    filename
    accessUrl
  }
}

curl -X POST http://localhost:8000/metadata \
  -H "Content-Type: application/json" \
  -d '{"query": "{ file(id: \"507f1f77bcf86cd799439011\") { filename accessUrl } }"}'

Data Model

The API serves file metadata following the C2M2 data model. Below is the complete schema.

FileMetadataModel

The central entity representing a stable digital asset.

| Field | Type | Description |
| --- | --- | --- |
| `id_namespace` | string | CFDE-cleared identifier for the top-level data space (PK part 1) |
| `local_id` | string | Identifier unique within the namespace (PK part 2) |
| `dcc` | DCC | The Data Coordinating Center that produced this file |
| `collections` | Collection[] | Collections containing this file |
| `project` | Project? | The primary project within which this file was created |
| `project_id_namespace` | string | Project namespace (FK part 1) |
| `project_local_id` | string | Project local ID (FK part 2) |
| `persistent_id` | string? | Permanent URI or compact ID |
| `creation_time` | string? | ISO 8601 timestamp |
| `size_in_bytes` | int? | File size |
| `sha256` | string? | SHA-256 checksum (preferred) |
| `md5` | string? | MD5 checksum (if SHA-256 unavailable) |
| `filename` | string | Filename without path |
| `file_format` | FileFormat? | EDAM CV term for digital format |
| `compression_format` | string? | EDAM CV term for compression (e.g., gzip) |
| `data_type` | DataType? | EDAM CV term for data type |
| `assay_type` | AssayType? | OBI CV term for experiment type |
| `analysis_type` | string? | OBI CV term for analysis type |
| `mime_type` | string? | MIME type |
| `bundle_collection_id_namespace` | string? | Bundle collection namespace |
| `bundle_collection_local_id` | string? | Bundle collection local ID |
| `dbgap_study_id` | string? | dbGaP study ID for access control |
| `access_url` | string? | DRS URI or publicly accessible URL |
| `status` | string? | Dataset status (e.g., "Published", "QA"); HuBMAP-specific |
| `data_access_level` | string? | Access level: public, consortium, or protected; HuBMAP-specific |

DCC

A Common Fund program or Data Coordinating Center.

| Field | Type | Description |
| --- | --- | --- |
| `id` | string | CFDE-CC issued identifier |
| `dcc_name` | string | Human-readable label |
| `dcc_abbreviation` | string | Short display label |
| `dcc_description` | string? | Human-readable description |
| `contact_email` | string | Primary technical contact email |
| `contact_name` | string | Primary technical contact name |
| `dcc_url` | string | DCC website URL |
| `project_id_namespace` | string | Project namespace |
| `project_local_id` | string | Project local ID |

Collection

A grouping of files, biosamples, and/or subjects.

| Field | Type | Description |
| --- | --- | --- |
| `id_namespace` | string | Collection namespace (PK part 1) |
| `local_id` | string | Collection local ID (PK part 2) |
| `biosamples` | Biosample[] | Biosamples in this collection |
| `subjects` | Subject[] | Subjects (donors) directly in this collection |
| `anatomy` | Anatomy[] | Anatomy terms associated with this collection |
| `persistent_id` | string? | Permanent URI |
| `creation_time` | string? | ISO 8601 timestamp |
| `abbreviation` | string? | Short display label |
| `name` | string | Human-readable label |
| `description` | string? | Human-readable description |

Biosample

A tissue sample or other physical specimen.

| Field | Type | Description |
| --- | --- | --- |
| `id_namespace` | string | Biosample namespace (PK part 1) |
| `local_id` | string | Biosample local ID (PK part 2) |
| `project_id_namespace` | string | Project namespace (FK part 1) |
| `project_local_id` | string | Project local ID (FK part 2) |
| `persistent_id` | string? | Permanent URI |
| `creation_time` | string? | ISO 8601 timestamp |
| `sample_prep_method` | string? | OBI CV term for preparation method |
| `anatomy` | Anatomy? | UBERON CV term for anatomical origin |
| `biofluid` | string? | UBERON/InterLex term for fluid origin |
| `subjects` | Subject[] | Subjects (donors) from which this biosample was derived |

Anatomy

A UBERON (Uber-anatomy ontology) CV term.

| Field | Type | Description |
| --- | --- | --- |
| `id` | string | UBERON CV term identifier |
| `name` | string | Human-readable label |
| `description` | string? | Human-readable description |

FileFormat

An EDAM CV 'format:' term describing digital format.

| Field | Type | Description |
| --- | --- | --- |
| `id` | string | EDAM format term identifier |
| `name` | string | Human-readable label |
| `description` | string? | Human-readable description |

DataType

An EDAM CV 'data:' term describing the type of data.

| Field | Type | Description |
| --- | --- | --- |
| `id` | string | EDAM data term identifier |
| `name` | string | Human-readable label |
| `description` | string? | Human-readable description |

AssayType

An OBI (Ontology for Biomedical Investigations) CV term describing experiment types.

| Field | Type | Description |
| --- | --- | --- |
| `id` | string | OBI CV term identifier |
| `name` | string | Human-readable label |
| `description` | string? | Human-readable description |

Subject

A human or organism from which biosamples are derived.

| Field | Type | Description |
| --- | --- | --- |
| `id_namespace` | string | Subject namespace (PK part 1) |
| `local_id` | string | Subject local ID (PK part 2) |
| `project_id_namespace` | string | Project namespace (FK part 1) |
| `project_local_id` | string | Project local ID (FK part 2) |
| `persistent_id` | string? | Permanent URI |
| `creation_time` | string? | ISO 8601 timestamp |
| `granularity` | string? | CFDE CV term (single organism, cell line, microbiome, etc.) |
| `sex` | string? | NCIT CV term for biological sex |
| `ethnicity` | string? | NCIT CV term for self-reported ethnicity |
| `age_at_enrollment` | float? | Age in years when enrolled in primary project |
| `age_at_sampling` | float? | Age in years when biosample was taken |
| `race` | string[] | CFDE CV terms for self-identified race(s) |
| `taxonomy` | NCBITaxonomy? | NCBI taxonomy for the subject's organism |

NCBITaxonomy

An NCBI Taxonomy term for organism classification.

| Field | Type | Description |
| --- | --- | --- |
| `id` | string | NCBI Taxonomy Database ID (e.g., NCBI:txid9606) |
| `name` | string | Taxonomy name (e.g., "Homo sapiens") |
| `clade` | string? | Phylogenetic level (e.g., species, genus) |
| `description` | string? | Human-readable description |

Project

A node in the C2M2 project hierarchy.

| Field | Type | Description |
| --- | --- | --- |
| `id_namespace` | string | Project namespace (PK part 1) |
| `local_id` | string | Project local ID (PK part 2) |
| `name` | string | Human-readable label |
| `abbreviation` | string? | Short display label |
| `description` | string? | Human-readable description |
| `persistent_id` | string? | Permanent URI or compact ID |

Query Mechanics

The GraphQL API uses an implicit OR/AND clause system for building MongoDB queries.

How It Works:

  1. Lists become OR clauses: Multiple values in an array are combined with $or
  2. Dict keys become AND clauses: Multiple fields in an object are combined with $and

Simple Query - Single Value
query {
  files(input: [{ filename: ["data.csv"] }]) {
    filename
  }
}

MongoDB query:

{ "filename": "data.csv" }

OR Query - Multiple Values in a List

Find files with either filename:

query {
  files(input: [{ filename: ["data.csv", "results.tsv"] }]) {
    filename
  }
}

MongoDB query:

{ "$or": [{ "filename": "data.csv" }, { "filename": "results.tsv" }] }

AND Query - Multiple Fields

Find files matching both criteria:

query {
  files(input: [{
    filename: "data.csv",
    dcc: { dccAbbreviation: ["4DN"] }
  }]) {
    filename
    dcc { dccAbbreviation }
  }
}

MongoDB query:

{
  "$and": [
    { "filename": "data.csv" },
    { "dcc.dcc_abbreviation": "4DN" }
  ]
}

Combined OR/AND Query

Find files from 4DN OR HuBMAP with specific file formats:

query {
  files(input: [{
    dcc: [
      { dccAbbreviation: ["4DN"] },
      { dccAbbreviation: ["HuBMAP"] }
    ],
    fileFormat: { name: "FASTQ" }
  }]) {
    filename
    dcc { dccAbbreviation }
    fileFormat { name }
  }
}

MongoDB query:

{
  "$and": [
    { "$or": [
      { "dcc.dcc_abbreviation": "4DN" },
      { "dcc.dcc_abbreviation": "HuBMAP" }
    ]},
    { "file_format.name": "FASTQ" }
  ]
}

Nested Entity Query

Find files from biosamples with specific anatomy:

query {
  files(input: [{
    collections: {
      biosamples: {
        anatomy: { name: "heart" }
      }
    }
  }]) {
    filename
    collections {
      biosamples {
        anatomy { name }
      }
    }
  }
}
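The OR/AND rules above can be expressed as a small recursive translator. This is an illustration of the documented behavior, not the package's actual code; it takes snake_case field names (the API maps GraphQL camelCase onto these) and reproduces the MongoDB filters shown in the examples:

```python
def _field_clause(path: str, value) -> dict:
    """Translate one input field into a MongoDB clause.
    Lists become $or; nested objects extend the dotted field path."""
    if isinstance(value, dict):
        return _object_clause(path, value)
    if isinstance(value, list):
        clauses = [_field_clause(path, v) for v in value]
        return clauses[0] if len(clauses) == 1 else {"$or": clauses}
    return {path: value}

def _object_clause(prefix: str, obj: dict) -> dict:
    """Multiple keys in one object combine with $and."""
    clauses = [
        _field_clause(f"{prefix}.{key}" if prefix else key, value)
        for key, value in obj.items()
    ]
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

def build_filter(input_obj: dict) -> dict:
    """Translate one FileMetadataInput-like dict into a MongoDB filter."""
    return _object_clause("", input_obj)

# Reproduces the OR example above:
# build_filter({"filename": ["data.csv", "results.tsv"]})
#   -> {"$or": [{"filename": "data.csv"}, {"filename": "results.tsv"}]}
```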

Pagination

Use page and pageSize parameters:

query {
  files(page: 0, pageSize: 50) {
    filename
  }
}
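Offset pagination of this shape typically maps straight onto MongoDB's skip/limit. A tiny illustrative helper (assuming that mapping; not part of cfdb):

```python
def pagination_args(page: int = 0, page_size: int = 100) -> tuple[int, int]:
    """Translate page/pageSize into MongoDB (skip, limit) values."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and pageSize must be > 0")
    return page * page_size, page_size
```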

Entity Relationships

The data model uses MongoDB aggregation pipelines to join related entities:

file
├── dcc (DCC) ─────────────────── via submission field
├── project (Project) ─────────── via project FK
├── file_format (FileFormat) ──── via file_format ID
├── data_type (DataType) ──────── via data_type ID
├── assay_type (AssayType) ────── via assay_type ID
└── collections[] (Collection) ── via file_in_collection
    ├── anatomy[] (Anatomy) ───── via collection_anatomy
    ├── subjects[] (Subject) ──── via subject_in_collection
    │   └── taxonomy (NCBITaxonomy) ── via subject_role_taxonomy
    └── biosamples[] (Biosample) ─ via biosample_in_collection
        ├── anatomy (Anatomy) ──── via anatomy ID
        └── subjects[] (Subject) ─ via biosample_from_subject
            └── taxonomy (NCBITaxonomy) ── via subject_role_taxonomy

Cross-reference tables:

  • file_in_collection - Links files to collections
  • biosample_in_collection - Links biosamples to collections
  • subject_in_collection - Links subjects directly to collections
  • biosample_from_subject - Links biosamples to their source subjects
  • collection_anatomy - Links anatomy terms to collections
  • subject_role_taxonomy - Links subjects to NCBI taxonomy terms
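To illustrate the join style, here is a $lookup stage that attaches file_in_collection links to each file document. The collection name comes from the list above, but the link-table field names are assumptions for illustration, not cfdb's actual pipeline:

```python
# One aggregation stage joining files to their file_in_collection links.
# The composite (id_namespace, local_id) key requires an expression match,
# hence the let/pipeline form of $lookup rather than localField/foreignField.
file_links_stage = {
    "$lookup": {
        "from": "file_in_collection",
        "let": {"ns": "$id_namespace", "id": "$local_id"},
        "pipeline": [
            {"$match": {"$expr": {"$and": [
                {"$eq": ["$file_id_namespace", "$$ns"]},
                {"$eq": ["$file_local_id", "$$id"]},
            ]}}},
        ],
        "as": "collection_links",
    }
}
```

Running it would require pymongo and a populated database, e.g. `db.file.aggregate([file_links_stage, ...])`.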

GraphiQL IDE

URL: GET /metadata

Visit http://localhost:8000/metadata in your browser to access GraphiQL, an interactive IDE for exploring and testing GraphQL queries.

Features:

  • Schema Documentation - Browse all available types, fields, and their descriptions
  • Query Editor - Write queries with syntax highlighting and error detection
  • Autocomplete - Get field suggestions as you type (Ctrl+Space)
  • Query History - Access previously executed queries
  • Response Viewer - See formatted JSON results

File Streaming Endpoint

URL: GET /data/{dcc}/{local_id} | HEAD /data/{dcc}/{local_id}

Stream file contents from DCCs via HTTPS. Supports both GET (download) and HEAD (metadata only) requests.

Path Parameters:

  • dcc - DCC abbreviation (e.g., 4dn, hubmap) - case insensitive
  • local_id - The file's unique ID within the DCC

Headers:

  • Range (optional) - Supports bytes=start-end for partial content requests

Response Codes:

| Code | Description |
| --- | --- |
| 200 | Full file content (GET) or file metadata (HEAD) |
| 206 | Partial content (Range request) |
| 400 | Invalid DCC or Range header |
| 403 | File requires authentication (consortium/protected access) |
| 404 | File not found |
| 501 | No supported access method (e.g., Globus-only files) |
| 502 | Upstream service error |
| 504 | Service timeout |

Example:

# Check file availability (HEAD request)
curl -I http://localhost:8000/data/4dn/abc123

# Download a 4DN file
curl -O http://localhost:8000/data/4dn/abc123

# Download with Range header
curl -H "Range: bytes=0-1023" http://localhost:8000/data/hubmap/xyz789
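The documented bytes=start-end form can be validated along these lines. This is a sketch of the semantics, not the server's code; `parse_range` is a hypothetical helper:

```python
import re

def parse_range(header: str, size: int) -> tuple[int, int]:
    """Parse a 'bytes=start-end' Range header into inclusive byte offsets.
    Raises ValueError for headers the endpoint would reject with 400."""
    match = re.fullmatch(r"bytes=(\d+)-(\d*)", header.strip())
    if match is None:
        raise ValueError(f"invalid Range header: {header!r}")
    start = int(match.group(1))
    # An open-ended range ("bytes=500-") runs to the last byte of the file.
    end = int(match.group(2)) if match.group(2) else size - 1
    if start >= size or start > end:
        raise ValueError(f"unsatisfiable range: {header!r}")
    return start, min(end, size - 1)
```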

Sync Endpoint

URL: POST /sync

Trigger a sync of C2M2 datapackages from DCCs. Requires API key authentication.

Behavior:

  • Single sync at a time - Only one sync task can run at a time. Concurrent requests return 409 Conflict.
  • Background execution - The endpoint returns immediately with a 202 Accepted response while the sync runs in the background.
  • Sync process - For each DCC, the sync: downloads the datapackage, extracts it, clears existing DCC data, loads new data, materializes files, then cleans up temporary files.
  • Materialization - After loading each DCC's data, the Rust materializer runs to create the denormalized files collection with all joins pre-computed. This is incremental - only the synced DCC's files are updated.
  • Database cutover - During the clear/load phase, API requests (GraphQL queries and file streaming) are briefly blocked to ensure data consistency. Requests wait for the cutover to complete before proceeding.

Headers:

  • X-API-Key (required) - API key matching SYNC_API_KEY environment variable

Query Parameters:

  • dccs (optional, repeatable) - DCC names to sync. If omitted, syncs all DCCs.

Response Codes:

| Code | Description |
| --- | --- |
| 202 | Sync started successfully |
| 401 | Invalid API key |
| 409 | A sync is already in progress |
| 500 | Server configuration error |

Example:

# Sync all DCCs
curl -X POST -H "X-API-Key: your-key" http://localhost:8000/sync

# Sync specific DCCs
curl -X POST -H "X-API-Key: your-key" "http://localhost:8000/sync?dccs=4dn&dccs=hubmap"

Sync Status Endpoint

URL: GET /sync/{task_id}

Check the status of a sync task.

Path Parameters:

  • task_id - The task ID returned when starting a sync

Response:

{
  "task_id": "abc-123",
  "status": "running",
  "dcc_names": ["4dn", "hubmap"],
  "started_at": "2024-01-15T10:30:00",
  "completed_at": null
}

Response Codes:

| Code | Description |
| --- | --- |
| 200 | Task status returned |
| 404 | Task not found |

Example:

# Start a sync and get task ID
curl -X POST -H "X-API-Key: your-key" "http://localhost:8000/sync?dccs=4dn"
# Returns: {"task_id": "abc-123", ...}

# Check sync status
curl http://localhost:8000/sync/abc-123

CLI Usage

cfdb sync

Trigger a sync via the cfdb API.

# Sync all DCCs
cfdb sync

# Sync specific DCCs
cfdb sync 4dn hubmap

# Specify API URL
cfdb sync --api-url http://api.example.com 4dn

# Specify API key (or set SYNC_API_KEY env var)
cfdb sync --api-key your-key

Options:

  • --api-url - cfdb API base URL (default: http://localhost:8000, env: CFDB_API_URL)
  • --api-key - API key for sync endpoint (env: SYNC_API_KEY)
  • --debug / -d - Enable debugpy debugging

HuBMAP Data Portal Filter Mapping

The following table maps HuBMAP data portal search dimensions to CFDB/C2M2 fields:

| Category | HuBMAP Dimension | CFDB Field | Status | Notes |
| --- | --- | --- | --- | --- |
| Dataset | Dataset/Assay Type | `assay_type.name` | ✅ | OBI CV terms (CODEX, RNA-seq, etc.) |
| Dataset | Data Type | `data_type.name` | ✅ | EDAM CV terms |
| Dataset | File Format | `file_format.name` | ✅ | EDAM CV terms |
| Dataset | Data Access Level | `data_access_level` | ✅ | public/consortium/protected |
| Dataset | Status | `status` | ✅ | Published/QA (HuBMAP-specific) |
| Dataset | DCC/Affiliation | `dcc.dcc_abbreviation` | ✅ | Data provider |
| Organ | Organ | `collections.anatomy.name` | ✅ | UBERON CV terms |
| Sample | Sample Prep Method | `collections.biosamples.sample_prep_method` | ✅ | OBI CV terms |
| Sample | Biofluid | `collections.biosamples.biofluid` | ✅ | UBERON/InterLex terms |
| Donor | Sex | `collections.subjects.sex` | ✅ | NCIT CV terms |
| Donor | Age | `collections.subjects.age_at_enrollment` | ✅ | Decimal years |
| Donor | Age at Sampling | `collections.biosamples.subjects.age_at_sampling` | ✅ | Decimal years |
| Donor | Race | `collections.subjects.race` | ✅ | CFDE CV terms (multi-valued) |
| Donor | Ethnicity | `collections.subjects.ethnicity` | ✅ | NCIT CV terms |
| Donor | Granularity | `collections.subjects.granularity` | ✅ | single organism/cell line/etc. |
| Donor | BMI | - | ❌ | Not in C2M2 |
| Donor | Height/Weight | - | ❌ | Not in C2M2 |
| Donor | Medical History | - | ❌ | Diabetes, hypertension, etc. |
| Donor | Lifestyle | - | ❌ | Smoking, alcohol, drug use |
| Donor | Cause of Death | - | ❌ | Not in C2M2 |
| Donor | Blood Type | - | ❌ | Not in C2M2 |
| Processing | Pipeline | `analysis_type` | ⚠️ | OBI CV terms |
| Processing | Processing Type | - | ❌ | HuBMAP-specific |

Legend: ✅ Supported | ⚠️ Partial | ❌ Not Available

4DN Data Portal Filter Mapping

The following table maps 4DN data portal search dimensions to CFDB/C2M2 fields:

| Category | 4DN Dimension | CFDB Field | Status | Notes |
| --- | --- | --- | --- | --- |
| Experiment | Experiment Type | `assay_type.name` | ✅ | OBI CV terms (Hi-C, etc.) |
| Experiment | Data Category | `data_type.name` | ✅ | Sequencing vs Microscopy |
| File | File Format | `file_format.name` | ✅ | EDAM CV terms |
| File | File Size | `size_in_bytes` | ✅ | Integer bytes |
| Sample | Tissue/Anatomy | `collections.anatomy.name` | ✅ | UBERON CV terms |
| Sample | Sample Prep | `collections.biosamples.sample_prep_method` | ✅ | OBI CV terms |
| Sample | Biosource/Cell Line | `collections.biosamples.local_id` | ⚠️ | Cell line in biosample ID |
| Sample | Organism | `collections.subjects.taxonomy.name` | ✅ | NCBI taxonomy |
| Sample | Cell Line Tier | - | ❌ | 4DN-specific |
| Dataset | Dataset/Collection | `collections.name` | ✅ | Collection grouping |
| Dataset | Publication/DOI | `collections.persistent_id` | ⚠️ | If DOI linked |
| Dataset | Condition | - | ❌ | 4DN-specific |
| Provider | DCC | `dcc.dcc_abbreviation` | ✅ | Always "4DN" |
| Provider | Lab/Project | `project.name` | ✅ | Via project FK |

Legend: ✅ Supported | ⚠️ Partial | ❌ Not Available
