This repository provides tools and scripts for extracting and adding annotations to EMDB entries, which are used to enhance the metadata associated with EM datasets.
- Installation
- Docker Installation
- Configuration
- Usage
- Docker Usage
- Contributing
- License
To install the necessary dependencies, run: pip install -r requirements.txt
You can also run the scripts using Docker, which provides a containerized environment with all dependencies pre-installed.
docker build -t added-annotations .
This will create a Docker image with Python 3.8, BLAST+, and all required Python dependencies.
The repository uses a config.ini file for configuration, which is not included in the repository. This file should be created in the root directory of the project with the following structure:
[file_paths]
uniprot_tab = <path_to_file>/uniprot.tsv
CP_ftp = <path_to_file>/complextab
components_cif = <path_to_file>/components.cif
chem_comp_list = <path_to_file>/chem_comp_list.xml
pmc_ftp_gz = <path_to_file>/PMID_PMCID_DOI.csv.gz
pmc_ftp = <path_to_file>/PMID_PMCID_DOI.csv
emdb_pubmed = <path_to_file>/emdb_pubmed.log
emdb_orcid = <path_to_file>/emdb_orcid.log
assembly_ftp = <path_to_file>/assembly/
BLAST_DB = <path_to_file>/ncbi-blast-2.13.0+/database/uniprot_sprot
BLASTP_BIN = blastp
sifts_GO = <path_to_file>/pdb_chain_go.csv
GO_obo = <path_to_file>/go.obo
GO_interpro = /nfs/ftp/pub/databases/GO/goa/external2go/interpro2go
sifts = <path_to_file>/split_xml/
alphafold_ftp = <path_to_file>/accession_ids.txt
rfam_ftp = <path_to_file>/rfam_files_combined.txt
[api]
pmc = https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST
[params]
minimal_map_fragment_length = 15
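Before running any script, it can help to confirm that config.ini parses and that the files it points to exist. The following is a minimal sketch using Python's standard configparser module; the helper names load_config and missing_paths are hypothetical and not part of this repository. It assumes BLASTP_BIN is a command name resolved on PATH and BLAST_DB is a database prefix rather than a single file, so both are skipped.

```python
import configparser
from pathlib import Path


def load_config(path="config.ini"):
    """Parse config.ini and return the ConfigParser object."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    # Keep option names case-sensitive (e.g. BLAST_DB, CP_ftp);
    # by default ConfigParser lowercases them.
    parser.optionxform = str
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return parser


def missing_paths(parser, skip=("BLASTP_BIN", "BLAST_DB")):
    """Return the [file_paths] keys whose target does not exist on disk.

    Assumption: entries in `skip` are not plain files or directories
    (a command name and a BLAST database prefix), so they are ignored.
    """
    missing = []
    for key, value in parser["file_paths"].items():
        if key in skip:
            continue
        if not Path(value).exists():
            missing.append(key)
    return missing


if __name__ == "__main__":
    config = load_config("config.ini")
    for key in missing_paths(config):
        print(f"[file_paths] {key} points to a missing path")
    # Numeric parameters can be read with the typed getters.
    print(config.getint("params", "minimal_map_fragment_length"))
```

Running the module directly prints one line per missing entry, followed by the configured minimal_map_fragment_length.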
When using Docker, the config.ini file should use container paths. An example configuration file is provided in config.ini.docker-example. Create your own config file on the host machine with the following structure:
[file_paths]
CP_ftp = /data/cpx/
components_cif = /data/components.cif
pmc_ftp_gz = /data/pmc/PMID_PMCID_DOI.csv.gz
pmc_ftp = /data/pmc/PMID_PMCID_DOI.csv
assembly_ftp = /data/pdbe/assembly/
BLAST_DB = /data/uniprotkb_swissprot
BLASTP_BIN = blastp
sifts_GO = /data/pdbe/go/pdb_chain_go.csv
GO_obo = /data/go.obo
emdb_empiar_list = /data/emdb_empiar.json
sifts = /data/sifts/
alphafold_ftp = /data/accession_ids.txt
uniprot_tab = /data/uniprot.tsv
[api]
pmc = https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST
[params]
minimal_map_fragment_length = 15
Note: The paths in the Docker config should match the container mount points (e.g., /data/...), not the host paths.
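A quick way to catch host paths that slipped into a Docker config is to check every [file_paths] entry against the container mount points. This is a rough sketch only: the function name paths_outside_mounts and the default mount tuple ("/data",) are assumptions made for illustration, and BLASTP_BIN is skipped because it is a command name rather than a path.

```python
import configparser
from pathlib import PurePosixPath


def paths_outside_mounts(config_text, mounts=("/data",), skip=("BLASTP_BIN",)):
    """Return [file_paths] keys whose value is not under any mount point."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve key case such as CP_ftp
    parser.read_string(config_text)

    mount_paths = [PurePosixPath(m) for m in mounts]
    outside = []
    for key, value in parser["file_paths"].items():
        if key in skip:
            continue
        path = PurePosixPath(value)
        # A path is inside a mount if the mount is the path itself
        # or one of its parent directories.
        ancestors = (path, *path.parents)
        if not any(m in ancestors for m in mount_paths):
            outside.append(key)
    return outside


if __name__ == "__main__":
    example = """
[file_paths]
CP_ftp = /data/cpx/
GO_obo = /home/user/go.obo
BLASTP_BIN = blastp
"""
    print(paths_outside_mounts(example))
```

If additional directories are mounted (for example a separate metadata directory), they can be passed through the mounts argument.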
To use the tools and scripts in this repository, you just need to clone it and ensure the config.ini file is properly configured as described above.
Execute the scripts independently in the following recommended order:
fetch_empiar.py: python fetch_empiar.py -w <output_dir_to_store_annotated_empiar_files> -f <path_to_empiar_metadata_files>
fetch_pubmed.py: python fetch_pubmed.py -w <output_dir_to_store_annotated_pubmed_files> -f <path_to_emdb_metadata_files>
added_annotations.py: python added_annotations.py -w <output_dir_to_store_added_annotations> -f <path_to_emdb_metadata_files> --all -t <number_of_threads>
fetch_afdb.py: python fetch_afdb.py -w <output_dir_to_store_annotated_alphafdb_files>
write_xml.py: python write_xml.py <output_dir_to_store_EMICSS_xml_files>
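The recommended order above can also be scripted. The sketch below builds the five command lines and runs them one after another, stopping at the first failure. It is an illustration rather than a tool shipped with the repository: build_commands and run_all are hypothetical names, and it assumes a single shared output directory for all steps and that it is run from the repository root.

```python
import subprocess
import sys


def build_commands(output_dir, emdb_dir, empiar_dir, threads=4):
    """Return the script invocations in the recommended order."""
    py = sys.executable  # use the interpreter running this sketch
    return [
        [py, "fetch_empiar.py", "-w", output_dir, "-f", empiar_dir],
        [py, "fetch_pubmed.py", "-w", output_dir, "-f", emdb_dir],
        [py, "added_annotations.py", "-w", output_dir, "-f", emdb_dir,
         "--all", "-t", str(threads)],
        [py, "fetch_afdb.py", "-w", output_dir],
        [py, "write_xml.py", output_dir],
    ]


def run_all(commands):
    """Run each command in turn; check=True aborts on a non-zero exit."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder directories; replace with real locations.
    run_all(build_commands("output", "emdb_metadata", "empiar_metadata"))
```

Because check=True raises subprocess.CalledProcessError on failure, later steps do not run against incomplete output from an earlier one.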
When running the scripts in Docker, you need to mount your config file and data directories as read-only volumes, and an output directory as a writable volume. The general pattern is:
docker run --rm \
-v /path/on/host/config.ini:/config/config.ini:ro \
-v /path/on/host/data:/data:ro \
-v /path/on/host/output:/output \
added-annotations python <script_name.py> <arguments>
- -v /path/on/host/config.ini:/config/config.ini:ro mounts your config file as read-only
- -v /path/on/host/data:/data:ro mounts your data directory containing all required files (cpx, components.cif, etc.) as read-only
- -v /path/on/host/output:/output mounts the output directory for writing results (read-write)
Important:
- Use the :ro flag for read-only mounts on data and config to prevent accidental modifications
- Ensure your config.ini uses container paths (e.g., /data/...) that match your volume mounts
- Map all directories referenced in your config.ini file to appropriate container paths
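The mount pattern described above can be assembled programmatically, which avoids retyping the volume flags for each script. The following sketch is an assumption-laden illustration: docker_run_args is a hypothetical helper, the image name defaults to the added-annotations tag used in this document, and the paths in the usage example are placeholders.

```python
import shlex


def docker_run_args(script_args, config, output, data_mounts=None,
                    image="added-annotations"):
    """Build a `docker run` argument list for one script.

    config      host path to config.ini (mounted read-only)
    output      host directory for results (mounted read-write)
    data_mounts mapping of host path -> container path (mounted read-only)
    """
    args = ["docker", "run", "--rm",
            "-v", f"{config}:/config/config.ini:ro"]
    for host, container in (data_mounts or {}).items():
        args += ["-v", f"{host}:{container}:ro"]
    # The output mount has no :ro suffix, so the container can write to it.
    args += ["-v", f"{output}:/output", image, "python", *script_args]
    return args


if __name__ == "__main__":
    cmd = docker_run_args(
        ["fetch_afdb.py", "-w", "/output"],
        config="/path/on/host/config.ini",
        output="/path/on/host/output",
        data_mounts={"/path/on/host/data": "/data"},
    )
    # Print a copy-pasteable command; pass `cmd` to subprocess.run to execute.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps paths with spaces intact when the result is handed to subprocess.run.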
Execute the scripts independently in the following recommended order:
docker run --rm \
-v /path/on/host/config.ini:/config/config.ini:ro \
-v /path/on/host/empiar_metadata:/empiar_metadata:ro \
-v /path/on/host/output:/output \
added-annotations python fetch_empiar.py -w /output -f /empiar_metadata

docker run --rm \
-v /path/on/host/config.ini:/config/config.ini:ro \
-v /path/on/host/emdb_metadata:/emdb_metadata:ro \
-v /path/on/host/output:/output \
added-annotations python fetch_pubmed.py -w /output -f /emdb_metadata

docker run --rm \
-v /path/on/host/config.ini:/config/config.ini:ro \
-v /path/on/host/data:/data:ro \
-v /path/on/host/emdb_metadata:/emdb_metadata:ro \
-v /path/on/host/output:/output \
added-annotations python AddedAnnotations.py -w /output -f /emdb_metadata --all -t 4

docker run --rm \
-v /path/on/host/config.ini:/config/config.ini:ro \
-v /path/on/host/data:/data:ro \
-v /path/on/host/output:/output \
added-annotations python fetch_afdb.py -w /output

docker run --rm \
-v /path/on/host/config.ini:/config/config.ini:ro \
-v /path/on/host/output:/output \
added-annotations python generate_eupmc_links.py

docker run --rm \
-v /path/on/host/config.ini:/config/config.ini:ro \
-v /path/on/host/latest:/latest:ro \
-v /path/on/host/previous:/previous:ro \
added-annotations python compare_release.py /latest /previous

docker run --rm \
-v /path/on/host/output:/output \
added-annotations python write_xml.py /output

For more information about EMICSS, visit the official EMICSS website (https://www.ebi.ac.uk/emdb/emicss). This page provides detailed information about the EMDB/EMICSS project.