ETL pipeline for processing DrugBank data and aligning entities and identifiers with RTX-KG2 concepts. The resulting output is intended for use as training or evaluation data in downstream knowledge graph tasks.
- Create an account at https://go.drugbank.com
- Run:
  ./download_data.sh
  This will place the DrugBank XML file in the data/ directory.
You will need a compatible node synonymizer SQLite database for the KG version you are using.
- Recommended: Ask a team member for a local copy of the node synonymizer database (this is useful if you do not have access to the RTX database server).
- Alternative: Request access to the RTX database server (arax-databases.rtx.ai).
If access is granted, the scripts will automatically download the appropriate node synonymizer database when it is not found locally.
⚠️ The node synonymizer version must match the KG version passed via --kg-version.
The scripts first check for a local database and only attempt a download if it is not already available.
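The local-first lookup described above can be sketched roughly as follows. This is only an illustration, not the scripts' actual code: the file name pattern in `synonymizer_filename` and the `download` callback are hypothetical, and the real database name and transfer mechanism may differ.

```python
from pathlib import Path
from typing import Callable, Optional


def synonymizer_filename(kg_version: str) -> str:
    # Hypothetical naming convention; the real database file may be named differently.
    return f"node_synonymizer_KG{kg_version}.sqlite"


def resolve_synonymizer(
    kg_version: str,
    data_dir: Path,
    download: Optional[Callable[[str, Path], None]] = None,
) -> Path:
    """Return the path to the synonymizer for kg_version, downloading only if missing."""
    target = data_dir / synonymizer_filename(kg_version)
    if target.exists():
        # A local copy for this exact KG version wins; no network access is needed.
        return target
    if download is None:
        raise FileNotFoundError(
            f"No local synonymizer for KG {kg_version} in {data_dir} "
            "and no download method was provided"
        )
    data_dir.mkdir(parents=True, exist_ok=True)
    download(kg_version, target)  # hypothetical callback, e.g. an SFTP transfer
    if not target.exists():
        raise FileNotFoundError(f"Download did not produce {target}")
    return target
```

Because the KG version is part of the file name in this sketch, a database for a different KG version is never picked up by accident.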
Create and activate a conda environment:
conda create --name drugbank_ner python=3.11.10
conda activate drugbank_ner
Install required packages:
pip install xmltodict==0.14.2
pip install pandas==2.2.3
pip install spacy==3.8.2
pip install scispacy==0.5.5
Check your CUDA version:
nvidia-smi
Then install the matching CuPy package:
pip install cupy-cuda<your_cuda_version>x
Install the scispaCy models:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_lg-0.5.3.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_scibert-0.5.3.tar.gz
Important: As of the latest refactor, all scripts require the --kg-version argument.
This ensures that downloaded databases and alignment logic are consistent with the intended knowledge graph version.
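Before running the scripts, it can help to confirm that the packages from the installation steps are importable in the active environment. The snippet below is a small standalone check, not part of the repository; the module list is taken from the install commands above, and `cupy` is assumed to be the import name of whichever CuPy wheel was installed.

```python
from importlib.util import find_spec

# Import names corresponding to the pip install commands above.
REQUIRED_MODULES = [
    "xmltodict",
    "pandas",
    "spacy",
    "scispacy",
    "en_core_sci_lg",
    "en_core_sci_scibert",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # cupy is checked as well because the CuPy wheel is installed separately.
    missing = missing_modules(REQUIRED_MODULES + ["cupy"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```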
Run perform_NER.py to perform named entity recognition and concept alignment on DrugBank text fields.
python perform_NER.py --kg-version 2.10.2
Optional arguments:
--db-host Database file host (default: arax-databases.rtx.ai)
--db-username Database file username (default: rtxconfig)
--db-port Database file port (default: 22)
--ssh-key Path to SSH private key (optional; uses SSH agent if omitted)
--ssh-password SSH password (optional; prefer key or agent; can also set SSH_PASSWORD env var)
--out-dir Output directory for downloaded database files (default: ./data)
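For reference, the arguments listed above could be declared with `argparse` roughly as shown below. This is a sketch built only from the documented flags and defaults. The real parser in perform_NER.py may be organized differently, and reading `SSH_PASSWORD` as the default for `--ssh-password` is an assumption about how the environment variable is consumed.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Sketch of the documented command-line interface."
    )
    parser.add_argument("--kg-version", required=True,
                        help="KG version, e.g. 2.10.2")
    parser.add_argument("--db-host", default="arax-databases.rtx.ai",
                        help="Database file host")
    parser.add_argument("--db-username", default="rtxconfig",
                        help="Database file username")
    parser.add_argument("--db-port", type=int, default=22,
                        help="Database file port")
    parser.add_argument("--ssh-key", default=None,
                        help="Path to SSH private key (SSH agent is used if omitted)")
    # Assumption: the SSH_PASSWORD environment variable acts as a fallback value.
    parser.add_argument("--ssh-password", default=os.environ.get("SSH_PASSWORD"),
                        help="SSH password (prefer a key or the agent)")
    parser.add_argument("--out-dir", default="./data",
                        help="Output directory for downloaded database files")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--kg-version", "2.10.2", "--ssh-key", "~/.ssh/id_rsa"])
print(args.db_host, args.db_port, args.out_dir)
```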
Example:
python perform_NER.py \
--kg-version 2.10.2 \
--out-dir ./data \
--ssh-key ~/.ssh/id_rsa
Run look_for_identifiers.py to extract, synonymize, and align DrugBank identifiers with RTX-KG2.
python look_for_identifiers.py --kg-version 2.10.2
This script supports the same optional connection and output arguments as perform_NER.py.
After successfully running both scripts, the final aligned output will be written to:
./data/DrugBank_aligned_with_KG2.json
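A quick way to sanity-check the result is to load the file and look at its top-level shape. The helper below assumes nothing about the schema beyond the file being valid JSON, so it reports only the top-level type and the number of entries; `summarize_output` is an illustrative name, not a function from this repository.

```python
import json
from pathlib import Path


def summarize_output(path):
    """Load a JSON file and return (top-level type name, entry count or None)."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Only lists and objects have a meaningful entry count.
    count = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, count


if __name__ == "__main__":
    output_path = Path("./data/DrugBank_aligned_with_KG2.json")
    if output_path.exists():
        kind, count = summarize_output(output_path)
        print(f"Top-level {kind} with {count} entries")
    else:
        print(f"{output_path} not found; run both scripts first.")
```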
- The --kg-version argument must follow the format X.Y.Z (e.g. 2.10.2)
- Ensure all downloaded database artifacts correspond to the specified KG version
- SSH key–based authentication is strongly recommended over password-based access
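The X.Y.Z format rule can be checked up front with a small helper like the one below. It assumes each component is a plain non-negative integer, which is an interpretation of the format rather than something the scripts are confirmed to enforce.

```python
import re

# Assumption: each of X, Y, and Z is one or more decimal digits.
KG_VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def is_valid_kg_version(value: str) -> bool:
    """Return True if value looks like X.Y.Z, e.g. 2.10.2."""
    return KG_VERSION_RE.fullmatch(value) is not None


print(is_valid_kg_version("2.10.2"))
print(is_valid_kg_version("2.10"))
```

Checking the value before any download starts avoids fetching a database for a mistyped version string.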