Description
Hi,
Thanks for creating modkit. However, I am having an issue.
When running modkit pileup (v0.6.1) on CRAM files in containerized environments (e.g., Nextflow/Singularity pipelines), modkit fails with the error "Error! should be at least 1 contig" because it cannot resolve the reference genome path embedded in the CRAM header. The --reference flag does not override the header reference path, so the run fails even when a valid reference is provided.
What Happens:
- CRAM files contain embedded reference paths in their headers (e.g., pointing to original file locations)
- When modkit processes these CRAMs through Nextflow, even with the --reference flag and the REF_PATH/REF_CACHE environment variables set, it still attempts to use the reference path from the CRAM header
- If that header reference path doesn't exist in the container/work directory, modkit fails to read contigs
Error message: Error! should be at least 1 contig
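To check which header reference paths cannot be resolved inside the container, something like the following may help. It is only a sketch: list_missing_ur is a hypothetical helper name, and it assumes the references are recorded as local paths (optionally with a file:// prefix) in the UR fields of the @SQ header lines. Remote URIs such as http:// would be reported as missing.

```shell
# Read a SAM/CRAM header on stdin and print every distinct @SQ UR path
# that does not exist on the current filesystem.
# list_missing_ur is a hypothetical helper name, not part of modkit or samtools.
list_missing_ur() {
  awk 'BEGIN { FS = "\t" }
    $1 == "@SQ" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^UR:/) print substr($i, 4)   # drop the "UR:" tag prefix
    }' |
  sort -u |
  while IFS= read -r ur; do
    path=${ur#file://}                         # assumption: local file URIs only
    [ -e "$path" ] || printf '%s\n' "$path"
  done
}

# Example usage inside the container (sample.cram is a placeholder name):
#   samtools view -H sample.cram | list_missing_ur
```

An empty result would suggest the header paths are reachable and the failure lies elsewhere.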
I was able to resolve the issue by rewriting the reference path in the CRAM header before running modkit (e.g., updating the @SQ UR fields to point to a reference path accessible inside the container).
However, this becomes cumbersome when working with many samples in automated pipelines (e.g., Nextflow).
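For reference, the header rewrite can be scripted roughly as follows. This is a minimal sketch rather than a recommended fix: rewrite_sq_ur is a hypothetical helper name, sample.cram and /refs/genome_hg38.fa are placeholder paths, and the usage lines assume samtools is available in the container.

```shell
# Read a SAM/CRAM header on stdin and point every @SQ UR field at a new
# reference path, leaving all other header lines and fields untouched.
# rewrite_sq_ur is a hypothetical helper name.
# Note: awk -v interprets backslash escapes, so this assumes the path
# contains no backslashes.
rewrite_sq_ur() {
  awk -v ref="$1" 'BEGIN { FS = OFS = "\t" }
    $1 == "@SQ" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^UR:/) $i = "UR:" ref
    }
    { print }'
}

# Example usage (placeholder file names):
#   samtools view -H sample.cram | rewrite_sq_ur /refs/genome_hg38.fa > new_header.sam
#   samtools reheader new_header.sam sample.cram > sample.reheadered.cram
#   samtools index sample.reheadered.cram
```

@SQ lines that have no UR field are passed through unchanged, so the sketch only replaces paths that are already present in the header.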
The modkit command I ran is:
modkit pileup \
sample_ID \
./output/ \
--reference genome_hg38.fa \
--modified-bases 5mC \
--phased \
--combine-strands --cpg \
--threads 12
This might also be an issue with HTSlib or samtools, but I was wondering whether there is a way in modkit to:
Force the use of the reference specified with --reference, or override/ignore the reference path embedded in the CRAM header?
If not, would this be something that could be supported in future releases?
Thank you in advance!