tools for reading, writing, generating, merging, and remapping SNPs 🧬
snps strives to be an easy-to-use and accessible open-source library for working with
genotype data.
- Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing sources with a SNPs object
- Read and write VCF files (e.g., convert 23andMe to VCF)
- Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
- Read data in a variety of formats (e.g., files, bytes, compressed with gzip or zip)
- Handle several variations of file types, historically validated using data from openSNP
- Generate synthetic genotype data for testing and examples
- Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
- Remap SNPs between builds / assemblies
- Perform quality control (QC) / filter low quality SNPs based on chip clusters
- Fix several common issues when loading SNPs
- Sort SNPs based on chromosome and position
- Deduplicate RSIDs
- Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
- Deduplicate alleles on MT
- Assign PAR SNPs to the X or Y chromosome
- Derive sex from SNPs
- Detect deduced genotype / chip array and chip version based on chip clusters
- Predict ancestry from SNPs (when installed with ezancestry)
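
As a rough illustration of two of the cleanup steps listed above (deduplicating RSIDs and sorting by chromosome and position), here is a minimal, self-contained sketch. It is not the snps implementation: the `clean` helper, the `CHROM_ORDER` table, the keep-the-first-occurrence rule, and the sample rsIDs are all assumptions made for this example.

```python
# Illustrative sketch only; not how snps implements these steps internally.
# Assumed chromosome ordering: autosomes 1-22, then X, Y, PAR, MT.
CHROM_ORDER = {str(n): n for n in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24, "PAR": 25, "MT": 26})


def clean(records):
    """Drop repeated rsIDs (keeping the first) and sort by chromosome, then position."""
    seen = set()
    unique = []
    for rsid, chrom, pos, genotype in records:
        if rsid in seen:
            continue  # assumption: a later duplicate of an rsID is discarded
        seen.add(rsid)
        unique.append((rsid, chrom, pos, genotype))
    # Unknown chromosome labels sort last (rank 99).
    return sorted(unique, key=lambda r: (CHROM_ORDER.get(r[1], 99), r[2]))


# Made-up records: (rsid, chromosome, position, genotype)
records = [
    ("rs3", "X", 500, "AG"),
    ("rs1", "10", 200, "CC"),
    ("rs2", "2", 900, "TT"),
    ("rs1", "10", 200, "CC"),  # duplicate rsID
    ("rs4", "2", 100, "GG"),
]
cleaned = clean(records)
print([r[0] for r in cleaned])  # ['rs4', 'rs2', 'rs1', 'rs3']
```

Sorting on a numeric rank rather than on the chromosome string keeps chromosome 2 ahead of chromosome 10, which a plain string sort would not do.
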
snps supports VCF files and
genotype files from the following DNA testing sources:
- 23andMe
- 23Mofang
- Ancestry
- CircleDNA
- Código 46
- DNA.Land
- Family Tree DNA
- Genes for Good
- LivingDNA
- Mapmygenome
- MyHeritage
- PLINK
- Sano Genetics
- SelfDecode
- tellmeGen
Additionally, snps can read a variety of "generic" CSV and TSV files.
snps requires Python 3.9+ and the following Python
packages:
snps is available on the
Python Package Index. Install snps (and its required
Python dependencies) via pip:
$ pip install snps
For ancestry prediction capability, snps can be installed with ezancestry:
$ pip install snps[ezancestry]
To try these examples, first generate some sample data:
>>> from snps.resources import Resources
>>> paths = Resources().create_example_datasets()
Load a raw data file exported from a DNA testing source (e.g., 23andMe, AncestryDNA, Family Tree DNA):
>>> from snps import SNPs
>>> s = SNPs("resources/sample1.23andme.txt.gz")
snps automatically detects the source format and normalizes the data:
>>> s.source
'23andMe'
>>> s.count
991767
>>> s.build
37
>>> s.assembly
'GRCh37'
The SNPs are available as a pandas.DataFrame:
>>> df = s.snps
>>> df.columns.tolist()
['chrom', 'pos', 'genotype']
>>> len(df)
991767
Combine SNPs from multiple files (e.g., combine data from different testing companies):
>>> results = s.merge([SNPs("resources/sample2.ftdna.csv.gz")])
>>> s.count
1006949
SNPs are compared during the merge. Position and genotype discrepancies are identified and
can be inspected via properties of the SNPs object:
>>> len(s.discrepant_merge_positions)
27
>>> len(s.discrepant_merge_genotypes)
156
Convert SNPs between genome assemblies (Build 36/NCBI36, Build 37/GRCh37, Build 38/GRCh38):
>>> chromosomes_remapped, chromosomes_not_remapped = s.remap(38)
>>> s.assembly
'GRCh38'
Save SNPs to common file formats:
>>> _ = s.to_tsv("output.txt")
>>> _ = s.to_csv("output.csv")
To save as VCF, snps automatically downloads the required reference sequences for the
assembly. This ensures the REF alleles in the VCF are accurate:
>>> _ = s.to_vcf("output.vcf") # doctest: +SKIP
All output files are saved to the output directory.
Generate synthetic genotype data for testing, examples, or demonstrations:
>>> from snps.io import SyntheticSNPGenerator
>>> gen = SyntheticSNPGenerator(build=37, seed=123)
>>> gen.save_as_23andme("synthetic_23andme.txt.gz", num_snps=10000)
'synthetic_23andme.txt.gz'
The generator supports multiple output formats (23andMe, AncestryDNA, FTDNA) and automatically injects build-specific marker SNPs to ensure accurate build detection.
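
To make the idea of build-specific marker SNPs concrete, the sketch below shows one way a build could be inferred from marker positions: each marker has a known position per build, and the build with the most matching positions wins. This is an assumption-laden illustration, not the snps detection code. The `MARKERS` table, its rsIDs and positions, and the `detect_build` helper are all hypothetical.

```python
# Illustrative sketch only; not the detection logic used by snps.
# Hypothetical marker table: rsid -> {build: position}. All values are made up.
MARKERS = {
    "rs_marker_a": {36: 1000, 37: 1100, 38: 1200},
    "rs_marker_b": {36: 5000, 37: 5050, 38: 5075},
}


def detect_build(positions):
    """Return the build whose marker positions match most often, or None."""
    votes = {}
    for rsid, by_build in MARKERS.items():
        observed = positions.get(rsid)  # None if the marker is absent
        for build, expected in by_build.items():
            if observed == expected:
                votes[build] = votes.get(build, 0) + 1
    if not votes:
        return None  # no marker matched any build
    return max(votes, key=votes.get)


# A sample whose marker positions match the (made-up) build 37 coordinates.
sample = {"rs_marker_a": 1100, "rs_marker_b": 5050, "rs_other": 42}
print(detect_build(sample))  # 37
```

Under this scheme, a synthetic file that includes markers at the coordinates of its target build gives the detector something unambiguous to match, which is presumably why the generator injects them.
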
Documentation is available here.
Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, Open Humans, and Sano Genetics. This project was historically validated using data from openSNP.
snps incorporates code and concepts generated with the assistance of various
generative AI tools (including but not limited to ChatGPT,
Grok, and Claude). ✨
snps is licensed under the BSD 3-Clause License.
