apriha/snps

snps

tools for reading, writing, generating, merging, and remapping SNPs 🧬

snps strives to be an easy-to-use and accessible open-source library for working with genotype data.

Features

Input / Output

  • Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing sources with a SNPs object
  • Read and write VCF files (e.g., convert 23andMe to VCF)
  • Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
  • Read data in a variety of formats (e.g., files, bytes, compressed with gzip or zip)
  • Handle several variations of file types, historically validated using data from openSNP
  • Generate synthetic genotype data for testing and examples
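As a rough illustration of what reading "bytes, compressed with gzip or zip" can involve, the sketch below inspects the leading magic bytes and decompresses accordingly. It uses only the standard library; the `decompress` helper is hypothetical and is not part of the snps API.

```python
import gzip
import io
import zipfile


def decompress(data: bytes) -> bytes:
    """Return the contents of plain, gzip-compressed, or zip-compressed bytes.

    Hypothetical helper for illustration; not how snps is implemented.
    """
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(data)
    if data[:4] == b"PK\x03\x04":  # zip local file header signature
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Assumption: the archive holds a single genotype file.
            return archive.read(archive.namelist()[0])
    return data  # treat anything else as uncompressed


raw = b"# rsid\tchromosome\tposition\tgenotype\nrs1\t1\t100\tAA\n"
print(decompress(gzip.compress(raw)) == raw)
```

Sniffing the content rather than trusting the file extension means the same code path can serve files on disk and in-memory bytes.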

Build / Assembly Detection and Remapping

  • Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
  • Remap SNPs between builds / assemblies
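One way to think about build detection: a SNP sits at a different coordinate in each assembly, so the position reported for a well-known marker hints at the build. The sketch below is an assumption about the general approach, not the library's implementation; `MARKERS` and `detect_build` are hypothetical names, and the rs3094315 positions should be confirmed against dbSNP before relying on them.

```python
from collections import Counter
from typing import Dict, Optional

# Chromosome 1 positions of rs3094315 per build. Illustrative values;
# confirm against dbSNP. A real table would hold several markers.
MARKERS = {"rs3094315": {36: 742429, 37: 752566, 38: 817186}}


def detect_build(positions: Dict[str, int]) -> Optional[int]:
    """Guess the build from observed marker positions (rsid -> position)."""
    votes = Counter()
    for rsid, by_build in MARKERS.items():
        observed = positions.get(rsid)
        for build, expected in by_build.items():
            if observed == expected:
                votes[build] += 1
    # Return the build with the most matching markers, if any matched.
    return votes.most_common(1)[0][0] if votes else None


print(detect_build({"rs3094315": 752566}))
```

Using several markers and taking a majority vote makes the guess robust to a single missing or mis-positioned SNP.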

Data Cleaning

  • Perform quality control (QC) / filter low quality SNPs based on chip clusters
  • Fix several common issues when loading SNPs
  • Sort SNPs based on chromosome and position
  • Deduplicate RSIDs
  • Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
  • Deduplicate alleles on MT
  • Assign PAR SNPs to the X or Y chromosome
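To make two of the cleaning steps above concrete, here is a minimal sketch of sorting by chromosome and position and of collapsing duplicated alleles on MT. The tuple layout and the helper names are assumptions for illustration; snps performs these steps on its own internal data structures.

```python
# Natural chromosome order: 1..22, then X, Y, MT.
CHROM_ORDER = {str(i): i for i in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24, "MT": 25})


def sort_key(snp):
    """Order by chromosome rank, then position; unknown chromosomes go last."""
    _rsid, chrom, pos, _genotype = snp
    return (CHROM_ORDER.get(chrom, 99), pos)


def dedupe_alleles(genotype):
    """Collapse a homozygous two-allele call such as 'AA' to 'A'."""
    if len(genotype) == 2 and genotype[0] == genotype[1]:
        return genotype[0]
    return genotype


# Hypothetical records: (rsid, chrom, pos, genotype)
snps = [("rs3", "MT", 10, "AA"), ("rs2", "10", 5, "AG"), ("rs1", "2", 7, "CC")]
ordered = sorted(snps, key=sort_key)
cleaned = [
    (rsid, chrom, pos, dedupe_alleles(gt) if chrom == "MT" else gt)
    for rsid, chrom, pos, gt in ordered
]
print(cleaned)
```

A plain string sort would place chromosome "10" before "2", which is why the sketch maps chromosome names to numeric ranks first.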

Analysis

  • Derive sex from SNPs
  • Detect deduced genotype / chip array and chip version based on chip clusters
  • Predict ancestry from SNPs (when installed with ezancestry)
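A common heuristic for deriving sex from genotype data is to look at heterozygosity on the X chromosome: males carry one X, so their non-PAR X calls should almost never be heterozygous. The sketch below illustrates that idea only; the `derive_sex` function and its `threshold` default are assumptions, not the snps API.

```python
def derive_sex(x_genotypes, threshold=0.03):
    """Classify sex from X-chromosome genotype calls.

    x_genotypes: iterable of two-letter calls, e.g. "AG"; no-calls use "-".
    threshold: assumed cutoff for the heterozygous fraction.
    """
    # Keep only complete two-allele calls.
    called = [g for g in x_genotypes if len(g) == 2 and "-" not in g]
    if not called:
        return ""  # not enough data to decide
    heterozygous = sum(1 for g in called if g[0] != g[1])
    return "Female" if heterozygous / len(called) > threshold else "Male"


print(derive_sex(["AA", "AG", "CT", "GG"]))
```

A small nonzero threshold tolerates occasional genotyping errors that would otherwise make a male sample look heterozygous.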

Supported Genotype Files

snps supports VCF files and genotype files from the following DNA testing sources:

Additionally, snps can read a variety of "generic" CSV and TSV files.

Dependencies

snps requires Python 3.9+ and the following Python packages:

Installation

snps is available on the Python Package Index. Install snps (and its required Python dependencies) via pip:

$ pip install snps

For ancestry prediction capability, snps can be installed with ezancestry:

$ pip install snps[ezancestry]

Examples

To try these examples, first generate some sample data:

>>> from snps.resources import Resources
>>> paths = Resources().create_example_datasets()

Load a Raw Data File

Load a raw data file exported from a DNA testing source (e.g., 23andMe, AncestryDNA, Family Tree DNA):

>>> from snps import SNPs
>>> s = SNPs("resources/sample1.23andme.txt.gz")

snps automatically detects the source format and normalizes the data:

>>> s.source
'23andMe'
>>> s.count
991767
>>> s.build
37
>>> s.assembly
'GRCh37'

The SNPs are available as a pandas.DataFrame:

>>> df = s.snps
>>> df.columns.tolist()
['chrom', 'pos', 'genotype']
>>> len(df)
991767

Merge Raw Data Files

Combine SNPs from multiple files (e.g., combine data from different testing companies):

>>> results = s.merge([SNPs("resources/sample2.ftdna.csv.gz")])
>>> s.count
1006949

SNPs are compared during the merge. Position and genotype discrepancies are identified and can be inspected via properties of the SNPs object:

>>> len(s.discrepant_merge_positions)
27
>>> len(s.discrepant_merge_genotypes)
156
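Conceptually, the comparison during a merge can be sketched with plain dictionaries: SNPs present in both datasets are checked for a matching position and a matching genotype, and new SNPs are added. This is a simplified illustration under assumed data structures; `merge_snps` is a hypothetical function, and it does not reproduce how snps resolves discrepancies.

```python
def merge_snps(base, other):
    """Merge `other` into `base`; both map rsid -> (chrom, pos, genotype).

    Returns (discrepant_positions, discrepant_genotypes) as lists of rsids.
    """
    discrepant_positions, discrepant_genotypes = [], []
    for rsid, (chrom, pos, genotype) in other.items():
        if rsid not in base:
            base[rsid] = (chrom, pos, genotype)  # new SNP: just add it
            continue
        base_chrom, base_pos, base_genotype = base[rsid]
        if (base_chrom, base_pos) != (chrom, pos):
            discrepant_positions.append(rsid)
        elif sorted(base_genotype) != sorted(genotype):
            # Allele order is ignored, so "AG" and "GA" agree.
            discrepant_genotypes.append(rsid)
    return discrepant_positions, discrepant_genotypes


base = {"rs1": ("1", 100, "AG"), "rs2": ("1", 200, "CC"), "rs3": ("2", 50, "TT")}
other = {"rs1": ("1", 100, "GA"), "rs2": ("1", 200, "CT"),
         "rs3": ("2", 51, "TT"), "rs4": ("3", 10, "AA")}
positions, genotypes = merge_snps(base, other)
print(positions, genotypes, len(base))
```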

Remap SNPs

Convert SNPs between genome assemblies (Build 36/NCBI36, Build 37/GRCh37, Build 38/GRCh38):

>>> chromosomes_remapped, chromosomes_not_remapped = s.remap(38)
>>> s.assembly
'GRCh38'

Save SNPs

Save SNPs to common file formats:

>>> _ = s.to_tsv("output.txt")
>>> _ = s.to_csv("output.csv")

To save as VCF, snps automatically downloads the required reference sequences for the assembly. This ensures the REF alleles in the VCF are accurate:

>>> _ = s.to_vcf("output.vcf")  # doctest: +SKIP

All output files are saved to the output directory.

Generate Synthetic Data

Generate synthetic genotype data for testing, examples, or demonstrations:

>>> from snps.io import SyntheticSNPGenerator
>>> gen = SyntheticSNPGenerator(build=37, seed=123)
>>> gen.save_as_23andme("synthetic_23andme.txt.gz", num_snps=10000)
'synthetic_23andme.txt.gz'

The generator supports multiple output formats (23andMe, AncestryDNA, FTDNA) and automatically injects build-specific marker SNPs to ensure accurate build detection.
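The role of the seed can be illustrated with a standard-library sketch: a seeded random number generator yields the same synthetic rows on every run, which keeps tests reproducible. The `synthetic_rows` function and its tab-separated row layout are assumptions for illustration and are unrelated to the internals of SyntheticSNPGenerator.

```python
import random


def synthetic_rows(num_snps, seed):
    """Generate tab-separated rows: rsid, chromosome, position, genotype."""
    rng = random.Random(seed)  # private, seeded generator for reproducibility
    rows = []
    pos = 0
    for i in range(1, num_snps + 1):
        pos += rng.randint(1, 1000)  # strictly increasing positions
        genotype = rng.choice("ACGT") + rng.choice("ACGT")
        rows.append(f"rs{i}\t1\t{pos}\t{genotype}")
    return rows


rows = synthetic_rows(5, seed=123)
print("\n".join(rows))
```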

Documentation

Documentation is available here.

Acknowledgements

Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, Open Humans, and Sano Genetics. This project was historically validated using data from openSNP.

snps incorporates code and concepts generated with the assistance of various generative AI tools (including but not limited to ChatGPT, Grok, and Claude). ✨

License

snps is licensed under the BSD 3-Clause License.
