Seqsum

Robust checksums for nucleotide sequences. Accepts data from either standard input or fast[a|q][.gz|.zst|.xz|.bz2] files. Generates individual checksums for each sequence, and an aggregate checksum-of-checksums for a collection of sequences. Warnings are shown for duplicate sequences and within-collection checksum collisions at the chosen bit depth. Sequences are uppercased prior to hashing with xxHash (xxh3_128) and may be normalised (with -n) to use only the characters ACGTN-. Read IDs and FASTQ base quality scores do not inform the checksum. Outputs tab delimited text or JSON to stdout.

A typical use case is determining whether reordered, renamed or otherwise bit-inexact fasta/fastq files have equivalent sequence composition. Another use is generating the shortest possible collision-free identifiers for sequence collections.

By default, seqsum outputs both individual and aggregate checksums when supplied with more than one sequence. This can be modified with the flags --individual (-i) or --aggregate (-a).

Install

Installation inside a clean Python 3.10+ virtualenv (or conda environment) is recommended. To open .zst archives, you will also need to pip install zstandard.

pip install seqsum

Development install

git clone https://github.com/bede/seqsum.git
cd seqsum
pip install --editable '.[dev]'
pytest

Command line usage

# Fasta with one record
$ seqsum nt MN908947.fasta
MN908947.3	ca5e95436b957f93

# Fasta with two records
$ seqsum nt MN908947-BA_2_86_1.fasta
MN908947.3	ca5e95436b957f93
BA.2.86.1		d5f014ee6745cb77
aggregate	837cfd6836b9a406

# Fasta with two records, only show the aggregate checksum
$ seqsum nt -a MN908947-BA_2_86_1.fasta
aggregate	837cfd6836b9a406

# Fasta via stdin
% cat MN908947.fasta | seqsum nt -
MN908947.3	ca5e95436b957f93

# Fastq (gzipped) with 1m records, redirected to file, with progress bar
$ seqsum nt illumina.r12.fastq.gz --progress > checksums.tsv
Processed 1000000 records (317kit/s)
INFO: Found duplicate sequences

Built-in help

$ seqsum nt -h                       
usage: seqsum nt [-h] [-n] [-s] [-b BITS] [-i] [-a] [-j] [-p] input

Robust individual and aggregate checksums for nucleotide sequences. Accepts input
from either stdin or fast[a|q][.gz|.zst|.xz|.bz2] files. Generates individual
checksums for each sequence, and an aggregate checksum-of-checksums for a
collection of sequences. Warnings are shown for duplicate sequences and
within-collection checksum collisions at the chosen bit depth. Sequences are
uppercased prior to hashing with xxHash and may optionally be normalised to use only
the characters ACGTN-. Read IDs and base quality scores do not inform the checksum

positional arguments:
  input                 path to fasta/q file (or - for stdin)

options:
  -h, --help            show this help message and exit
  -n, --normalise       replace U with T and characters other than ACGT- with N
                        (default: False)
  -s, --strict          raise error for characters other than ABCDGHKMNRSTVWY-
                        (default: False)
  -b BITS, --bits BITS  displayed checksum length
                        (default: 64)
  -i, --individual      output only individual checksums
                        (default: False)
  -a, --aggregate       output only aggregate checksum
                        (default: False)
  -j, --json            output JSON
                        (default: False)
  -p, --progress        show progress and speed
                        (default: False)

Python usage

from seqsum import lib

checksums, aggregate_checksum = lib.sum_nt(">read1\nACGT")
print(checksums)
# {'read1': '81db282b97e7dfd1'}

from pathlib import Path
from seqsum import lib

fasta_path = Path("tests/data/MN908947-BA_2_86_1.fasta")
checksums, aggregate_checksum = lib.sum_nt(fasta_path)
print(checksums)
print(aggregate_checksum)
# {'MN908947.3': 'ca5e95436b957f93', 'BA.2.86.1': 'd5f014ee6745cb77'}
# 837cfd6836b9a406

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.github/workflows		.github/workflows
src/seqsum		src/seqsum
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Seqsum

Install

Command line usage

Python usage

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

bede/seqsum

Folders and files

Latest commit

History

Repository files navigation

Seqsum

Install

Command line usage

Python usage

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages