MLgsc - Maximum-likelihood general sequence classifier

Summary

MLgsc is a set of programs for classifying sequences into taxa (in other words, recognizing taxa from sequences).

Compared to other programs, MLgsc

can work on any conserved region (in particular, not just 16S rRNA)
can work on any set of taxa (not just prokaryotes)
can work on nucleic acid and protein sequences
has a simple command-line interface
has no dependencies to other packages (except possibly if you install from source)

A technical description of MLgsc can be found in the article.

Example

Here is a short example of MLgsc in action, in which we classify the sequences in file data/queries.pep (which is just a Fasta file) according to a classifier for endospore-forming Firmicutes based on the protein sequence of the Spo0A gene (file data/Spo0A.bcls):

$ mlgsc data/queries.pep data/Spo0A.bcls | head

ID_001 Bacillus -> Bacilli (56); Bacillaceae (97); Bacillus (87)
ID_002 Clostridium -> Clostridia (106); Clostridiales (138); Clostridium (83)
ID_003 Clostridium -> Clostridia (106); Clostridiales (138); Clostridium (83)
ID_004 Bacillus -> Bacilli (79); Bacillaceae (95); Bacillus (99)
ID_005 Bacillus -> Bacilli (84); Bacillaceae (89); Bacillus (104)
ID_006 Bacillus -> Bacilli (84); Bacillaceae (89); Bacillus (104)
ID_007 Exiguobacterium -> Bacilli (80); Exiguobacterium (279)
ID_008 Clostridium -> Clostridia (131); Clostridiales (200); Clostridium (179)
ID_009 Sporomusa -> Clostridia (74); Sporomusa (30)
ID_010 Clostridium -> Clostridia (170); Clostridiales (208); Clostridium (168)

The output has one line per query sequence. Each line contains the query's name (query_001, etc.), followed by an arrow, and finally by mlgsc's classification of the query. The classification lists the taxa the query is inferred to belong to, in order of increasing specificity; and each taxon name is followed by a measure of confidence (the higher the better) in parentheses.

For instance, query_001 is inferred to belong to genus Bacillus, itself within family Bacillaceae, which in turn belongs to class Bacilli. In this example, confidence values above 10 are fairly reliable (for more about confidence measures, see below), so all queries are confidently classified at all levels.

Downloading and Installing

The package is available for download from GitHub. You can either download it from the GitHub website using the "Download ZIP" button, or from the shell by cding to where you want to download, and doing:

$ git clone https://github.com/tjunier/mlgsc.git
$ cd mlgsc

The distribution contains both binaries (Linux x86-64) and source code.

Binaries

The package contains the following binaries (a.k.a "executables"):

mlgsc, the classifier program itself
mlgsc_train, used to train models for use with mlgsc
mlgsc_xval, used to validate models
mlgsc_dump, a helper program that prints information about models

The binaries are found in subdirectory src of the distribution. They can be installed by issuing:

$ cd src
$ sudo make install

at the top-level directory of the distribution. This will install them in /usr/local/bin.

That's it. You can jump to the tutorial.

Source

Compiling from source isn't extremely complicated. Basically, it involves:

Making sure you have a (recent) Haskell Compiler
(possibly) Installing additional libraries
Building

Installing Haskell

MLgsc is written in the Haskell language. You will need a Haskell compiler and libraries; the easiest way of obtaining this is to install the Haskell platform on your machine - it is available as a precompiled package for several major Linux distribution including Ubuntu, Fedora, Mint, etc.

I use the Glasgow Haskell Compiler version 7.10.2, you may still be able to compile with a slightly older version. To see which version you have, do

$ ghc --version

The Glorious Glasgow Haskell Compilation System, version 7.10.2

Additional Libraries

Libraries not included in the Haskell platform may be installed with the cabal tool (which comes with the platform). The following command will install all additional packages (you may need to run this as root):

$ cabal install vector containers text binary text-binary array \
  parsec learning-hmm random-fu tuple random HUnit process directory \
  filepath MissingH vector-binary

Building

Issue the following commands in the top-level directory of the source distribution:

$ runhaskell Setup.hs configure # may need --user
$ cabal build
$ sudo cabal install --global

Or, if you have GNU make, you can also just do

$ make
$ sudo make install

This will install the programs in /usr/local/bin.

Note: I'm still trying to figure out how to package the thing for use with Cabal. Until then, I find it easier to use good old Make.

Tutorial

This section will walk you through a complete example of using MLgsc to build a model and use it to classify sequences.

File data/unknown.pep contains sequences of the Spo0A protein, the master regulator of sporulation in some bacteria. It looks like this:

$ head data/unknown.pep

>query_001
IMPHLDGLAVLERLRESQLKKQPNVIMLTAFGQEDVTKKAVDLGASYFILKPFDMENLVG
HIRQVSGNGSQLTHRAPSSQSSVLRPQPESPKKNLDASITTIIHEIGVPAHIKGYLYLRE
AISMVYNDIELLGSITKVLYPDIAKKFNTTASRVERAIRHAIEVAWSRGNIDSISSLFGY
TVSMSKAKPTNSEFIAMVAD
>query_002
IMPHLDGLGVLEKIGATAISKRPLFIILSAVGQDKITQRALALGAEYYVVKPFDMEVLIS
RIRQLKNVNQPNVIRQDGLSGEVKSSYHPPQPKNLEAEVTNIMHEIGVPAHIKGYQYLRD
AIIMVVKDLDVINSITKQLYPTIAKEYNTTPSRVERAIRHAIEVAWSRGQIDTIDSLFGY
TINVGKGKPTNSEFIAMVAD

Our task is to determine which genus each sequence belongs to. For this, we need reference sequences, that is, Spo0A sequences from bacteria of known genus. Spo0A is only found in Firmicutes, a clade of bacteria that includes well-known genera like Bacillus and Clostridium. Here are the first few line of a file with aligned Spo0A sequences from 38 genera of Firmicutes:

$ head data/reference.msa

>ID_001 Bacillus
IMPHLDGLAVLERLRE-SQLKK-QPN-VIMLTAFGQEDVTKKAVDLGASYFILKPFDMEN
LVGHIRQVSGNGSQL--THRAPS---SQSS------------VLR-PQPES------PKK
NLDASITTIIHEIGVPAHIKGYLYLREAISMVYNDIELLGSITKVLYPDIAKKFNTTASR
VERAIRHAIEVAWSRGNIDSISSLFGYTVSMSKAKPTNSEFIAMVAD
>ID_002 Clostridium
IMPHLDGLGVLEKIGA-TAISK-RPL-FIILSAVGQDKITQRALALGAEYYVVKPFDMEV
LISRIRQLKNVNQPN--VIRQ----------------DGLSGEVKSSYHPP------QPK
NLEAEVTNIMHEIGVPAHIKGYQYLRDAIIMVVKDLDVINSITKQLYPTIAKEYNTTPSR
VERAIRHAIEVAWSRGQIDTIDSLFGYTINVGKGKPTNSEFIAMVAD

Notice that the headers of each Fasta entry contains, after the ID, the name of the genus the sequence belongs to: Bacillus for ID_001, Clostridium for ID_002, and so on. These genera are our target taxa: given a sequence from one of these genera, we expect our classifier to identify the genus it belongs to - even if the sequence was not among the reference sequences.

MLgsc also needs a phylogeny of the reference taxa. This is mostly to speed up the computation, the details are in the article.

One way of supplying this information is with a taxonomy file, such as this one (another is with a phylogenetic tree, see in the Reference part below):

$ head data/firmicutes.taxo

Firmicutes; Clostridia; Sporomusa
Firmicutes; Clostridia; Clostridiales; Eubacteriaceae; Anaerofustis
Firmicutes; Clostridia; Clostridiales; Eubacteriaceae; Eubacterium
Firmicutes; Clostridia; Clostridiales; Eubacteriaceae
Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; Anaerostipes
Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; Dorea
Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; Marvinbryantia
Firmicutes; Clostridia; Clostridiales; Lachnospiraceae
Firmicutes; Clostridia; Clostridiales; Clostridium
Firmicutes; Clostridia; Clostridiales; Peptococcaceae; Dehalobacter

This file simply states that, for example, genus Anaerofustis (line 2) belongs to family Eubacteriaceae, which belongs to order Clostridiales, which is in class Clostridia, which is itself a subclade of phylum Firmicutes.

Training

We can now train a model for Spo0A in our target taxa.

$ mlgsc_train -T t Prot data/reference.msa data/firmicutes.taxo

The output of this command is our Spo0A model for Firmicutes, contained in file data/reference.bcls. We can now use it to classify our unknown sequences:

Classifying

$ mlgsc data/unknown.pep data/reference.bcls | head

query_001 -> Firmicutes (*); Bacilli (56); Bacillaceae (97); Bacillus (87)
query_002 -> Firmicutes (*); Clostridia (106); Clostridiales (138); Clostridium (83)
query_003 -> Firmicutes (*); Clostridia (106); Clostridiales (138); Clostridium (83)
query_004 -> Firmicutes (*); Bacilli (80); Bacillaceae (95); Bacillus (99)
query_005 -> Firmicutes (*); Bacilli (84); Bacillaceae (89); Bacillus (104)
query_006 -> Firmicutes (*); Bacilli (84); Bacillaceae (89); Bacillus (104)
query_007 -> Firmicutes (*); Bacilli (79); Exiguobacterium (281)
query_008 -> Firmicutes (*); Clostridia (131); Clostridiales (200); Clostridium (179)
query_009 -> Firmicutes (*); Clostridia (73); Sporomusa (30)
query_010 -> Firmicutes (*); Clostridia (170); Clostridiales (208); Clostridium (168)

This means, for example, that query_001 is inferred by MLgsc to belong to the Firmicutes, and within these, to the Bacilli; within the Bacilli, to the Bacillaceae, and finally to genus Bacillus.

The values between parentheses are measures of confidence. Higher values mean higher confidence, and values above 1000 are shown as an asterisk, indicating virtual certainty. In this case, values above 10 are already quite reliable (we show below how we determine this). This means that, unsurprisingly, all the queries shown above are classified to genus level with high confidence.

As a check, let's classify a set of sequences whose genus is known:

$ mlgsc -f "%h -> %p"  data/queries.pep data/reference.bcls | head

ID_001 Bacillus -> Firmicutes (*); Bacilli (56); Bacillaceae (97); Bacillus (87)
ID_002 Clostridium -> Firmicutes (*); Clostridia (106); Clostridiales (138); Clostridium (83)
ID_003 Clostridium -> Firmicutes (*); Clostridia (106); Clostridiales (138); Clostridium (83)
ID_004 Bacillus -> Firmicutes (*); Bacilli (80); Bacillaceae (95); Bacillus (99)
ID_005 Bacillus -> Firmicutes (*); Bacilli (84); Bacillaceae (89); Bacillus (104)
ID_006 Bacillus -> Firmicutes (*); Bacilli (84); Bacillaceae (89); Bacillus (104)
ID_007 Exiguobacterium -> Firmicutes (*); Bacilli (79); Exiguobacterium (281)
ID_008 Clostridium -> Firmicutes (*); Clostridia (131); Clostridiales (200); Clostridium (179)
ID_009 Sporomusa -> Firmicutes (*); Clostridia (73); Sporomusa (30)
ID_010 Clostridium -> Firmicutes (*); Clostridia (170); Clostridiales (208); Clostridium (168)

I pass option -f "%h -> %p, which causes the full header (%h) to be printed out, instead of only the ID.

We see that the classifications are correct. But this check is not very strong, because the test sequences were all part of the training set. MLgsc comes with a program (mlgsc_xval) that does cross-validation with disjoint training and test sets. This is discussed just below.

Validation

mlgsc_xval performs leave-one-out (LOO) cross-validation: it randomly draws one sequence (referred to as the test sequence) from the reference set and builds the model using all the remaining sequences (called the training set). Then it classifies the test sequence agaisnt the model. By repeating the procedure, one obtains an estimate of MLgsc's error rate. Let's see how this works:

$ mlgsc_xval -x -T t Prot data/reference.msa data/firmicutes.taxo
ID_353 Anaerostipes (196) -> Firmicutes (*); Clostridia (102); Clostridiales (182); Clostridium (12)
ID_339 Clostridium (212) -> Firmicutes (*); Clostridia (108); Clostridiales (172); Lachnospiraceae (18); Anaerostipes (125)

This performs a hundred runs of LOO, using the same input data as for training the model above. Since we're mostly interested in the failure rate, I pass option -x, which causes mlgsc_xval to only display failures. Here there are two lines, which means that two out of a hundred runs failed, that is, the estimated error rate is 2%.

Name		Name	Last commit message	Last commit date
Latest commit History 535 Commits
data		data
img		img
man		man
src		src
test		test
.gitignore		.gitignore
BUGS.md		BUGS.md
LICENSE		LICENSE
MLGSC.cabal		MLGSC.cabal
Makefile		Makefile
README-long.md		README-long.md
README.esmd		README.esmd
README.md		README.md
Setup.hs		Setup.hs
TODO.md		TODO.md
cabal.config		cabal.config
stack.yaml		stack.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MLgsc - Maximum-likelihood general sequence classifier

Summary

Example

Downloading and Installing

Binaries

Source

Installing Haskell

Additional Libraries

Building

Tutorial

Training

Classifying

Validation

About

Uh oh!

Releases

Packages

Languages

License

tjunier/mlgsc

Folders and files

Latest commit

History

Repository files navigation

MLgsc - Maximum-likelihood general sequence classifier

Summary

Example

Downloading and Installing

Binaries

Source

Installing Haskell

Additional Libraries

Building

Tutorial

Training

Classifying

Validation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages