ClaMS
ClaMS is a sequence composition-based classifier for metagenomic sequences.
ClaMS works by capturing signatures of each sequence based on the
sequence composition. Each sequence is modeled as a walk in a de
Bruijn graph with underlying Markov chain properties. ClaMS
captures stationary parameters of the underlying Markov chain as well
as structural parameters of the underlying de Bruijn graph to
form this signature. In practice, for each sequence to be binned, such a
signature is computed and matched against the signatures computed for
the training sets. The best match wins, provided that it also satisfies the
normalized distance cut-off. If the best match does not satisfy
this cut-off, the sequence remains un-binned.
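The matching rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the ClaMS implementation: the names `assign_bin` and `toy_distance`, the two-element toy signatures, and the choice to treat a distance equal to the cut-off as passing are all assumptions made for the example.

```python
def toy_distance(a, b):
    """Hypothetical stand-in for a normalized distance between signatures."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def assign_bin(query_sig, training_sigs, distance, cutoff):
    """Return the label of the closest training signature.

    Returns None (sequence stays un-binned) when even the best match
    is farther away than the cut-off.
    """
    best_label, best_dist = None, float("inf")
    for label, sig in training_sigs.items():
        d = distance(query_sig, sig)
        if d < best_dist:
            best_label, best_dist = label, d
    # Assumption: a distance exactly equal to the cut-off still qualifies.
    if best_label is not None and best_dist <= cutoff:
        return best_label
    return None


# Toy training signatures; real signatures are much longer vectors.
training = {"setA": [0.0, 1.0], "setB": [1.0, 0.0]}
close_result = assign_bin([0.1, 0.9], training, toy_distance, cutoff=0.2)
far_result = assign_bin([0.5, 0.5], training, toy_distance, cutoff=0.2)
```

In this sketch the first query is assigned to `setA`, while the second is equally far from both training sets and beyond the cut-off, so it is left un-binned.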
GUI version of ClaMS
The screenshots below demonstrate the workings of the GUI version of ClaMS.
Figure 1. Screenshot of the GUI-based implementation of ClaMS.
Launching ClaMS
The steps below describe how to use the ClaMS-UI application.
ClaMS-CLI can be launched in the same manner.
Adding a training set
ClaMS can be trained either by using isolate genome sequences
(whose signatures are pre-packaged with the program)
or by using user-specified training sequences.
The top right panel facilitates the addition
of training datasets.
Isolate-genome training sets can be specified at any taxonomic level
by selecting a node in the taxonomy under the "Existing taxonomy" tab
(Figures 2 and 3).
Figure 2. Selecting a training set from the existing taxonomy.
Figure 3. Selecting Burkholderiales as a training set from the existing taxonomy.
All training sets selected in any manner appear as icons under the
"User-defined training sets" tab (Figure 4).
Figure 4. Icons for training sets.
A user can also upload a training set as a FASTA or multi-FASTA file
by clicking the "Add training set" button under the
"User-defined training sets" tab and selecting a file in the filesystem
(Figures 5 and 6).
ClaMS works best with training sequences that are 1000 bases or longer.
For accurate binning, sequences need to be at least 500 bases long.
Figure 5. Uploading a user-defined training set.
Figure 6. Uploading a user-defined training set.
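Before uploading a training set or a file to bin, it can help to check sequence lengths against the guidance above. The following is a minimal sketch and is not part of ClaMS: `read_fasta`, `filter_by_length`, and the two constant names are hypothetical helpers, and only the 1000-base and 500-base figures come from the text.

```python
MIN_TRAINING_LENGTH = 1000  # training sequences work best at 1000 bases or longer
MIN_BINNING_LENGTH = 500    # sequences to bin should be at least 500 bases long


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep only records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Tiny in-memory example; a real file would be opened and passed line by line.
example = [">long", "ACGT" * 200, "ACGT" * 100, ">short", "ACGTACGT"]
records = list(read_fasta(example))
trainable = filter_by_length(records, MIN_TRAINING_LENGTH)
binnable = filter_by_length(records, MIN_BINNING_LENGTH)
```

Here only the 1200-base record passes both thresholds; the 8-base record is dropped in each case.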
Setting input parameters
The bottom right panel
facilitates the adjustment of parameters such as normalized distance cut-off,
type of signature, and word length.
Type of signature: Currently, ClaMS supports two kinds of genomic
signatures, DBC and DOR.
The DBC (de Bruijn chain) signature is described in:
Lenwood S. Heath, Amrita Pati, "Genomic signatures in de Bruijn
chains". WABI 2007, LNBI 4645, pp. 216-227, 2007.
This signature can be computed for any k-mer length. Currently, ClaMS supports
the computation of the DBC signature for k-mer lengths 2-4.
A DBC genomic signature of order k utilizes properties of the underlying
de Bruijn chain of order k and can be represented as a 2*4^k-long vector
of real numbers.
In ClaMS, DBC signatures are compared by computing the Pearson distance
between them.
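To make the vector length and the comparison step concrete, here is a small sketch. It does not compute a DBC signature; it only shows the 2*4^k length for the supported k values and one common definition of Pearson distance, namely 1 minus the Pearson correlation coefficient. That definition is an assumption, since the text does not spell out the formula ClaMS uses, and the function names are hypothetical.

```python
import math


def dbc_signature_length(k):
    """Length of a DBC signature vector of order k: 2 * 4^k."""
    if k not in (2, 3, 4):
        raise ValueError("ClaMS supports DBC signatures for k = 2, 3, or 4")
    return 2 * 4 ** k


def pearson_distance(x, y):
    """Assumed definition: 1 - Pearson correlation of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - cov / math.sqrt(var_x * var_y)


lengths = {k: dbc_signature_length(k) for k in (2, 3, 4)}
same_direction = pearson_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_direction = pearson_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Under this definition the distance ranges from 0 for perfectly correlated vectors to 2 for perfectly anti-correlated ones, and the signature lengths for k = 2, 3, and 4 are 32, 128, and 512.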
Distance cutoff: A slider sets the distance cutoff
when DBC signatures are used. When DOR signatures are used,
an automatic cutoff of 65 is applied and the slider is disabled.
k-mer size: DBC signatures can be used at a k-mer scale of 2, 3, or 4.
Visualizing results
Figure 7. Results grid of ClaMS binning with three training sets.
Figure 8. Results pie-chart of ClaMS binning with three training sets.
Running the command line version of ClaMS
To run a single round of binning, use the command-line version of ClaMS as follows:
USAGE: java -Xmx<custom memory for JVM heap> -jar ClaMS-CLI.jar
<Tab delimited file of training sequence files>
<Fasta file to bin>
<Output file>
<Signtype DBC/DOR>
<K-mer length 2/3/4>
<Confidence cutoff 0.01 recommended>
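As an illustration of the argument order in the usage line above, the following sketch assembles the command as a list suitable for Python's `subprocess.run`. The function name `build_clams_command`, the file names, the `2g` heap size, and the default k-mer length of 3 are assumptions chosen for the example; only the argument order, the DBC/DOR and 2/3/4 choices, and the recommended 0.01 cutoff come from the usage text.

```python
def build_clams_command(training_list, fasta_to_bin, output_file,
                        sign_type="DBC", kmer_length=3, cutoff=0.01,
                        heap="2g", jar="ClaMS-CLI.jar"):
    """Assemble the ClaMS-CLI argument list in the order given by the usage line."""
    if sign_type not in ("DBC", "DOR"):
        raise ValueError("signature type must be DBC or DOR")
    if kmer_length not in (2, 3, 4):
        raise ValueError("k-mer length must be 2, 3, or 4")
    return [
        "java", f"-Xmx{heap}", "-jar", jar,
        training_list,     # tab-delimited file of training sequence files
        fasta_to_bin,      # FASTA file to bin
        output_file,       # output file
        sign_type,
        str(kmer_length),
        str(cutoff),
    ]


# Hypothetical file names, for illustration only.
cmd = build_clams_command("training_sets.txt", "reads.fasta", "bins.out")
# To actually launch it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.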
iClaMS
A related development is iterative ClaMS (iClaMS).
iClaMS consists of a Perl wrapper script
that separates a sequence set into bins by training
on the sequence set itself.
Directions for running both ClaMS and iClaMS are included
in the README in the tar archive.
The current version of iClaMS trains only on
sequences in the dataset that are longer than 10 kb.
In future versions, this threshold will be adjustable to suit your requirements.
It is best to trust bins only for sequences longer than 400-500 bp.
Availability and Download
Download
Questions/Comments:
Version 0.1 Dec 2010 ©2010 Metagenome Program, JGI The Regents of the University of California. The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.