ClaMS is a sequence composition-based classifier for metagenomic sequences. It works by capturing a signature of each sequence based on its composition. Each sequence is modeled as a walk in a de Bruijn graph with underlying Markov chain properties. ClaMS captures stationary parameters of the underlying Markov chain as well as structural parameters of the underlying de Bruijn graph to form this signature. In practice, a signature is computed for each sequence to be binned and matched against the signatures computed for the training sets. The best match wins, provided it also satisfies the normalized distance cut-off. If the best match does not satisfy this cut-off, the sequence remains un-binned.
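The matching rule described above (best match wins, but only if it passes the cut-off) can be sketched as follows. This is a minimal illustration, not the ClaMS implementation: the function names `assign_bin` and `euclidean` are hypothetical, and plain Euclidean distance stands in for whatever normalized distance ClaMS actually computes.

```python
from math import sqrt


def euclidean(a, b):
    """Placeholder distance (assumption); ClaMS uses its own normalized distance."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_bin(query_sig, training_sigs, distance, cutoff):
    """Return the name of the closest training set, or None if un-binned.

    training_sigs maps a training-set name to its signature vector.
    """
    best_name, best_dist = None, float("inf")
    for name, sig in training_sigs.items():
        d = distance(query_sig, sig)
        if d < best_dist:
            best_name, best_dist = name, d
    # The best match only wins if it also passes the distance cut-off.
    if best_dist <= cutoff:
        return best_name
    return None
```

With two toy training signatures, a query close to one of them is assigned to it, while a query that is far from both stays un-binned (the function returns `None`).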
GUI version of ClaMS
The screenshots below demonstrate the workings of the UI version of ClaMS.
Figure 1. Screenshot of the GUI-based implementation of ClaMS.
The steps for using the ClaMS-UI application are as follows.
Adding a training set
ClaMS can be trained either by using isolate genome sequences (whose signatures are pre-packaged with the program) or by using user-specified training sequences. The top right panel facilitates the addition of training datasets. Isolate-genome training sets can be specified at any taxonomic level by selecting a node in the taxonomy under the "Existing taxonomy" tab (Figures 2 and 3).
Figure 2. Selecting a training set from the existing taxonomy.
Figure 3. Selecting Burkholderiales as a training set from the existing taxonomy.
All training sets selected in any manner appear as icons under the "User-defined training sets" tab (Figure 4).
Figure 4. Icons for training sets.
A training set can also be uploaded as a fasta/multi-fasta file by the user by clicking on the "Add training set" button under the "User-defined training sets" tab and selecting a file in the filesystem (Figures 5 and 6). ClaMS works best with training sequences that are 1000 bases or longer. For accurate binning, sequences need to be at least 500 bases long.
Figure 5. Uploading a user-defined training set.
Figure 6. Uploading a user-defined training set.
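The length recommendations above (1000 bases or longer for training sequences, at least 500 bases for sequences to be binned) can be checked before a file is uploaded. The sketch below is an assumption-laden helper, not part of ClaMS: `read_fasta` and `split_by_length` are hypothetical names, and the parser handles only plain FASTA/multi-FASTA text.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_len):
    """Split records into (long_enough, too_short) by sequence length."""
    long_enough, too_short = [], []
    for header, seq in records:
        target = long_enough if len(seq) >= min_len else too_short
        target.append((header, seq))
    return long_enough, too_short
```

For example, `split_by_length(read_fasta(open(path)), 1000)` would separate candidate training sequences from those below the recommended length, where `path` is whatever file you intend to upload.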
Setting input parameters
The bottom right panel facilitates the adjustment of parameters such as normalized distance cut-off, type of signature, and word length.
Type of signature: Currently, ClaMS supports two kinds of genomic signatures, DBC and DOR.
The DBC (de Bruijn chain) signature is described in: Lenwood S. Heath, Amrita Pati, "Genomic signatures in de Bruijn chains". WABI 2007, LNBI 4645, pp. 216-227, 2007.
The DBC signature can be computed for any k-mer length; currently, ClaMS supports k-mer lengths 2-4. A DBC genomic signature of order k utilizes properties of the underlying de Bruijn chain of order k and can be represented as a vector of 2*4^k real numbers. In ClaMS, DBC signatures are compared by computing the Pearson distance between them.
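To make the vector size and the comparison step concrete, here is a minimal sketch. It assumes the common definition of Pearson distance as 1 minus the Pearson correlation coefficient; the exact form used inside ClaMS is not stated here, and the function names are hypothetical. It does not compute the DBC signature itself.

```python
from math import sqrt


def signature_length(k):
    """Length of a DBC signature vector of order k, i.e. 2 * 4^k."""
    return 2 * 4 ** k


def pearson_distance(x, y):
    """Pearson distance, taken here as 1 - r (assumed definition).

    Ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated).
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - cov / sqrt(var_x * var_y)
```

Under this definition, the supported k-mer lengths 2, 3, and 4 give vectors of 32, 128, and 512 numbers respectively.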
Distance cutoff: A slider sets the distance cutoff when DBC signatures are used. When DOR signatures are used, an automatic cutoff of 65 is applied and the slider is disabled.
k-mer size: DBC signatures can be used at a k-mer scale of 2, 3, or 4.
Figure 7. Results grid of ClaMS binning with three training sets.
Figure 8. Results pie-chart of ClaMS binning with three training sets.
Running the command line version of ClaMS
For running just one round of binning, the ClaMS command-line version can be used as follows:
USAGE: java -Xmx<custom memory for JVM heap> -jar ClaMS-CLI.jar
<Tab delimited file of training sequence files>
<Fasta file to bin>
<K-mer length 2/3/4>
<Confidence cutoff 0.01 recommended>
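When ClaMS is driven from a script, the argument order in the usage message above can be assembled programmatically. The helper below is a hypothetical sketch: the function name, the default heap size of `2g`, and the file names used in the example are assumptions, and only the positional order and the recommended cutoff of 0.01 come from the usage text.

```python
def build_clams_command(training_list, fasta, k, cutoff=0.01,
                        heap="2g", jar="ClaMS-CLI.jar"):
    """Build the argument list for one round of ClaMS binning.

    training_list: tab-delimited file of training sequence files
    fasta:         FASTA file to bin
    k:             k-mer length (2, 3, or 4)
    cutoff:        confidence cutoff (0.01 recommended)
    heap:          JVM heap size passed to -Xmx (assumed default)
    """
    if k not in (2, 3, 4):
        raise ValueError("k-mer length must be 2, 3, or 4")
    return [
        "java", f"-Xmx{heap}", "-jar", jar,
        str(training_list), str(fasta), str(k), str(cutoff),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. For instance, `build_clams_command("training.tsv", "reads.fasta", 3)` corresponds to `java -Xmx2g -jar ClaMS-CLI.jar training.tsv reads.fasta 3 0.01`, where both file names are placeholders.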
A related development is iterative ClaMS (iClaMS). iClaMS consists of a Perl wrapper script that separates a sequence set into bins by training on the sequence set itself. Directions for running both ClaMS and iClaMS are included in the README in the tar archive. The current version of iClaMS trains only on sequences in the dataset that are longer than 10 kb. In future versions, this threshold will be adjustable to suit your requirements. It is best to trust bins only for sequences longer than 400-500 bp.
Availability and Download
Version 0.1 Dec 2010
©2010 Metagenome Program, JGI
The Regents of the University of California.
The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.