  1. What is ClaMS
  2. Forms of availability of ClaMS
  3. Using GUI-based ClaMS
    1. Launching
    2. Training
    3. Setting input parameters
    4. Visualizing the results
  4. Using CLI-based ClaMS
  5. Using iClaMS
  6. Availability and Download

ClaMS
ClaMS is a sequence composition-based classifier for metagenomic sequences. It captures a signature for each sequence based on the sequence composition. Each sequence is modeled as a walk in a de Bruijn graph with underlying Markov chain properties, and the signature combines stationary parameters of that Markov chain with structural parameters of the de Bruijn graph. In practice, a signature is computed for each sequence to be binned and matched against the signatures computed for the training sets. The best match wins, provided it also satisfies the normalized distance cut-off. If the best match does not satisfy this cut-off, the sequence remains un-binned.
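
The match-and-cut-off logic described above can be illustrated with a minimal sketch. This is not the ClaMS signature itself: plain normalized k-mer frequencies and a Euclidean distance stand in for the de Bruijn graph and Markov chain parameters, and the names kmer_signature, distance, and assign_bin are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_signature(seq, k=2):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    Assumption: a simple stand-in for the real ClaMS signature.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def distance(a, b):
    """Euclidean distance between two signatures (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_bin(seq, training, cutoff, k=2):
    """Return the label of the closest training set, or None (un-binned)
    if even the best match lies beyond the cutoff."""
    sig = kmer_signature(seq, k)
    best_label, best_dist = None, float("inf")
    for label, train_seq in training.items():
        d = distance(sig, kmer_signature(train_seq, k))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= cutoff else None
```

With two toy training sets, one AT-rich and one GC-rich, an AT-rich query is assigned to the first set under a loose cutoff and is left un-binned when the cutoff is made very strict.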

GUI version of ClaMS
The screenshots below demonstrate the workings of the GUI version of ClaMS.
Figure 1. Screenshot of the GUI-based implementation of ClaMS.

Launching ClaMS
Follow these steps to launch the ClaMS-UI application.

  1. Install a Java Runtime Environment (JRE) on your computer.
  2. Save the .jar file to a directory of your choice.
  3. You can open the app by double-clicking the jar file, but this starts the Java Virtual Machine (JVM) with its default parameters, which might not be adequate. Instead, open a command prompt and navigate to the directory in which the jar is stored. Then launch the application by typing the following:
       java -Xmx2000M -jar ClaMS-UI.jar
     If this prints error messages to the console and does not bring up the app, try reducing the JVM memory setting to 1000M or so.
ClaMS-CLI can be launched in the same manner.
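
The "try 2000M, then fall back to 1000M" advice above can be scripted. The sketch below is only an illustration: launch_commands and launch are hypothetical helper names, and it assumes that java is on the PATH, that the jar sits in the current directory, and that a failed JVM start is reported through a non-zero exit code.

```python
import subprocess


def launch_commands(jar="ClaMS-UI.jar", heap_sizes_mb=(2000, 1000)):
    """Build one java command per heap size, largest first."""
    return [["java", f"-Xmx{mb}M", "-jar", jar] for mb in heap_sizes_mb]


def launch(jar="ClaMS-UI.jar", heap_sizes_mb=(2000, 1000)):
    """Try each heap size in turn.

    Returns the first command that exits cleanly, or None if every
    attempt fails or java cannot be found. The call blocks until the
    application is closed.
    """
    for cmd in launch_commands(jar, heap_sizes_mb):
        try:
            if subprocess.run(cmd).returncode == 0:
                return cmd
        except FileNotFoundError:
            # java is not installed or not on the PATH
            return None
    return None


if __name__ == "__main__":
    launch()
```

Passing jar="ClaMS-CLI.jar" builds the equivalent launch command for the command-line version, which additionally needs its own arguments.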

Adding a training set
ClaMS can be trained either by using isolate genome sequences (whose signatures are pre-packaged with the program) or by using user-specified training sequences. The top right panel facilitates the addition of training datasets. Isolate-genome training sets can be specified at any taxonomic level by selecting a node in the taxonomy under the "Existing taxonomy" tab (Figures 2 and 3).
Figure 2. Selecting a training set from the existing taxonomy.


Figure 3. Selecting Burkholderiales as a training set from the existing taxonomy.

All selected training sets, however they were added, appear as icons under the "User-defined training sets" tab (Figure 4).
Figure 4. Icons for training sets.

You can also upload a training set as a FASTA or multi-FASTA file by clicking the "Add training set" button under the "User-defined training sets" tab and selecting a file in the filesystem (Figures 5 and 6). ClaMS works best with training sequences that are 1000 bases or longer. For accurate binning, sequences need to be at least 500 bases long.
Figure 5. Uploading a user-defined training set.


Figure 6. Uploading a user-defined training set.
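
Because sequence length matters for both training and binning, it can help to filter a multi-FASTA file before uploading it. The following is a minimal sketch, not part of ClaMS: read_fasta and filter_by_length are hypothetical helpers, and the thresholds are the 1000-base and 500-base guidelines given above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep only the records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


if __name__ == "__main__":
    # Hypothetical file name; replace with your own multi-FASTA file.
    with open("training.fasta") as handle:
        records = list(read_fasta(handle))
    training = filter_by_length(records, 1000)  # preferred for training
    binnable = filter_by_length(records, 500)   # minimum for binning
    print(len(training), "training sequences,", len(binnable), "binnable sequences")
```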

Setting input parameters
The bottom right panel facilitates the adjustment of parameters such as normalized distance cut-off, type of signature, and word length.

Type of signature: Currently, two kinds of genomic signatures are supported by ClaMS: DBC and DOR.


Distance cutoff: A slider sets the distance cutoff when DBC signatures are used. When DOR signatures are used, a cutoff of 65 is applied automatically and the slider is disabled.

k-mer size: DBC signatures can be used at a k-mer scale of 2, 3, or 4.

Visualizing results

Figure 7. Results grid of ClaMS binning with three training sets.


Figure 8. Results pie-chart of ClaMS binning with three training sets.

Running the command line version of ClaMS
For running just one round of binning, the ClaMS command-line version can be used as follows:

USAGE: java -Xmx<custom memory for JVM heap> -jar ClaMS-CLI.jar
<Tab delimited file of training sequence files>
<Fasta file to bin>
<Output file>
<Signtype DBC/DOR>
<K-mer length 2/3/4>
<Confidence cutoff 0.01 recommended>
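
When the CLI is driven from a script, the argument order in the usage line above is easy to get wrong. The sketch below assembles the command in that order and checks the two enumerated arguments. build_cli_command and the file names in the example are hypothetical, and the 2000M heap default is borrowed from the GUI launch example rather than stated for the CLI.

```python
def build_cli_command(training_list, fasta_to_bin, output_file,
                      sign_type, kmer, cutoff=0.01,
                      heap="2000M", jar="ClaMS-CLI.jar"):
    """Return the argument list for one ClaMS-CLI run.

    Arguments follow the order of the usage line: training list,
    FASTA file to bin, output file, signature type, k-mer length,
    confidence cutoff.
    """
    if sign_type not in ("DBC", "DOR"):
        raise ValueError("sign_type must be 'DBC' or 'DOR'")
    if kmer not in (2, 3, 4):
        raise ValueError("kmer must be 2, 3, or 4")
    return ["java", f"-Xmx{heap}", "-jar", jar,
            training_list, fasta_to_bin, output_file,
            sign_type, str(kmer), str(cutoff)]


if __name__ == "__main__":
    import subprocess
    # Hypothetical input and output file names.
    cmd = build_cli_command("training.tsv", "reads.fasta", "bins.out", "DBC", 3)
    subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when file paths contain spaces.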


iClaMS
A related development is iterative ClaMS (iClaMS). iClaMS consists of a Perl wrapper script that separates a sequence set into bins by training on the sequence set itself. Directions for running both ClaMS and iClaMS are included in the README in the tar archive. The current version of iClaMS trains only on sequences in the dataset that are longer than 10 kb. In future versions, this threshold can be changed according to your requirements. It is best to trust bins only for sequences longer than 400-500 bp.

Availability and Download
Download





Questions/Comments:
Version 0.1 Dec 2010
©2010 Metagenome Program, JGI
The Regents of the University of California.

The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.