Tutorial

How to use the Web Server
The web server provides all of the options available in the standalone version, plus some additional ones for visualization and comparative analysis. The various features and detailed instructions for using the web server are provided below.
Application Page
From this page the user can submit sequence reads or Blastx output and carry out taxonomic analysis. Since MetaBin uses a homology-based approach, alignments against a reference database (NCBI NR) are required. Two programs, Blat and Blastx, are provided to generate the alignments. Blat is much faster than Blastx (up to 1000 times) and can therefore dramatically reduce the time taken to generate the alignments. The results of the two alignment options are comparable. When submitting sequence reads as input, we recommend the Blat option to obtain faster results.
The first option, BLAT, shown below, uses Blat for generating the alignments. Users must submit their sequence reads as input in the specified format (see the format section below for details). Open Reading Frames (ORFs) predicted in the reads are aligned by Blat against the NCBI NR database or the 'NR minus Eukaryotes' database. The 'NR minus Eukaryotes' database is constructed by removing all proteins that are found exclusively in eukaryotic genomes. This option is useful if the user wishes to determine only the microbial composition of the uploaded metagenomic data, which is also the main aim of most metagenomic studies. The Blat alignment results are then analyzed using MetaBin, and the reads are classified into taxonomic bins.
The second option, BLAST, shown below, uses Blastx for generating the alignments. Here we recommend that users run the Blastx job (full format) on their own machine, preferably using multiple nodes or processors, and then upload the Blastx alignment output to our server for carrying out the taxonomic assignments. An option to upload sequence reads is also provided, but it will take a long time to generate the alignments, as Blastx is very slow in comparison to Blat.
The results are provided in several files for download, as shown below.
The 'thumb.png' file listed above provides a thumbnail image of the taxonomic tree (sample.reads.blastx.png) which is shown below.
A summary of the assignments is shown as a table below. Any taxonomic rank or the complete table can be selected for display. The columns can be sorted by clicking on any of the headers.
The *.json file obtained after the analysis can be uploaded from the ‘Visualization’ page (explained below) to view it on our server and to perform various comparative analyses.
Visualization Page:
This page provides several options for viewing the MetaBin results (*.json file) on our server. If a user has run the MetaBin standalone version, they can visualize their results here and perform various comparative analyses. The first option ‘Create Taxonomic Tree’ can be used to visualize the taxonomic tree and prepare a ‘Composition chart’ for a single dataset.

The results for a single metagenomic dataset are shown below.

The composition chart shows the abundance of microbes classified at various taxonomic levels and provides an overview of the microbial distribution in the metagenomic dataset.
The full view of the taxonomic tree is shown below.
The second option ‘Compare Metagenome Profiles’ can be used to compare the taxonomic profiles of up to five metagenomic datasets. The input is one *.json file per dataset, obtained after running MetaBin.
The results for the comparison of multiple metagenomic profiles are shown below. The composition chart shows the 'abundance of microbes' classified at various taxonomic levels, with the different metagenomes shown separately, so that the microbial distributions of the datasets can be compared. The 'abundance of microbes' for any rank is computed by dividing the total number of reads assigned to that rank by the total number of assigned reads.
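The abundance calculation described above can be sketched in a few lines of Python. This is only an illustration of the stated formula, not MetaBin's own code; the function name and the read counts are hypothetical.

```python
def relative_abundance(reads_per_taxon):
    """Divide each taxon's read count by the total number of assigned reads."""
    total_assigned = sum(reads_per_taxon.values())
    if total_assigned == 0:
        # No assigned reads: report zero abundance everywhere.
        return {taxon: 0.0 for taxon in reads_per_taxon}
    return {taxon: count / total_assigned
            for taxon, count in reads_per_taxon.items()}


# Hypothetical phylum-level read counts for one dataset.
phylum_counts = {"Firmicutes": 600, "Bacteroidetes": 300, "Proteobacteria": 100}
abundance = relative_abundance(phylum_counts)
for taxon, fraction in abundance.items():
    print(f"{taxon}\t{fraction:.2f}")
```

Because every value is divided by the same total, the abundances at one rank add up to 1, which is what makes profiles of different sizes comparable.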
Usage Policy
Currently we can analyze 100 MB of reads data per day using the web server. At present, the maximum upload size at one time is 10 MB for a reads file and 500 MB for a Blastx file. If you wish to analyze larger files, please contact metabin@riken.jp.
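Before uploading, the size limits above can be checked locally. The following Python sketch is an assumption-laden convenience, not part of MetaBin: it assumes that 1 MB means 1024 x 1024 bytes, and the function names are hypothetical.

```python
import os

MB = 1024 * 1024  # assumption: the server counts binary megabytes

# Per-upload limits quoted in the usage policy.
UPLOAD_LIMITS = {"reads": 10 * MB, "blastx": 500 * MB}


def within_upload_limit(size_bytes, kind):
    """Return True if a file of this size may be uploaded as 'reads' or 'blastx'."""
    return size_bytes <= UPLOAD_LIMITS[kind]


def file_within_upload_limit(path, kind):
    """Same check for a file on disk."""
    return within_upload_limit(os.path.getsize(path), kind)


print(within_upload_limit(8 * MB, "reads"))    # a reads file of 8 MB
print(within_upload_limit(12 * MB, "reads"))   # a reads file of 12 MB
```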
We understand that your data may be confidential. It is safe on our server and can be accessed only by the administrator. It is stored locally on our system for only one week after the analysis is complete and is then deleted.
HOW TO USE THE STANDALONE PROGRAM
How to install:
The installation instructions are provided on the Download page.
Instructions for running MetaBin

How to prepare the input file for MetaBin using Blat as an alignment method

./prepareinput -i <filename> -b <y|n>

filename : This refers to the name of the file containing sequence reads in valid FASTA format (See format section below for details).

-b : If you wish to run Blat separately, choose 'n'. In this case, please use the *.orf file to run Blat with the following options.

/path/bin/blatv34/blat <database> <input(*.orf)> -prot -out=blast <outputfile>

Otherwise, use the option '-b y' to run Blat using the configuration provided in the 'config' file. A 'config' file is available in the metabin directory. Please specify the path of the Blat executable and of the NCBI NR database. The output of Blat (BLAST-like full alignment format) is used as the input of the 'metabin' program.
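When Blat is run separately from a script, the command shown above can be assembled as an argument list. The Python sketch below only mirrors that command template; the function name and all paths are hypothetical placeholders.

```python
def blat_command(blat_path, database, orf_file, output_file):
    """Build the argument list for:
    blat <database> <input(*.orf)> -prot -out=blast <outputfile>
    """
    return [blat_path, database, orf_file, "-prot", "-out=blast", output_file]


# Placeholder paths; replace them with the locations on your own system.
cmd = blat_command("/path/bin/blatv34/blat", "nr.fa",
                   "sample.reads.orf", "sample.reads.blat")
print(" ".join(cmd))

# To actually run Blat, pass the list to subprocess, for example:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when file names contain unusual characters.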
Command line usage for running MetaBin
./metabin -i <infile> -a <alignment method> -n <read file> [-s <min. bit-score>] [-b <bin size>] [-r <bit-score range>] [-p <read format>]

-i infile: Specify the Blastx or Blat output file name obtained after running Blastx or Blat, respectively, on the metagenomic sequence dataset (FASTA reads).

-a alignment method: Specify the alignment method used (blat or blast).

-n read file: Specify the name of the reads file (input file name).

-s min. bit-score: Refers to the bit-score of the hit in the Blastx or Blat output for a given read. Users can provide a minimum bit-score cut-off, which will be used to consider only the hits that are above or equal to that value. The default cut-off, used when the user provides none, is 29 for Blastx and 17 for Blat.

-b bin size: Minimum number of reads needed to form a taxonomic bin. The default bin size is one read.

-r bit-score range: Refers to the range of bit-scores from the top bit-score to be considered for taxonomic assignment. The default range considers all hits 90% or more of the highest bit-score. For example, if the highest bit-score was 100, all hits with a bit-score in the range 90-100 will be considered for taxonomic assignment.

-p read format: By adding this option to the command line, users can specify that the sequence reads were generated as paired-ends and should be considered in pairs for taxonomic assignment. The two currently supported formats are 'sanger' and 'mbformat'. The mbformat is a simple format used by MetaBin, as described later. The default format is single read.

-d compare multiple metagenomes: This option is provided for comparative analysis of multiple metagenomes. With this option, users can provide the *.json files obtained after running MetaBin on the individual metagenomes for comparative analysis of their taxonomic profiles. It will generate a dendrogram showing the proportion of each taxonomic bin as a pie chart placed in the taxonomic tree. Usage (the double quotes are required): ./metabin -d "<infile1.json> <infile2.json> [<infile3.json> …]"
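The combined effect of the -s and -r options on the hits of a single read can be illustrated as follows. This Python sketch follows the description above (minimum bit-score cut-off, then all hits within 90% of the top score); it is not MetaBin's implementation, and the function name, the hit tuples and the representation of the range as a fraction are assumptions made for the example.

```python
# Default minimum bit-scores quoted above for each alignment method.
DEFAULT_MIN_SCORE = {"blast": 29, "blat": 17}


def select_hits(hits, method, min_score=None, score_range=0.9):
    """Keep the hits of one read that would be used for taxonomic assignment.

    hits: list of (subject_id, bit_score) tuples for a single read.
    score_range: fraction of the top bit-score a hit must reach
                 (assumed representation of the -r option).
    """
    cutoff = DEFAULT_MIN_SCORE[method] if min_score is None else min_score
    passing = [(subject, score) for subject, score in hits if score >= cutoff]
    if not passing:
        return []
    top = max(score for _, score in passing)
    return [(subject, score) for subject, score in passing
            if score >= score_range * top]


# Hypothetical hits for one read.
read_hits = [("A", 100), ("B", 92), ("C", 80), ("D", 20)]
print(select_hits(read_hits, "blast"))
```

With a top bit-score of 100, only the hits scoring at least 90 remain, which matches the 90-100 example given for the -r option.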

Note: After MetaBin has been run once on a file, subsequent runs (using the same or different input parameters) will be much faster. MetaBin saves the formatted file after parsing the BLASTX file, which eliminates the parsing step, the most time-consuming part of a run, from later runs. However, if the program appears to have terminated partway through a run, please delete the existing *.format file and then re-run MetaBin.

Recommendation on running BLASTX on your metagenomic sequences
MetaBin requires BLASTX output as its input. The recommended BLASTX parameters are: word size adjustment ‘-W 2 -f 8’, soft filtering setting ‘-F "m S"’ and expectation value ‘-E 100’ to include short matches. We recommend these parameters for comprehensive taxonomic assignments using the MetaBin algorithm.
FILE FORMATS AND DESCRIPTION
Input file format for using either Blat or Blastx as the alignment method
The reads file must consist of nucleotide sequences and must be in valid FASTA format with no duplicate reads. *.doc files cannot be analyzed by our system. 16S rRNA sequences cannot be analyzed by MetaBin. The FASTA format for the reads file is shown below. The annotation line should use only the underscore “_” or "|" characters as separators or delimiters, as shown below. If the annotation line contains any spaces, they will be converted to "_" in the final output.
>r1|info1|_info2_info3…
ATGCTAGCTGATGGGCTGATAGTCGTGAT
The ‘info’ in the above example is any information which the user wishes to specify in the annotation line.
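A reads file can be checked against these requirements before submission. The Python sketch below is a hypothetical pre-flight check, not a MetaBin tool: it assumes that "duplicate reads" means repeated annotation lines and that the allowed nucleotide symbols are A, C, G, T and N.

```python
NUCLEOTIDES = set("ACGTN")  # assumption: N is allowed, as in the example reads


def read_fasta(lines):
    """Yield (annotation, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_problems(lines):
    """Return a list of human-readable problems found in the reads."""
    problems, seen = [], set()
    for header, seq in read_fasta(lines):
        if header in seen:
            problems.append(f"duplicate read: {header}")
        seen.add(header)
        if not seq:
            problems.append(f"empty sequence: {header}")
        elif not set(seq.upper()) <= NUCLEOTIDES:
            problems.append(f"non-nucleotide characters: {header}")
    return problems


def output_name(header):
    """Mimic the documented conversion of spaces to '_' in the output."""
    return header.replace(" ", "_")


example = [">r1|info1|_info2", "ATGCTAGCTGATGGGCTGATAGTCGTGAT",
           ">r1|info1|_info2", "MKVLA"]
print(find_problems(example))
print(output_name("r2 info1 info2"))
```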
Input file format for using Blastx output for taxonomic assignment
The output file generated after running Blastx on your metagenomic sequences should be provided as is.
Paired-end reads format (Sanger)
For Sanger sequencers, ABI 3730 and MegaBACE 4500 (GE Health), b and z indicate forward reads, while g and y indicate reverse reads, used as follows.
b,z: forward (b:3730, z:MegaBACE)
g,y: reverse (g:3730, y:MegaBACE)
The input file format (fasta) for reads should be as shown below.
>*_A01.y
AAGTGACACGCGTACGTAGCAGATCTTCCCGGGTGATTCCNGGCGGGCTTNAATCATTTTTGCGACTGGCACCCG
>*_A01.z
AGCTGTCAGGCAGCTGCGCAGGATCTAGGCCTGAAGCTGCCCTTGATGATTTCACATTGCGTTTTCTCAATCTCCGC
>*_A02.y
TTTTTTTTTTTTTTAATGTGCACGCGTCGTAGCAGATCTTCCCGGGTGAATAAACCGAAATANGTTAAATTTTACGG
>*_A02.z
AGCGCAGGCTAGCTTGCGCAGGAATCTAGGCCTGAAGCTTGTCAGCCGCCGTNAGCACTTCATCAGCTCCTTCGCC
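The suffix convention above can be used to group reads into pairs. The following Python sketch is an illustration only; it assumes that the direction letter is whatever follows the last '.' in the read name, and the function names are hypothetical.

```python
FORWARD_SUFFIXES = {"b", "z"}  # b: ABI 3730, z: MegaBACE
REVERSE_SUFFIXES = {"g", "y"}  # g: ABI 3730, y: MegaBACE


def sanger_mate(read_name):
    """Split a name such as 'F1-S001_A01.y' into (template, direction)."""
    template, dot, suffix = read_name.rpartition(".")
    if not dot:
        raise ValueError(f"no '.' suffix in read name: {read_name}")
    if suffix in FORWARD_SUFFIXES:
        return template, "forward"
    if suffix in REVERSE_SUFFIXES:
        return template, "reverse"
    raise ValueError(f"unknown direction suffix in read name: {read_name}")


def pair_reads(read_names):
    """Group read names by template: {template: {direction: read_name}}."""
    pairs = {}
    for name in read_names:
        template, direction = sanger_mate(name)
        pairs.setdefault(template, {})[direction] = name
    return pairs


print(pair_reads(["F1-S001_A01.y", "F1-S001_A01.z", "F1-S001_A02.y"]))
```

A template with only one direction present, such as A02 in this example, indicates a read whose mate is missing from the file.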
Paired-end reads format (Illumina)
For Illumina sequences, please use the following format.
>Seq1_Annot_1
AAGTGACACGCGTACGTAGCAGATCTTCCCGGGTGATTCCNGGCGGGCTTNAATCATTTTTGCGACTGGCACCCG
>Seq1_Annot_2
AGCTGTCAGGCAGCTGCGCAGGATCTAGGCCTGAAGCTGCCCTTGATGATTTCACATTGCGTTTTCTCAATCTCCGC
>Seq2_Annot_1
TTTTTTTTTTTTTTAATGTGCACGCGTCGTAGCAGATCTTCCCGGGTGAATAAACCGAAATANGTTAAATTTTACGG
>Seq2_Annot_2
AGCGCAGGCTAGCTTGCGCAGGAATCTAGGCCTGAAGCTTGTCAGCCGCCGTNAGCACTTCATCAGCTCCTTCGCC
Paired-end reads format as fixed by MetaBin (mbformat)
We have defined a simple format which can be used to annotate your input sequences to be recognized as paired-end reads by MetaBin. The format is shown below.
>r1.1_*
AAGTGACACGCGTACGTAGCAGATCTTCCCGGGTGATTCCNGGCGGGCTTNAATCATTTTTGCGACTGGCACCCG
>r1.2_*
AGCTGTCAGGCAGCTGCGCAGGATCTAGGCCTGAAGCTGCCCTTGATGATTTCACATTGCGTTTTCTCAATCTCCGC
>r2.1_*
TTTTTTTTTTTTTTAATGTGCACGCGTCGTAGCAGATCTTCCCGGGTGAATAAACCGAAATANGTTAAATTTTACGG
>r2.2_*
AGCGCAGGCTAGCTTGCGCAGGAATCTAGGCCTGAAGCTTGTCAGCCGCCGTNAGCACTTCATCAGCTCCTTCGCC
In the above format, the reads are annotated by the prefix ‘>r’, followed by the number of the pair, followed by the member of the pair, separated by a ‘.’
For example in the above cases, ‘>r1.1’ indicates the paired end read 1 and similarly ‘>r1.2’ refers to the other member of the pair.

In the above two formats, * indicates the rest of the annotation information which could be user defined.
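An annotation line in mbformat can be recognized with a short regular expression. This Python sketch is a hypothetical helper, not MetaBin's parser; it assumes that the member of a pair is always 1 or 2 and that the user-defined part, if any, follows an underscore.

```python
import re

# 'r', pair number, '.', member of the pair, then optional '_' plus free text.
MB_NAME = re.compile(r"^r(\d+)\.([12])(?:_(.*))?$")


def parse_mbformat(header):
    """Return (pair_number, member, rest) or None if the name is not mbformat."""
    match = MB_NAME.match(header.lstrip(">"))
    if match is None:
        return None
    pair, member, rest = match.groups()
    return int(pair), int(member), rest or ""


print(parse_mbformat(">r1.1_sampleA"))
print(parse_mbformat(">r1.2_sampleA"))
print(parse_mbformat(">Seq1_Annot_1"))
```

Two reads belong to the same pair when they share the pair number and differ only in the member value.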
OUTPUT FILES

Descriptions of the output files.

1. *.format: This file is generated after parsing the Blat or Blastx output file. It contains an alignment summary in tabular format. Once this file has been generated for an input file, it will not be regenerated for the same input file in later runs, even if input parameters such as the bit-score cut-off are changed. If the user wants to parse the BLASTX file again, the pre-existing *.format file must be deleted first. This file is not provided for download as part of the web server analysis (because it is large).

2. *.sum: This file summarizes the number of reads assigned to the various taxonomic bins. The four columns are:

'Taxonomic bin'   'Taxonomic rank'   'Number of reads assigned directly to the bin'   'Sum of all reads assigned to the bin'

The last line summarizes the total number of reads in the input file, and the number of reads assigned to each of the taxonomic bins.

3. *.sum.reads: This file provides detailed information about which reads were assigned to which taxonomic group. The four columns are:

'Taxonomic bin'   'Taxonomic rank'   'Sum of all reads assigned to the bin' 'Annotation of the read(s) assigned'

4. *.bin.reads: This file provides the binning summary of the reads which are directly assigned to a taxonomic bin. The three columns are:

'Taxonomic bin'   'Number of reads assigned directly to the bin'   'Annotation of the read(s) assigned'

5. *.png: This is the image file of the taxonomic tree created for a dataset. This file can be viewed in most image viewing applications.

6. *.json: This file contains the results of the analysis and information necessary to plot the taxonomic tree.

This file is needed as an input file for carrying out the comparative analysis of two or more metagenomic datasets using MetaBin. From the VISUALIZATION page, it can be uploaded for constructing the taxonomic tree and for comparative analysis.

The command line option for comparative analysis of two or more datasets (maximum five) is as follows.

./metabin -d "<infile1> <infile2> [<infile3> ...]"

The double quotes ".." , as shown above, are necessary for the execution of this command.
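When the comparison is launched from a script, the quoting requirement means that all file names travel as one single argument. The Python sketch below shows one way to build such a command under that reading; the function name and the file names are hypothetical, and file names containing spaces are not handled.

```python
def compare_command(json_files, metabin="./metabin"):
    """Build the argument list for: ./metabin -d "<infile1> <infile2> ..."

    The quoted part of the command line becomes ONE argument that holds
    the space-separated file names.
    """
    if not 2 <= len(json_files) <= 5:
        raise ValueError("comparative analysis takes two to five *.json files")
    return [metabin, "-d", " ".join(json_files)]


cmd = compare_command(["gut_A.json", "gut_B.json", "gut_C.json"])
print(cmd)

# To actually run the comparison, pass the list to subprocess, for example:
#   import subprocess
#   subprocess.run(cmd, check=True)
```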

Descriptions of the output files generated using the -p option (standalone program) are shown below.

*.blastx.bin.reads: This file summarizes the binning before re-assignment.

*.bin.reads.mp: This file summarizes the re-assigned bins information (format is same as *.blastx.bin.reads file).

*.bin.reads.mp.log: This file summarizes the taxonomic re-assignment. A description of the output file is provided below.

#Paired-end read 1;Paired-end read 2 Taxonomic bin Taxonomic lineage Taxonomic reassignment

An example of the output is shown below

#gnl_ti_2097946941_name:F1-S001_A01.y;gnl_ti_2097946942_name:F1-S001_A01.z Same bin Same lineage -

Description of the output file generated using the -d option (standalone program) is shown below.

A_B_*_comp.png: This image file contains the taxonomic tree prepared by comparing the *.json files of two or more metagenomic datasets, for example A and B. The first ten characters of the input files, separated by '_', are used to name the output file.
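The naming rule can be pictured with a small Python sketch. It is a guess at the rule as described, not MetaBin's code: it assumes that the ten characters are taken from the start of each file name without its directory, and the input names are hypothetical.

```python
import os


def comparison_image_name(json_files):
    """Join the first ten characters of each input file name with '_'."""
    prefixes = [os.path.basename(path)[:10] for path in json_files]
    return "_".join(prefixes) + "_comp.png"


print(comparison_image_name(["gutsample1_day0.json", "soilsample_rep2.json"]))
```

Input files whose names share the same first ten characters would produce indistinguishable prefixes, so distinct leading characters make the output easier to identify.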

Description of the -f option (standalone program), used to update the taxonomy database.

This command is used to create a new taxonomy database using the taxonomy information available at NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/).
Please download the file taxdump.tar.gz or taxdmp.zip. Uncompress this file. Provide the path of the uncompressed directory as shown below.
./metabin -f fullpathofdirectory/
The output database file is tax.db which will be made in the ./data directory. The pre-existing tax.db file in this directory will be moved to tax.db.old.

Comparison of MetaBin with other available programs
Performance of MetaBin
Time taken on a real metagenomic dataset containing more than 20 million reads
The performance of MetaBin was validated on real metagenomic data obtained by Illumina sequencing from a Spanish male individual (V1CD2, age 49, BMI 27.76, 20,707,369 high quality reads, library 090107 (ftp://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/)). The ‘prepareinput’ program was used to predict the ORFs. The output file containing the predicted ORFs was then divided into 22 parts, each part was aligned against NR using Blat on 22 processors, each with 16 GB of RAM, and the outputs were concatenated into a single file (149 GB). MetaBin (the ‘metabin’ program) was then run on a single processor (with 32 GB of RAM) to carry out the taxonomic assignments. Only those bins containing at least 10,000 reads were considered, while the rest of the parameters used the default values. In total, it took about 370 CPU hours, which is reasonable considering the input size of more than 20.7 million reads. We estimated that for the same number of reads, the Blastx program would have taken almost two years on a single processor with a system configuration similar to the one used for Blat.

Time taken to analyze the Blastx output
Using the Blat option to generate alignments reduces the run time by up to 1000 times compared to Blastx. To analyze 30,000 reads, Blastx took 46 hours on an 8-core processor, for a total of 46 x 8 = 368 CPU hours, whereas Blat took only about 21 minutes for the same job on a single processor.
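The speed-up implied by these timings can be worked out directly. The short Python calculation below uses only the figures quoted above; the variable names are ours.

```python
# Blastx: 46 hours on 8 cores for 30,000 reads.
blastx_cpu_hours = 46 * 8

# Blat: about 21 minutes on a single processor for the same reads.
blat_cpu_hours = 21 / 60

speedup = blastx_cpu_hours / blat_cpu_hours
print(f"Blastx: {blastx_cpu_hours} CPU hours")
print(f"Blat:   {blat_cpu_hours:.2f} CPU hours")
print(f"Speed-up: about {speedup:.0f} times")
```

The ratio comes out slightly above 1000, in line with the "up to 1000 times" figure, given that the Blat time is itself approximate.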

We also compared the time taken to analyze the Blastx output using both MetaBin and MEGAN. A summary is shown below.

Work flow of MetaBin (Figure A)
Calculation of weight and criteria for reassigning taxonomic bin to paired-end reads on the basis of weight (Figure B):
Taxonomic assignment of a read containing multiple ORFs using MetaBin (Figure C)

A, B, C, D, E and F are the genomes (taxonomic ID) to which the ORFs showed matches: