MetaBin: a program for accurate, fast and highly sensitive taxonomic assignments of metagenomic sequences

For comprehensive taxonomic binning, we developed the ‘MetaBin’ web server and standalone program for faster and more accurate taxonomic assignment of single and paired-end sequence reads of varying lengths (≥45 bp) obtained from both Sanger and next-generation sequencing platforms. We benchmarked it using both simulated reads (>1 million) and real metagenomic datasets. MetaBin correctly assigns more reads to their expected taxonomic lineages, with a lower error frequency, than other methods. It displays high accuracy (positive predictive value (PPV) ≥99%) along with high sensitivity (≥94%) for various read lengths. In particular, for short Illumina reads (~45-75 bp) it makes about 4% more assignments than its closest competitors, with near 100% accuracy when reference genomes are available.
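The PPV and sensitivity figures quoted above can be illustrated with a minimal sketch. It assumes one common convention for read binning: PPV is the fraction of assigned reads that are correct, and sensitivity is the fraction of all reads that are correctly assigned. The exact definitions used for the MetaBin benchmark are not given in this section, and the function name `binning_metrics` is hypothetical.

```python
def binning_metrics(correct, incorrect, unassigned):
    """Return (ppv, sensitivity) from read counts.

    Assumed definitions (not confirmed by the source text):
      PPV         = correct / (correct + incorrect)
      sensitivity = correct / (correct + incorrect + unassigned)
    """
    assigned = correct + incorrect
    total = assigned + unassigned
    # Guard against empty inputs so the function never divides by zero.
    ppv = correct / assigned if assigned else 0.0
    sensitivity = correct / total if total else 0.0
    return ppv, sensitivity
```

Under these assumed definitions, a method can have a high PPV and a low sensitivity at the same time if it leaves many reads unassigned, which is why the two values are reported together.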

Implementing Blat, a faster alignment method, as an alternative to Blastx (both options are available) reduces the analysis time 50- to 1000-fold, making it comparable to or faster than the usually faster composition-based methods. This makes it practical to use a more accurate and sensitive homology-based approach for high-throughput analysis of large datasets, because it removes the bottleneck of the time required to generate alignments with Blastx. The MetaBin web server allows users to upload their own data, as sequence reads or Blastx output, to carry out taxonomic analysis. It provides several visualization options for constructing a taxonomic tree of the results and for performing comparative analysis of the taxonomic profiles of multiple metagenomic datasets. A standalone command-line version is also available and is strongly recommended for high-throughput analysis at the user’s end.

Need for taxonomic binning

Metagenomics has emerged as a powerful culture-independent approach for exploring the complexity and diversity of microbial genomes in their natural environments. Globally, several hundred metagenomic projects are either ongoing or in the planning stages. These projects generate huge numbers of sequence reads of various lengths, depending on the methodology used. Though the primary aim of these studies is usually to capture the entire microbial community that exists in an environment, the methodology used generates a complex mixture of short genomic sequences derived from several different genomes found within that environment. The first, and primary, challenge in metagenomic data analysis is to ascertain the genomic origin of metagenomic sequences and to make appropriate taxonomic assignments. The situation becomes more complicated when many of the sequences come from novel or as yet uncultured species, whose genomes are not well represented in the reference databases.

Motivation to develop a better, comprehensive and more efficient algorithm

The present methods mainly use either composition- or homology-based approaches for the taxonomic assignment of sequence reads. Though the homology-based methods are more sensitive and accurate, their main drawback is the time needed to generate the Blast alignments. The motivation to develop a better homology-based algorithm for taxonomic classification came from the fact that none of the currently available methods is comprehensive enough, and none considers some key features of metagenomic sequences that could yield more, and more accurate, taxonomic assignments. Therefore, we provide this novel web server and program called ‘MetaBin’, which exploits the information from all possible ORFs (complete or partial) in each sequence read when carrying out the taxonomic assignment.

Genomic DNA fragmented into reads of different lengths derived from various regions. (A) Read from an intergenic region, (B) read containing a small 5’ region of an ORF, (C) read containing two partial ORFs at the 5’ and 3’ ends and a complete ORF in the middle, (D) read containing only a single complete ORF, (E) read containing a long partial ORF at one end, (F) read lying completely within an ORF, (G) read with a sequencing error that splits a single ORF into two smaller ORFs.
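The idea that a single read can carry several complete or partial ORFs can be illustrated with a short sketch. This is not MetaBin's implementation, which works from Blastx or Blat alignments. It only scans the six reading frames of a read for stop-free stretches and flags those that touch the start or end of the read as candidate partial ORFs. Treating an ORF as a stop-to-stop stretch (ignoring start codons) is a simplifying assumption, and the function names and the `min_codons` threshold are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_segments(read, min_codons=10):
    """List stop-free stretches in all six reading frames of a read.

    Each result is (strand, frame, start, end, touches_read_end), with
    coordinates given on the scanned strand. touches_read_end is True when
    the stretch is not bounded by stop codons on both sides within the read,
    so it may be part of a longer ORF that extends beyond the read.
    """
    read = read.upper()
    results = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for frame in range(3):
            seg_start = frame
            pos = frame
            while pos + 3 <= len(seq):
                if seq[pos:pos + 3] in STOPS:
                    if (pos - seg_start) // 3 >= min_codons:
                        # No upstream stop was seen if the stretch begins
                        # at the very first codon of this frame.
                        results.append(
                            (strand, frame, seg_start, pos, seg_start == frame)
                        )
                    seg_start = pos + 3
                pos += 3
            # The trailing stretch runs off the 3' end of the read.
            if (pos - seg_start) // 3 >= min_codons:
                results.append((strand, frame, seg_start, pos, True))
    return results
```

In this sketch, a read like case (C) in the figure would yield two stretches flagged as touching a read end and one internal stretch in between. A sequencing error that introduces a stop codon, as in case (G), would split one stretch into two shorter ones.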