Assemblies of next-generation sequencing (NGS) data, while accurate, still contain a substantial number of errors that need to be corrected after the assembly process. Earlier assembly algorithms developed for Sanger sequencing follow an "overlap-layout-consensus" paradigm, where consensus refers to fixing errors in the contigs. Since this paradigm faces difficulties in short-read assembly, most NGS assemblers employ a de Bruijn graph approach that deals effectively with large amounts of data. However, most NGS assemblers neglect the consensus step: Velvet and many other popular assemblers do no postprocessing of the contigs. Relying on high and uniform coverage, NGS assembly algorithms push the burden of producing high-quality assemblies onto the construction of the de Bruijn graph. Our work demonstrates that NGS assemblers can benefit from the use of a consensus step. There are currently no other tools that aim to accomplish this goal.
UCSD researchers have recently developed a method and companion software, SEQuel, to correct errors (i.e., insertions, deletions, and substitutions) in the assembled contigs of NGS data. Fundamental to SEQuel is the positional de Bruijn graph, a graph structure that models k-mers within reads while incorporating the approximate positions of reads into the model. SEQuel takes as input an assembled contig, the paired-end reads that align to that contig, and the approximate positions where they align, and returns a refined contig.
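To make the idea of a positional de Bruijn graph concrete, the following is a minimal illustrative sketch in Python, not SEQuel's actual implementation. It assumes that each node is a (k-mer, approximate position) pair, and that two occurrences of the same k-mer are merged into one node only when their positions differ by at most a tolerance `delta`. The function names, the `delta` parameter, and the merging rule are assumptions introduced for illustration.

```python
from collections import defaultdict


def positional_kmers(read, start, k):
    """Yield (kmer, approximate_position) for each k-mer of a read
    whose alignment to the contig begins at position `start`."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], start + i


def build_positional_de_bruijn(aligned_reads, k, delta):
    """Build a simplified positional de Bruijn graph.

    aligned_reads: iterable of (read_sequence, approximate_start) pairs.
    k:             k-mer length.
    delta:         hypothetical tolerance; occurrences of the same k-mer
                   within `delta` positions share one node.
    Returns (nodes, edges): nodes maps each k-mer to the list of
    representative positions, edges maps (node, node) to a read count.
    """
    nodes = defaultdict(list)
    edges = defaultdict(int)

    def node_for(kmer, pos):
        # Reuse an existing node if one lies within the tolerance.
        for rep in nodes[kmer]:
            if abs(rep - pos) <= delta:
                return (kmer, rep)
        nodes[kmer].append(pos)
        return (kmer, pos)

    for read, start in aligned_reads:
        prev = None
        for kmer, pos in positional_kmers(read, start, k):
            cur = node_for(kmer, pos)
            if prev is not None:
                edges[(prev, cur)] += 1  # consecutive k-mers in a read
            prev = cur
    return nodes, edges


# Toy input: two overlapping reads and one identical read aligned far away.
reads = [("ACGTAC", 0), ("CGTACG", 1), ("ACGTAC", 100)]
nodes, edges = build_positional_de_bruijn(reads, k=3, delta=2)
```

In this toy input, the k-mers shared by the two overlapping reads collapse into the same nodes, while the identical read aligned near position 100 produces separate nodes. This illustrates how positional information can keep repeated sequence from being merged, which an ordinary de Bruijn graph would collapse.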
SEQuel is applicable to correcting errors in contigs from high-throughput sequencing (HTS) assemblies. These might include bacterial, plant, or vertebrate genomes that have not previously been sequenced, or the products of transcript assembly.
The development stage of SEQuel is complete, and initial tests on several datasets have already shown its utility in correcting errors in assembled genomes. When SEQuel was applied to the recently assembled Deltaproteobacterium SAR324 genome, the first bacterial genome with a comprehensive single-cell genome assembly, it made over 800 changes (insertions, deletions, and substitutions) to refine the assembly.