With today’s technology, storage of genome sequence data relies heavily on compression, using techniques such as Lempel-Ziv and gzip, with the data commonly stored in the .bam or .sam file formats. Current techniques use standard reference genomes, such as hg19, compiled from a variety of human genomes. The results of many small reads are aligned against the reference and stored along with their quality scores. Whole genome sequencing, particularly in the clinical treatment of cancer, will rapidly consume available storage. In 2010, 13 million Americans had cancer; with the existing technology, storing a single whole genome sequence for each person would require 39 exabytes, equal to 39,000 petabytes, 39 million terabytes, or 39 billion gigabytes. There simply isn't a storage system that large, and storage capacity grows at a rate of less than 20% per year.
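The storage figures above can be checked with a few lines of arithmetic. This is only a sanity check, assuming decimal (SI) units; the per-genome size of about 3 terabytes is not quoted in the text but is implied by dividing the 39-exabyte total by 13 million patients, and the names `PATIENTS`, `TOTAL_BYTES`, and `convert` are illustrative.

```python
# Decimal (SI) units are assumed throughout.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

PATIENTS = 13_000_000      # Americans with cancer in 2010 (from the text)
TOTAL_BYTES = 39 * EB      # total storage quoted in the text

# Implied size of one stored whole genome sequence.
# This is an inference from the two figures above, not a quoted number.
per_genome_bytes = TOTAL_BYTES // PATIENTS


def convert(total_bytes):
    """Express a byte count in the three units used in the text."""
    return {
        "petabytes": total_bytes // PB,
        "terabytes": total_bytes // TB,
        "gigabytes": total_bytes // GB,
    }


if __name__ == "__main__":
    print("Implied size per genome (TB):", per_genome_bytes // TB)
    print(convert(TOTAL_BYTES))
```

Under these assumptions, the implied per-genome size comes to 3 terabytes, and the three unit conversions match the figures quoted in the text.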
Researchers at UC Santa Cruz have developed Genomic Deduplication, which could shrink the set of whole genome sequences to under 1 petabyte. The invention addresses the storage capacity problem by removing redundancy, so that genomic data consumes less storage space. It is estimated that a typical whole genome sequence of a human will require approximately 300 GB of storage using this scheme. Genomic Deduplication has two additional benefits: improved processing efficiency, because the deduplication library remains in memory and can be referenced quickly rather than being read from disk, and elimination of the need for a standard reference genome.
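The general idea of deduplication with an in-memory library can be illustrated with a small sketch. The listing does not describe the patented algorithm, so this is not the inventors' method: it assumes a generic scheme that splits each sequence into fixed-length chunks, stores each distinct chunk once in a dictionary keyed by its SHA-256 digest, and represents each genome as a list of digests. The class `DedupStore`, its methods, and the 8-base chunk size are hypothetical choices made for illustration.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical, not the patented scheme)."""

    def __init__(self, chunk_size=8):
        self.chunk_size = chunk_size
        self.library = {}   # digest -> chunk text, each distinct chunk stored once
        self.genomes = {}   # sample name -> ordered list of digests

    def add(self, name, sequence):
        """Split a sequence into fixed-length chunks and record references."""
        refs = []
        for start in range(0, len(sequence), self.chunk_size):
            chunk = sequence[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk.encode("ascii")).hexdigest()
            # Store the chunk only if it has not been seen before.
            self.library.setdefault(digest, chunk)
            refs.append(digest)
        self.genomes[name] = refs

    def get(self, name):
        """Rebuild the original sequence from the in-memory library."""
        return "".join(self.library[d] for d in self.genomes[name])

    def unique_bases(self):
        """Number of bases actually held in the library."""
        return sum(len(chunk) for chunk in self.library.values())


# Two made-up 24-base sequences that differ only in their last chunk.
seq_a = "ACGTACGTTTGACCAAGGCTAGCT"
seq_b = "ACGTACGTTTGACCAAGGCTAGGA"

store = DedupStore(chunk_size=8)
store.add("sample_a", seq_a)
store.add("sample_b", seq_b)
```

In this toy run the two sequences share their first two chunks, so the library holds four distinct chunks (32 bases) rather than the 48 bases of the raw input, and both sequences can be rebuilt without consulting a reference genome. Fixed-length chunking is used only for simplicity: a single inserted or deleted base shifts every later chunk boundary, so a practical system would need a more robust way of choosing chunk boundaries.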
|United States Of America||Issued Patent||9,886,561||02/06/2018||2014-456|
Additional Patent Pending
Genomics, genomic sequence, data storage, genomics data storage, Genomic Deduplication, genome sequence data