New Method for Similarity Search with Well-Characterized Statistics
Tech ID: 19317 / UC Case 2001-042-0
Similarity search methods are important for data mining in many fields, such as biology, finance, and artificial intelligence. In molecular biology, for example, computer-assisted sequence comparison is widely recognized as an essential analytical tool. The quality of an alignment is usually characterized by an alignment score. Statistical significance (e.g., a p-or E- value) needs to be assigned to the alignment score in order to interpret the result. This statistics assessment is however a difficult and time-consuming task for the existing Smith-Waterman-type or HMM-based algorithms, whenever gaps are involved.
A novel hybrid algorithm has recently been devised so that the alignments it generates obey universal statistics, without sacrificing the sensitivity and computational cost compared to the best of the existing methods. The alignment scores of the hybrid algorithm have been shown to satisfy the so-called Gumbel distribution, with the few Gumbel parameters readily computable, for a large class of scoring functions and for a wide range of sequence lengths and amino acid frequencies. These allow one to rapidly characterize the results of sequence comparison, without the need for extensive simulation for each scoring functions as it is currently done.
The hybrid algorithm has been successfully tested on known classes of protein sequences. For more information, see: Y.-K. Yu and T. Hwa, Statistical Significance of Probabilistic Sequence Alignment and Related Local Hidden Markov Models, J. Comp. Biol., submitted for publication.
- Provides universal statistics for a large class of similarity detection algorithms.
- Score statistics obey Gumbel distribution with lambda=1 for a wide range of gap costs and substitution matrices.
- No need for time-consuming simulations in order to assign statistical significance.
- 1000-fold savings in computer time for statistical evaluation.
- Same sensitivity as best existing methods.
- Similar computational complexity as for existing methods.
- Easily implemented without sacrificing computational efficiency.
- Can easily incorporate BLAST heuristics for large database searches.
- Applicable to position-specific scoring system.
- Extends to many similarity search problems, such as in finance (e.g., correlations among financial time series) and artificial intelligence (e.g., enhanced recognition of speech patterns).