Open Access
February 2008
Approximate word matches between two random sequences
Conrad J. Burden, Miriam R. Kantorovitz, Susan R. Wilson
Ann. Appl. Probab. 18(1): 1-21 (February 2008). DOI: 10.1214/07-AAP452

Abstract

Given two sequences over a finite alphabet $\mathcal{L}$, the $D_2$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the $D_2$ statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For $k<m$, we look at the count of $m$-letter word matches with up to $k$ mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance and prove its distribution is asymptotically normal.
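
As an illustration only, the word-match count described above can be sketched in a few lines of Python. This is a naive quadratic-time sketch, not the authors' code: the function names `hamming` and `approx_word_matches` are hypothetical, and it assumes that every pair of word positions (one in each sequence) is counted and that a "mismatch" means a differing letter at the same offset within the word. It does not model reverse-complement strands or any of the probabilistic results of the paper.

```python
def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))


def approx_word_matches(seq_a, seq_b, m, k=0):
    """Count pairs of m-letter words, one from each sequence, that differ
    in at most k positions. With k = 0 this is the exact word-match count.

    Assumption: all (i, j) position pairs are counted, including
    overlapping words within each sequence.
    """
    if m <= 0 or not 0 <= k < m:
        raise ValueError("require m > 0 and 0 <= k < m")
    count = 0
    for i in range(len(seq_a) - m + 1):
        word_a = seq_a[i:i + m]
        for j in range(len(seq_b) - m + 1):
            if hamming(word_a, seq_b[j:j + m]) <= k:
                count += 1
    return count
```

For example, `approx_word_matches("AAA", "AAT", m=2, k=0)` counts only the exact matches of the word `AA`, while raising `k` to 1 also counts the pairs in which `AA` is compared with `AT`.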

Citation

Conrad J. Burden, Miriam R. Kantorovitz, Susan R. Wilson. "Approximate word matches between two random sequences." Ann. Appl. Probab. 18(1): 1-21, February 2008. https://doi.org/10.1214/07-AAP452

Information

Published: February 2008
First available in Project Euclid: 9 January 2008

zbMATH: 1141.60013
MathSciNet: MR2380889
Digital Object Identifier: 10.1214/07-AAP452

Subjects:
Primary: 60F17, 92D20

Keywords: DNA sequences, sequence comparison, word matches

Rights: Copyright © 2008 Institute of Mathematical Statistics
