Electronic Communications in Probability

Closeness to the diagonal for longest common subsequences in random words

Christian Houdré and Heinrich Matzinger


Abstract

The nature of the alignment with gaps corresponding to a longest common subsequence (LCS) of two independent iid random sequences drawn from a finite alphabet is investigated. It is shown that such an optimal alignment typically matches pieces of similar short length. This is of importance in understanding the structure of optimal alignments of two sequences. Moreover, it is also shown that any property common to two subsequences typically holds in most parts of the optimal alignment whenever this same property holds, with high probability, for strings of similar short length. Our results should, in particular, prove useful for simulations, since they imply that the rescaled two-dimensional representation of an LCS gets uniformly close to the diagonal as the length of the sequences grows without bound.
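As an illustration of the kind of simulation the abstract alludes to, the following is a minimal sketch and not the authors' code. It computes one optimal alignment by the standard dynamic-programming recursion and reports the largest rescaled distance |i - j| / n of a matched pair from the diagonal. The function names, the binary alphabet, the uniform letter distribution, and the tie-breaking rule in the backtracking step are all assumptions made for this example; since an LCS is generally not unique, the sketch inspects only one particular optimal alignment.

```python
import random


def lcs_alignment(x, y):
    """Return the matched 1-based index pairs (i, j) of one LCS of x and y."""
    n, m = len(x), len(y)
    # table[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Backtrack from the corner; ties are broken by moving up first
    # (an arbitrary choice made for this sketch).
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            pairs.append((i, j))
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def max_rescaled_deviation(pairs, n):
    """Largest |i - j| / n over the matched pairs (0.0 if there are none)."""
    return max((abs(i - j) for i, j in pairs), default=0) / n


def simulate(n, alphabet="ab", seed=0):
    """Draw two iid uniform words of length n; return (LCS length, deviation)."""
    rng = random.Random(seed)
    x = [rng.choice(alphabet) for _ in range(n)]
    y = [rng.choice(alphabet) for _ in range(n)]
    pairs = lcs_alignment(x, y)
    return len(pairs), max_rescaled_deviation(pairs, n)


if __name__ == "__main__":
    for n in (50, 200, 800):
        length, deviation = simulate(n)
        print(n, length, round(deviation, 4))
```

Running the script prints, for each n, the LCS length and the maximal rescaled deviation of the chosen alignment. A single run with a fixed seed is only suggestive; averaging over many seeds and larger n would give a better picture of the behaviour described in the abstract.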

Article information

Source
Electron. Commun. Probab., Volume 21 (2016), paper no. 36, 19 pp.

Dates
Received: 29 December 2014
Accepted: 20 April 2016
First available in Project Euclid: 27 April 2016

Permanent link to this document
https://projecteuclid.org/euclid.ecp/1461781966

Digital Object Identifier
doi:10.1214/16-ECP4029

Mathematical Reviews number (MathSciNet)
MR3492931

Zentralblatt MATH identifier
1338.05004

Subjects
Primary: 05A05: Permutations, words, matrices; 60C05: Combinatorial probability; 60F10: Large deviations

Keywords
longest common subsequences; optimal alignments; last passage percolation; edit/Levenshtein distance

Rights
Creative Commons Attribution 4.0 International License.

Citation

Houdré, Christian; Matzinger, Heinrich. Closeness to the diagonal for longest common subsequences in random words. Electron. Commun. Probab. 21 (2016), paper no. 36, 19 pp. doi:10.1214/16-ECP4029. https://projecteuclid.org/euclid.ecp/1461781966

