The Annals of Probability

Power laws for family sizes in a duplication model

Rick Durrett and Jason Schweinsberg


Qian, Luscombe and Gerstein [J. Molecular Biol. 313 (2001) 673–681] introduced a model of the diversification of protein folds in a genome that we may formulate as follows. Consider a multitype Yule process starting with one individual in which there are no deaths and each individual gives birth to a new individual at rate 1. When a new individual is born, it has the same type as its parent with probability $1-r$ and is a new type, different from all previously observed types, with probability $r$. We refer to individuals with the same type as families and provide an approximation to the joint distribution of family sizes when the population size reaches $N$. We also show that if $1 \ll S \ll N^{1-r}$, then the number of families of size at least $S$ is approximately $C N S^{-1/(1-r)}$, while if $N^{1-r} \ll S$ the distribution decays more rapidly than any power.
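
The model described above can be explored numerically. The following is a minimal sketch, not the authors' method: it simulates only the embedded jump chain of the process (since every individual gives birth at rate 1, the parent of each new individual is uniform over the current population) and stops when the population reaches N. The function names `simulate_family_sizes` and `families_at_least`, and the `seed` parameter, are hypothetical and chosen here for illustration.

```python
import random
from collections import Counter


def simulate_family_sizes(N, r, seed=None):
    """Simulate the duplication model until the population size is N.

    Returns the family sizes sorted in decreasing order.
    Assumption: only the sequence of births matters, so continuous
    time is ignored and each parent is picked uniformly at random.
    """
    rng = random.Random(seed)
    types = [0]        # type label of each individual; one founder
    next_type = 1      # label to assign to the next new type
    while len(types) < N:
        parent_type = types[rng.randrange(len(types))]
        if rng.random() < r:
            # With probability r the newborn founds a new family.
            types.append(next_type)
            next_type += 1
        else:
            # With probability 1 - r it joins its parent's family.
            types.append(parent_type)
    return sorted(Counter(types).values(), reverse=True)


def families_at_least(sizes, S):
    """Count the families whose size is at least S."""
    return sum(1 for s in sizes if s >= S)


if __name__ == "__main__":
    sizes = simulate_family_sizes(N=20000, r=0.2, seed=1)
    for S in (2, 4, 8, 16):
        print(S, families_at_least(sizes, S))
```

Plotting the counts from `families_at_least` against S on log-log axes, for S well below $N^{1-r}$, gives a rough visual check of the stated exponent $-1/(1-r)$. A single run is noisy, so averaging over several seeds is advisable.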

Article information

Ann. Probab., Volume 33, Number 6 (2005), 2094-2126.

First available in Project Euclid: 7 December 2005

Primary: 60J80: Branching processes (Galton-Watson, birth-and-death, etc.)
Secondary: 60J85: Applications of branching processes [See also 92Dxx]; 92D15: Problems related to evolution; 92D20: Protein sequences, DNA sequences

Keywords: Power law; Yule processes; multitype branching processes; genome sequencing


Durrett, Rick; Schweinsberg, Jason. Power laws for family sizes in a duplication model. Ann. Probab. 33 (2005), no. 6, 2094--2126. doi:10.1214/009117905000000369.


  • Aldous, D. J. (2001). Stochastic models and descriptive statistics for phylogenetic trees from Yule to today. Statist. Sci. 16 23--34.
  • Angerer, W. P. (2001). An explicit representation of the Luria--Delbrück distribution. J. Math. Biol. 42 145--174.
  • Angerer, W. P. and Wakolbinger, A. (2005). In preparation.
  • Arratia, R., Barbour, A. D. and Tavaré, S. (2000). Limits of logarithmic combinatorial structures. Ann. Probab. 28 1620--1644.
  • Arratia, R. and Gordon, L. (1989). Tutorial on large deviations for the binomial distribution. Bull. Math. Biol. 51 125--131.
  • Athreya, K. B. and Karlin, S. (1968). Embedding of urn schemes into continuous time Markov branching processes. Ann. Math. Statist. 39 1801--1817.
  • Athreya, K. B. and Ney, P. E. (1972). Branching Processes. Springer, Berlin.
  • Barabási, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286 509--512.
  • Berger, N., Borgs, C., Chayes, J. and Saberi, A. (2005). On the spread of viruses on the internet. In Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms 301--310. SIAM, Philadelphia, PA.
  • Bollobás, B., Borgs, C., Chayes, J. and Riordan, O. (2003). Directed scale free graphs. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms 132--139. SIAM, Philadelphia, PA.
  • Bollobás, B., Riordan, O., Spencer, J. and Tusnády, G. (2001). The degree sequence of a scale-free random graph process. Random Structures Algorithms 18 279--290.
  • Cooper, C. and Frieze, A. (2003). A general model for web graphs. Random Structures Algorithms 22 311--335.
  • Durrett, R. (1996). Probability: Theory and Examples, 2nd ed. Duxbury, Belmont, CA.
  • Durrett, R. (2002). Probability Models for DNA Sequence Evolution. Springer, New York.
  • Durrett, R. and Schweinsberg, J. (2004). Approximating selective sweeps. Theoret. Population Biol. 66 129--138.
  • Ewens, W. J. (1972). The sampling theory of selectively neutral alleles. Theoret. Population Biol. 3 87--112.
  • Gu, Z., Cavalcanti, A., Chen, F.-C., Bouman, P. and Li, W.-H. (2002). Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol. Biol. Evol. 19 256--262.
  • Harrison, P. M. and Gerstein, M. (2002). Studying genomes through the aeons: Protein families, pseudogenes, and proteome evolution. J. Mol. Biol. 318 1155--1174.
  • Huynen, M. A. and van Nimwegen, E. (1998). The frequency distribution of gene family sizes in complete genomes. Mol. Biol. Evol. 15 583--589.
  • Janson, S. (2004). Functional limit theorems for multitype branching processes and generalized Pólya urns. Stochastic Process. Appl. 110 177--245.
  • Janson, S. (2004). Limit theorems for triangular urn schemes. Preprint. Available at http://www.
  • Johnson, N. L., Kotz, S. and Kemp, A. W. (1992). Univariate Discrete Distributions. Wiley, New York.
  • Karev, G. P., Wolf, Y. I., Rzhetsky, A. Y., Berezov, F. S. and Koonin, E. V. (2002). Birth and death of protein domains: A simple model explains power law behavior. BMC Evolutionary Biology 2 article 18.
  • Kingman, J. F. C. (1982). The coalescent. Stochastic Process. Appl. 13 235--248.
  • Koonin, E. V., Wolf, Y. I. and Karev, G. P. (2002). The structure of the protein universe and genome evolution. Nature 420 218--223.
  • Krapivsky, P. L., Redner, S. and Leyvraz, F. (2000). Connectivity of growing random networks. Phys. Rev. Lett. 85 4629--4632.
  • Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A. and Upfal, E. (2000). Stochastic models for the web graph. In Proceedings of the 41st IEEE Symposium on the Foundations of Computer Science 57--65.
  • Li, W.-H., Gu, Z., Wang, H. and Nekrutenko, A. (2001). Evolutionary analyses of the human genome. Nature 409 847--849.
  • Pitman, J. (1995). Exchangeable and partially exchangeable random partitions. Probab. Theory Related Fields 102 145--158.
  • Pitman, J. (2002). Combinatorial stochastic processes. Lecture Notes for St. Flour Summer School. Available at
  • Pitman, J. and Yor, M. (1997). The two-parameter Poisson--Dirichlet distribution derived from a stable subordinator. Ann. Probab. 25 855--900.
  • Qian, J., Luscombe, N. M. and Gerstein, M. (2001). Protein family and fold occurrence in genomes: Power-law behavior and evolutionary model. J. Mol. Biol. 313 673--681.
  • Rzhetsky, A. and Gomez, S. M. (2001). Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17 988--996.
  • Schweinsberg, J. and Durrett, R. (2004). Random partitions approximating the coalescence of lineages during a selective sweep. Preprint. Available at 0411069.
  • Simon, H. A. (1955). On a class of skew distribution functions. Biometrika 42 425--440.
  • Skorohod, A. V. (1961). Asymptotic formulas for stable distribution laws. Selected Translations in Mathematical Statistics and Probability 1 157--161.
  • Yule, G. U. (1925). A mathematical theory of evolution based on the conclusions of Dr. J. C. Willis. Philos. Trans. Roy. Soc. London Ser. B 213 21--87.