The Annals of Applied Statistics

Deconvolution of base pair level RNA-Seq read counts for quantification of transcript expression levels

Han Wu and Yu Zhu



RNA-Seq has emerged as the method of choice for profiling the transcriptomes of organisms. In particular, it aims to quantify the expression levels of transcripts using the short nucleotide sequences, or short reads, generated in RNA-Seq experiments. In real experiments, the transcript from which each short read was generated is not observed, and short reads are mapped to the genome rather than the transcriptome. Therefore, the quantification of transcript expression levels is an indirect statistical inference problem.

In this article, we propose to use individual exonic base pairs as observation units and, further, to model nonzero as well as zero counts at all base pairs at both the transcript and gene levels. At the transcript level, two-component Poisson mixture distributions are postulated; these give rise to the convolution of Poisson mixture (CPM) distribution model at the gene level. Maximum likelihood estimation, carried out with the EM algorithm, is used to estimate model parameters and quantify transcript expression levels. We refer to the proposed method as CPM-Seq. Both simulation studies and real data analyses demonstrate the effectiveness of CPM-Seq, showing that it produces more accurate and consistent quantification results than Cufflinks.
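To make the convolution idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the count contributed by each transcript at a base pair follows an independent two-component Poisson mixture, and it uses the fact that a sum of independent Poisson variables is again Poisson, so the gene-level count is a mixture over all combinations of components. The function names, mixing weights, and rates are hypothetical and chosen only for illustration; the paper's actual parameterization is not given in the abstract.

```python
import itertools
import math


def poisson_pmf(y, lam):
    """Poisson probability P(Y = y) for a rate lam >= 0."""
    if lam == 0.0:
        return 1.0 if y == 0 else 0.0
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))


def mixture_pmf(y, weight, rates):
    """Two-component Poisson mixture: weight on rates[0], 1 - weight on rates[1]."""
    return weight * poisson_pmf(y, rates[0]) + (1.0 - weight) * poisson_pmf(y, rates[1])


def cpm_pmf(y, weights, rates):
    """PMF of a sum of independent two-component Poisson mixtures.

    Each of the 2**T component choices contributes a Poisson term whose
    rate is the sum of the chosen component rates and whose weight is
    the product of the chosen mixing weights.
    """
    total = 0.0
    for choice in itertools.product((0, 1), repeat=len(weights)):
        prob = 1.0
        lam = 0.0
        for w, r, c in zip(weights, rates, choice):
            prob *= w if c == 0 else 1.0 - w
            lam += r[c]
        total += prob * poisson_pmf(y, lam)
    return total


# Hypothetical parameters for two transcripts that share one base pair.
weights = [0.7, 0.4]
rates = [(5.0, 0.5), (12.0, 1.0)]

# Cross-check the mixture form against a direct discrete convolution.
y = 6
direct = sum(
    mixture_pmf(k, weights[0], rates[0]) * mixture_pmf(y - k, weights[1], rates[1])
    for k in range(y + 1)
)
closed = cpm_pmf(y, weights, rates)
```

Enumerating all component choices costs 2**T terms for T transcripts, which is workable for genes with a handful of isoforms. Estimating the weights and rates from observed gene-level counts is the deconvolution step that the abstract assigns to maximum likelihood with the EM algorithm; the sketch covers only the forward model.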

Article information

Ann. Appl. Stat., Volume 10, Number 3 (2016), 1195-1216.

Received: October 2013
Revised: October 2015
First available in Project Euclid: 28 September 2016


Keywords

RNA-Seq; transcriptome profiling; finite Poisson mixture model; convolution


Wu, Han; Zhu, Yu. Deconvolution of base pair level RNA-Seq read counts for quantification of transcript expression levels. Ann. Appl. Stat. 10 (2016), no. 3, 1195–1216. doi:10.1214/16-AOAS906.




Supplemental materials

  • Supplementary document for deconvolution of base pair level RNA-Seq read counts for quantification of transcript expression levels. We provide a supplementary document to show the details of the Poisson mixture distribution, the conditional distribution of $y_{m}^{r}$, the distribution of the illustrative example, the composite likelihood function, the details of the EM algorithm, the quantification method, supporting figures for the illustrative example, quantification results for MCF7, and the supporting figure for Example 1.