Open Access
A hierarchical framework for state-space matrix inference and clustering
Chandler Zuo, Kailei Chen, Kyle J. Hewitt, Emery H. Bresnick, Sündüz Keleş
Ann. Appl. Stat. 10(3): 1348-1372 (September 2016). DOI: 10.1214/16-AOAS938


Integrative analysis of multiple experimental datasets measured over a large number of observational units is the focus of many contemporary genomic and epigenomic studies. The key objectives of such studies include not only inferring a hidden state of activity for each unit over individual experiments, but also detecting highly associated clusters of units based on their inferred states. Although there are a number of methods tailored for specific datasets, there is currently no state-of-the-art modeling framework for this general class of problems. In this paper, we develop the MBASIC (Matrix Based Analysis for State-space Inference and Clustering) framework. MBASIC consists of two parts: state-space mapping and state-space clustering. In state-space mapping, it maps observations onto a finite state-space, representing the activation states of units across conditions. In state-space clustering, MBASIC incorporates a finite mixture model to cluster the units based on their inferred state-space profiles across all conditions. Both the state-space mapping and clustering can be simultaneously estimated through an Expectation-Maximization algorithm. MBASIC flexibly adapts to a large number of parametric distributions for the observed data, as well as the heterogeneity in replicate experiments. It allows for imposing structural assumptions on each cluster, and enables model selection using information criteria. In our data-driven simulation studies, MBASIC showed significant accuracy in recovering both the underlying state-space variables and clustering structures. We applied MBASIC to two genome research problems using large numbers of datasets from the ENCODE project. The first application grouped genes based on transcription factor occupancy profiles of their promoter regions in two different cell types. The second application focused on identifying groups of loci that are similar to a GATA2 binding site that is functional at its endogenous locus by utilizing transcription factor occupancy data, and illustrated the applicability of MBASIC in a wide variety of problems. In both studies, MBASIC showed higher levels of raw data fidelity than analyzing these data with a two-step approach using ENCODE results on transcription factor occupancy data.
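The clustering half of the abstract (a finite mixture model over state profiles, fitted by Expectation-Maximization) can be illustrated with a toy sketch. This is not the MBASIC implementation: it assumes the states are binary (active or inactive) and already observed, whereas MBASIC treats them as latent and estimates them jointly with the clusters. The function names `em_bernoulli_mixture` and `hard_labels`, the smoothing constants, and the example matrix are all hypothetical choices made for illustration.

```python
import math
import random


def em_bernoulli_mixture(profiles, n_clusters, n_iter=50, seed=0):
    """Toy EM for a mixture of independent Bernoulli state profiles.

    profiles: list of equal-length lists of 0/1 states (units x conditions).
    Returns (weights, probs, resp), where resp[i][k] is the posterior
    probability that unit i belongs to cluster k.
    Assumption: states are observed, unlike the latent states in MBASIC.
    """
    rng = random.Random(seed)
    n_units = len(profiles)
    n_conds = len(profiles[0])
    weights = [1.0 / n_clusters] * n_clusters
    # Random starting activation probabilities, kept away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_conds)]
             for _ in range(n_clusters)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior cluster membership for each unit, in log space.
        resp = []
        for row in profiles:
            log_post = []
            for k in range(n_clusters):
                ll = math.log(weights[k])
                for x, p in zip(row, probs[k]):
                    ll += math.log(p if x == 1 else 1.0 - p)
                log_post.append(ll)
            top = max(log_post)
            unnorm = [math.exp(v - top) for v in log_post]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: update mixing weights and per-condition activation rates.
        for k in range(n_clusters):
            mass = sum(r[k] for r in resp)
            # Small smoothing keeps weights and rates strictly inside (0, 1).
            weights[k] = (mass + 1e-6) / (n_units + n_clusters * 1e-6)
            for j in range(n_conds):
                active = sum(r[k] * row[j] for r, row in zip(resp, profiles))
                probs[k][j] = (active + 0.5) / (mass + 1.0)
    return weights, probs, resp


def hard_labels(resp):
    """Assign each unit to its most probable cluster."""
    return [max(range(len(r)), key=r.__getitem__) for r in resp]


# Hypothetical example: 6 units (rows) observed under 6 conditions (columns).
example_profiles = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]
example_weights, example_probs, example_resp = em_bernoulli_mixture(
    example_profiles, n_clusters=2)
example_labels = hard_labels(example_resp)
```

In this simplified setting each cluster is summarized by one activation probability per condition. A fuller treatment in the spirit of the abstract would add a data-level likelihood linking raw observations to the hidden states, and would compare cluster counts with an information criterion.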


Citation: Chandler Zuo, Kailei Chen, Kyle J. Hewitt, Emery H. Bresnick, Sündüz Keleş. "A hierarchical framework for state-space matrix inference and clustering." Ann. Appl. Stat. 10 (3) 1348-1372, September 2016.


Received: 1 May 2015; Revised: 1 January 2016; Published: September 2016
First available in Project Euclid: 28 September 2016

zbMATH: 06775269
MathSciNet: MR3553227
Digital Object Identifier: 10.1214/16-AOAS938

Keywords: ChIP-seq, clustering, E-M algorithm, state-space, transcription factors

Rights: Copyright © 2016 Institute of Mathematical Statistics

Vol. 10 • No. 3 • September 2016