Open Access
High-dimensional classification using features annealed independence rules
Jianqing Fan, Yingying Fan
Ann. Statist. 36(6): 2605-2637 (December 2008). DOI: 10.1214/07-AOS504

Abstract

Classification using high-dimensional features arises frequently in many contemporary statistical studies, such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classification is poorly understood. In a seminal paper, Bickel and Levina [Bernoulli 10 (2004) 989–1010] show that the Fisher discriminant performs poorly due to diverging spectra, and they propose using the independence rule to overcome the problem. We first demonstrate that even for the independence classification rule, classification using all the features can be as poor as random guessing, due to noise accumulation in estimating population centroids in a high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as poorly as random guessing. Thus, it is important to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample t-statistic are established. The choice of the optimal number of features, or equivalently the threshold value of the test statistics, is proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.
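
The idea summarized in the abstract, ranking features by the two-sample t-statistic and then applying the independence rule to the retained features only, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`two_sample_t`, `fit_fair_sketch`, `predict_fair_sketch`), the pooled-variance estimate, and the equal-prior decision rule are assumptions made for illustration, and the number of retained features `m` is supplied by the caller rather than chosen from the error bound described in the paper.

```python
import numpy as np

def two_sample_t(X1, X2):
    """Per-feature two-sample t-statistics (unequal-variance form)."""
    n1, n2 = X1.shape[0], X2.shape[0]
    mean_diff = X1.mean(axis=0) - X2.mean(axis=0)
    v1 = X1.var(axis=0, ddof=1)
    v2 = X2.var(axis=0, ddof=1)
    return mean_diff / np.sqrt(v1 / n1 + v2 / n2)

def fit_fair_sketch(X1, X2, m):
    """Keep the m features with the largest |t| and estimate, on those
    features only, the class centroids and a pooled per-feature variance.
    ASSUMPTION: m is user-supplied; the paper derives it from an upper
    bound on the classification error, which is not reproduced here."""
    n1, n2 = X1.shape[0], X2.shape[0]
    t = two_sample_t(X1, X2)
    keep = np.argsort(-np.abs(t))[:m]
    Z1, Z2 = X1[:, keep], X2[:, keep]
    mu1, mu2 = Z1.mean(axis=0), Z2.mean(axis=0)
    pooled_var = ((n1 - 1) * Z1.var(axis=0, ddof=1)
                  + (n2 - 1) * Z2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return keep, mu1, mu2, pooled_var

def predict_fair_sketch(model, X):
    """Independence (diagonal) rule on the selected features.
    ASSUMPTION: equal class priors. Returns labels 1 or 2."""
    keep, mu1, mu2, pooled_var = model
    Z = X[:, keep]
    score = ((Z - (mu1 + mu2) / 2) * (mu1 - mu2) / pooled_var).sum(axis=1)
    return np.where(score > 0, 1, 2)

# Synthetic illustration: 1000 features, only the first 10 carry signal.
rng = np.random.default_rng(0)
p, n, n_signal = 1000, 30, 10
shift = np.zeros(p)
shift[:n_signal] = 3.0
X1 = rng.normal(size=(n, p)) + shift   # class 1
X2 = rng.normal(size=(n, p))           # class 2
model = fit_fair_sketch(X1, X2, m=10)
labels = predict_fair_sketch(model, rng.normal(size=(5, p)) + shift)
```

Restricting the centroids and variances to the selected features is what limits the noise accumulation the abstract describes: coordinates with no signal never enter the discriminant score.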

Citation

Jianqing Fan, Yingying Fan. "High-dimensional classification using features annealed independence rules." Ann. Statist. 36 (6) 2605-2637, December 2008. https://doi.org/10.1214/07-AOS504

Information

Published: December 2008
First available in Project Euclid: 5 January 2009

zbMATH: 1360.62327
MathSciNet: MR2485009
Digital Object Identifier: 10.1214/07-AOS504

Subjects:
Primary: 62G08
Secondary: 62F12, 62J12

Keywords: classification, feature extraction, high dimensionality, independence rule, misclassification rates

Rights: Copyright © 2008 Institute of Mathematical Statistics

Vol.36 • No. 6 • December 2008