Journal Article
PLoS One. 2009 Mar;4(3).
doi: 10.1371/journal.pone.0004922.

Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data

Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Jonathan S Schildcrout, Bryan E Shepherd, Frank E Harrell
  • PMID: 19290050
  • 38 references
  • 10 citations

Abstract

Background: Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development.

Methodology/Principal Findings: We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data.
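How the choice of error metric and class imbalance can change the outcome of a significance test can be sketched in a few lines. The example below is a minimal illustration and not the authors' protocol: the function names, the 0.5 threshold, the toy scores, and the label-permutation test are all assumptions introduced here. It also assumes the scores are predictions on independent held-out samples; testing a full analysis protocol would require rerunning the whole protocol on each permuted dataset.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve, computed by counting positive/negative pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.5):
    """Proportion of correct calls after thresholding the scores (assumed cutoff)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def permutation_p_value(scores, labels, metric, n_perm=1000, seed=0):
    """One-sided p-value: share of label shuffles scoring at least the observed value."""
    rng = random.Random(seed)
    observed = metric(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if metric(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)

# Hypothetical held-out predictions: 2 events among 10 samples (imbalanced).
# The events are ranked first, but every score is below the 0.5 cutoff.
scores = [0.45, 0.40, 0.30, 0.28, 0.25, 0.20, 0.15, 0.12, 0.10, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

p_auc = permutation_p_value(scores, labels, auc)
p_acc = permutation_p_value(scores, labels, accuracy)
print(f"AUC = {auc(scores, labels):.2f}, permutation p = {p_auc:.3f}")
print(f"accuracy = {accuracy(scores, labels):.2f}, permutation p = {p_acc:.3f}")
```

In this toy setting every sample is called negative, so accuracy is the same under any shuffling of the labels and its permutation p-value is exactly 1. The ranking-based AUC still registers the perfect ordering, and its p-value is small (only 1 of the 45 possible placements of the two events reproduces it). The same predictions therefore look significant under one metric and uninformative under the other.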

Conclusions/Significance: The findings of the present study have two important practical implications. First, by avoiding under-powered data analysis protocols, high-throughput studies can achieve substantial economies in the sample size required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Far fewer samples than previously thought may be sufficient for exploratory studies, as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.
