Journal Article
Bioinformatics. 2006 Jul; 22(14):e431-9.
doi: 10.1093/bioinformatics/btl238.

Integrating copy number polymorphisms into array CGH analysis using a robust HMM

Sohrab P Shah, Xiang Xuan, Ron J DeLeeuw, Mehrnoush Khojasteh, Wan L Lam, Raymond Ng, Kevin P Murphy
  • PMID: 16873504
  • 48 citations


Motivation: Array comparative genomic hybridization (aCGH) is a widely used technique for identifying chromosomal aberrations in human diseases, including cancer. Aberrations are defined as regions of increased or decreased DNA copy number, relative to a normal sample. Accurately identifying the locations of these aberrations has many important medical applications. Unfortunately, the observed copy number changes are often corrupted by various sources of noise, making the boundaries hard to detect. One popular current technique uses hidden Markov models (HMMs) to divide the signal into regions of constant copy number called segments; a subsequent classification phase labels each segment as a gain, a loss or neutral. However, standard HMMs are sensitive to outliers, causing over-segmentation, where segments erroneously span very short regions.

Results: We propose a simple modification that makes the HMM robust to such outliers. More importantly, this modification allows us to exploit prior knowledge about the likely location of "outliers", which are often due to copy number polymorphisms (CNPs). By "explaining away" these outliers with prior knowledge about the locations of CNPs, we can focus attention on the more clinically relevant aberrant regions. We show significant improvements over the current state-of-the-art technique (DNAcopy with MergeLevels) on previously published data from mantle cell lymphoma cell lines, and on published benchmark synthetic data augmented with outliers.
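
To make the "explaining away" idea concrete, here is a minimal Python sketch. It is not the authors' Matlab implementation and may differ from their model in important ways. It assumes three copy-number states with Gaussian emissions on log-ratios, and mixes each emission with a broad outlier component whose per-probe prior probability can be raised at known CNP locations. The function names (`log_normal`, `robust_viterbi`), the state means, the standard deviations and the transition probability are all illustrative assumptions, not values from the paper.

```python
import math

def log_normal(y, mean, sd):
    """Log density of a Gaussian with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (y - mean) ** 2 / (2 * sd * sd)

def robust_viterbi(y, outlier_prior, means=(-0.5, 0.0, 0.5), sd=0.1,
                   outlier_sd=1.0, stay=0.99):
    """Most likely state path (0=loss, 1=neutral, 2=gain) for log-ratios y.

    outlier_prior[t] is the assumed prior probability, strictly between
    0 and 1, that probe t is an outlier (for example, higher at a known CNP).
    All default parameter values are hypothetical, chosen for illustration.
    """
    n = len(means)
    log_stay = math.log(stay)
    log_switch = math.log((1.0 - stay) / (n - 1))

    def log_emit(t, k):
        # Mixture of the state's Gaussian and a broad state-independent
        # outlier Gaussian, with the outlier indicator summed out.
        pi = outlier_prior[t]
        inlier = math.log(1.0 - pi) + log_normal(y[t], means[k], sd)
        outlier = math.log(pi) + log_normal(y[t], 0.0, outlier_sd)
        hi = max(inlier, outlier)
        return hi + math.log(math.exp(inlier - hi) + math.exp(outlier - hi))

    # Standard Viterbi recursion in log space, uniform initial distribution.
    score = [-math.log(n) + log_emit(0, k) for k in range(n)]
    backptr = []
    for t in range(1, len(y)):
        new_score, ptr = [], []
        for k in range(n):
            cands = [score[j] + (log_stay if j == k else log_switch)
                     for j in range(n)]
            best = max(range(n), key=cands.__getitem__)
            ptr.append(best)
            new_score.append(cands[best] + log_emit(t, k))
        score = new_score
        backptr.append(ptr)

    # Trace the best path back from the final probe.
    state = max(range(n), key=score.__getitem__)
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy profile: a neutral region with one outlying probe at index 5.
y = [0.0] * 5 + [0.6] + [0.0] * 5
flat = [1e-6] * len(y)   # almost no outlier tolerance anywhere
cnp = list(flat)
cnp[5] = 0.3             # probe 5 assumed to sit on a known CNP
print(robust_viterbi(y, flat))  # probe 5 becomes a one-probe "gain" segment
print(robust_viterbi(y, cnp))   # probe 5 is absorbed; the path stays neutral
```

With a near-zero outlier prior the model behaves like a standard HMM and over-segments around the single aberrant probe. Raising the prior only at the assumed CNP position lets the outlier component account for that probe, so the surrounding segment is left intact.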

Availability: Source code written in Matlab is available from

Cited by:

QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data.
Stefano Colella, Christopher Yau, +7 authors, Jiannis Ragoussis.
Nucleic Acids Res, 2007 Mar 08; 35(6). PMID: 17341461    Free PMC article.
A forward-backward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using high-density single nucleotide polymorphism (SNP) array.
Tianwei Yu, Hui Ye, +6 authors, Xiaofeng Zhou.
BMC Bioinformatics, 2007 May 05; 8. PMID: 17477871    Free PMC article.
Flexible and accurate detection of genomic copy-number changes from aCGH.
Oscar M Rueda, Ramón Díaz-Uriarte.
PLoS Comput Biol, 2007 Jun 26; 3(6). PMID: 17590078    Free PMC article.
Sparse representation and Bayesian detection of genome copy number alterations from microarray data.
Roger Pique-Regi, Jordi Monso-Varona, +3 authors, Shahab Asgharzadeh.
Bioinformatics, 2008 Jan 22; 24(3). PMID: 18203770    Free PMC article.
An improved method for detecting and delineating genomic regions with altered gene expression in cancer.
Björn Nilsson, Mikael Johansson, +2 authors, Thoas Fioretos.
Genome Biol, 2008 Jan 23; 9(1). PMID: 18208590    Free PMC article.
Major copy proportion analysis of tumor samples using SNP arrays.
Cheng Li, Rameen Beroukhim, +4 authors, Matthew Meyerson.
BMC Bioinformatics, 2008 Apr 23; 9. PMID: 18426588    Free PMC article.
MD-SeeGH: a platform for integrative analysis of multi-dimensional genomic data.
Bryan Chi, Ronald J deLeeuw, +3 authors, Wan L Lam.
BMC Bioinformatics, 2008 May 22; 9. PMID: 18492270    Free PMC article.
A probe-density-based analysis method for array CGH data: simulation, normalization and centralization.
Hung-I Harry Chen, Fang-Han Hsu, +5 authors, Yidong Chen.
Bioinformatics, 2008 Jul 08; 24(16). PMID: 18603568    Free PMC article.
Multiple aberrations of chromosome 3p detected in oral premalignant lesions.
Ivy F L Tsui, Miriam P Rosin, +2 authors, Wan L Lam.
Cancer Prev Res (Phila), 2009 Jan 14; 1(6). PMID: 19138989    Free PMC article.
Cancer gene discovery in mouse and man.
Jenny Mattison, Louise van der Weyden, Tim Hubbard, David J Adams.
Biochim Biophys Acta, 2009 Mar 17; 1796(2). PMID: 19285540    Free PMC article.
Model-based clustering of array CGH data.
Sohrab P Shah, K-John Cheung, +5 authors, Kevin P Murphy.
Bioinformatics, 2009 May 30; 25(12). PMID: 19478003    Free PMC article.
Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays.
Robert B Scharpf, Giovanni Parmigiani, Jonathan Pevsner, Ingo Ruczinski.
Ann Appl Stat, 2009 Jul 18; 2(2). PMID: 19609370    Free PMC article.
Statistical issues in the analysis of DNA Copy Number Variations.
Nathan E Wineinger, Richard E Kennedy, +3 authors, Hemant K Tiwari.
Int J Comput Biol Drug Des, 2008 Jan 01; 1(4). PMID: 19774103    Free PMC article.
PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data.
Chris D Greenman, Graham Bignell, +9 authors, Michael R Stratton.
Biostatistics, 2009 Oct 20; 11(1). PMID: 19837654    Free PMC article.
CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data.
Qunyuan Zhang, Li Ding, +9 authors, Michael A Province.
Bioinformatics, 2009 Dec 25; 26(4). PMID: 20031968    Free PMC article.
Somatic mutations altering EZH2 (Tyr641) in follicular and diffuse large B-cell lymphomas of germinal-center origin.
Ryan D Morin, Nathalie A Johnson, +35 authors, Marco A Marra.
Nat Genet, 2010 Jan 19; 42(2). PMID: 20081860    Free PMC article.
A method for detecting significant genomic regions associated with oral squamous cell carcinoma using aCGH.
Ki-Yeol Kim, Jin Kim, +2 authors, In-Ho Cha.
Med Biol Eng Comput, 2010 Mar 23; 48(5). PMID: 20306232
Detecting copy number variations from array CGH data based on a conditional random field model.
Xiao-Lin Yin, Jing Li.
J Bioinform Comput Biol, 2010 Apr 20; 8(2). PMID: 20401947    Free PMC article.
Epigenetic regulation of WNT signaling in chronic lymphocytic leukemia.
Lynda B Bennett, Kristen H Taylor, +3 authors, Charles W Caldwell.
Epigenomics, 2010 May 18; 2(1). PMID: 20473358    Free PMC article.
Computational analysis of whole-genome differential allelic expression data in human.
James R Wagner, Bing Ge, +3 authors, Mathieu Blanchette.
PLoS Comput Biol, 2010 Jul 16; 6(7). PMID: 20628616    Free PMC article.
Evolution of an adenocarcinoma in response to selection by targeted kinase inhibitors.
Steven Jm Jones, Janessa Laskin, +31 authors, Marco A Marra.
Genome Biol, 2010 Aug 11; 11(8). PMID: 20696054    Free PMC article.
A bayesian analysis for identifying DNA copy number variations using a compound poisson process.
Jie Chen, Ayten Yiğiter, Yu-Ping Wang, Hong-Wen Deng.
EURASIP J Bioinform Syst Biol, 2010 Oct 27; 2010. PMID: 20976296    Free PMC article.
Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model.
Zhengdong D Zhang, Mark B Gerstein.
BMC Bioinformatics, 2010 Nov 03; 11. PMID: 21034510    Free PMC article.
Estimating Shared Copy Number Aberrations for Array CGH Data: The Linear-Median Method.
Y-X Lin, V Baladandayuthapani, V Bonato, K-A Do.
Cancer Inform, 2010 Nov 18; 9. PMID: 21082039    Free PMC article.
A novel approach to DNA copy number data segmentation.
Siling Wang, Yuhang Wang, Yang Xie, Guanghua Xiao.
J Bioinform Comput Biol, 2011 Feb 18; 9(1). PMID: 21328710    Free PMC article.
MHC class II transactivator CIITA is a recurrent gene fusion partner in lymphoid cancers.
Christian Steidl, Sohrab P Shah, +24 authors, Randy D Gascoyne.
Nature, 2011 Mar 04; 471(7338). PMID: 21368758    Free PMC article.
Bayesian Nonparametric Hidden Markov Models with application to the analysis of copy-number-variation in mammalian genomes.
C Yau, O Papaspiliopoulos, G O Roberts, C Holmes.
J R Stat Soc Series B Stat Methodol, 2011 Jun 21; 73(1). PMID: 21687778    Free PMC article.
Model-integrated estimation of normal tissue contamination for cancer SNP allelic copy number data.
Susann Stjernqvist, Tobias Rydén, Chris D Greenman.
Cancer Inform, 2011 Jun 23; 10. PMID: 21695067    Free PMC article.
Bayesian hierarchical mixture modeling to assign copy number from a targeted CNV array.
Niall Cardin, Chris Holmes, +2 authors, Jonathan Marchini.
Genet Epidemiol, 2011 Jul 20; 35(6). PMID: 21769931    Free PMC article.
Fast MCMC sampling for hidden Markov Models to determine copy number variations.
Md Pavel Mahmud, Alexander Schliep.
BMC Bioinformatics, 2011 Nov 04; 12. PMID: 22047014    Free PMC article.
Parsimonious higher-order hidden Markov models for improved array-CGH analysis with applications to Arabidopsis thaliana.
Michael Seifert, André Gohr, Marc Strickert, Ivo Grosse.
PLoS Comput Biol, 2012 Jan 19; 8(1). PMID: 22253580    Free PMC article.
Assessing Population Level Genetic Instability via Moving Average.
Samuel McDaniel, Jessica Minnier, +5 authors, Tianxi Cai.
Stat Biosci, 2010 Dec 01; 2(2). PMID: 22866169    Free PMC article.
Interpreting genomic data via entropic dissection.
Rajeev K Azad, Jing Li.
Nucleic Acids Res, 2012 Oct 06; 41(1). PMID: 23036836    Free PMC article.
Fast detection of de novo copy number variants from SNP arrays for case-parent trios.
Robert B Scharpf, Terri H Beaty, +3 authors, Ingo Ruczinski.
BMC Bioinformatics, 2012 Dec 14; 13. PMID: 23234608    Free PMC article.
Learning smoothing models of copy number profiles using breakpoint annotations.
Toby Dylan Hocking, Gudrun Schleiermacher, +5 authors, Jean-Philippe Vert.
BMC Bioinformatics, 2013 May 24; 14. PMID: 23697330    Free PMC article.
Genome destabilizing mutator alleles drive specific mutational trajectories in Saccharomyces cerevisiae.
Peter C Stirling, Yaoqing Shen, +2 authors, Philip Hieter.
Genetics, 2013 Dec 18; 196(2). PMID: 24336748    Free PMC article.
A bayesian integrative model for genetical genomics with spatially informed variable selection.
Alberto Cassese, Michele Guindani, Marina Vannucci.
Cancer Inform, 2014 Oct 08; 13(Suppl 2). PMID: 25288877    Free PMC article.
T cells of patients with myelodysplastic syndrome are frequently derived from the malignant clone.
Suzanne M Vercauteren, Daniel T Starczynowski, +6 authors, Aly Karsan.
Br J Haematol, 2012 Feb 01; 156(3). PMID: 25289412    Free PMC article.
NBN gain is predictive for adverse outcome following image-guided radiotherapy for localized prostate cancer.
Alejandro Berlin, Emilie Lalonde, +12 authors, Robert G Bristow.
Oncotarget, 2014 Nov 22; 5(22). PMID: 25415046    Free PMC article.
Divergent clonal selection dominates medulloblastoma at recurrence.
A Sorana Morrissy, Livia Garzia, +135 authors, Michael D Taylor.
Nature, 2016 Jan 14; 529(7586). PMID: 26760213    Free PMC article.
Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression.
John Wiedenhoeft, Eric Brugel, Alexander Schliep.
PLoS Comput Biol, 2016 May 14; 12(5). PMID: 27177143    Free PMC article.
DNA copy number profiling using single-cell sequencing.
Xuefeng Wang, Hao Chen, Nancy R Zhang.
Brief Bioinform, 2017 Feb 06; 19(5). PMID: 28159966    Free PMC article.
Assessing the performance of methods for copy number aberration detection from single-cell DNA sequencing data.
Xian F Mallory, Mohammadamin Edrisi, Nicholas Navin, Luay Nakhleh.
PLoS Comput Biol, 2020 Jul 14; 16(7). PMID: 32658894    Free PMC article.
Methods for copy number aberration detection from single-cell DNA-sequencing data.
Xian F Mallory, Mohammadamin Edrisi, Nicholas Navin, Luay Nakhleh.
Genome Biol, 2020 Aug 19; 21(1). PMID: 32807205    Free PMC article.
KAT6A amplifications are associated with shorter progression-free survival and overall survival in patients with endometrial serous carcinoma.
Ozlen Saglam, Zhenya Tang, +2 authors, Gokce A Toruner.
PLoS One, 2020 Sep 03; 15(9). PMID: 32877461    Free PMC article.
Bioinformatics Analysis for Circulating Cell-Free DNA in Cancer.
Chiang-Ching Huang, Meijun Du, Liang Wang.
Cancers (Basel), 2019 Jun 20; 11(6). PMID: 31212602    Free PMC article.
Fuzzy methods for the detection of copy number variations in comparative genomic hybridization arrays.
Ahmad AlShibli, Hassan Mathkour.
Saudi J Biol Sci, 2020 Dec 12; 27(12). PMID: 33304176    Free PMC article.
Survival outcomes are associated with genomic instability in luminal breast cancers.
Lydia King, Andrew Flaus, Emma Holian, Aaron Golden.
PLoS One, 2021 Feb 04; 16(2). PMID: 33534788    Free PMC article.