Journal Article
2020 Oct; 21(1):428.
doi: 10.1186/s12859-020-03774-1.

Hypercluster: a flexible tool for parallelized unsupervised clustering optimization

Lili Blumenberg, Kelly V Ruggles
  • PMID: 32993491


Background: Unsupervised clustering is a common and exceptionally useful tool for large biological datasets. However, clustering requires upfront algorithm and hyperparameter selection, which can introduce bias into the final clustering labels. It is therefore advisable to obtain a range of clustering results from multiple models and hyperparameters, which can be cumbersome and slow.

Results: We present hypercluster, a Python package and SnakeMake pipeline for flexible and parallelized clustering evaluation and selection. Users can efficiently evaluate a wide range of clustering results from multiple models and hyperparameters to identify an optimal model.
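
The kind of search described above can be illustrated with a minimal sketch. It does not use hypercluster's own API; it assumes plain scikit-learn, a hypothetical `SEARCH_SPACE` of two algorithms with small hyperparameter grids, synthetic data, and the silhouette score as the only evaluation metric. It runs sequentially, whereas the package and pipeline are described as parallelized.

```python
from itertools import product

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical search space (an assumption for illustration):
# each clustering class maps to a grid of hyperparameter values.
SEARCH_SPACE = {
    KMeans: {"n_clusters": [2, 3, 4, 5], "n_init": [10], "random_state": [0]},
    AgglomerativeClustering: {
        "n_clusters": [2, 3, 4, 5],
        "linkage": ["ward", "average"],
    },
}


def expand_grid(grid):
    """Yield every combination of hyperparameter values as a dict."""
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def evaluate_all(X, search_space=SEARCH_SPACE):
    """Fit every model/hyperparameter combination and score its labels.

    Returns (model name, params, silhouette score) tuples, best score first.
    """
    results = []
    for model_cls, grid in search_space.items():
        for params in expand_grid(grid):
            labels = model_cls(**params).fit_predict(X)
            score = silhouette_score(X, labels)
            results.append((model_cls.__name__, params, score))
    return sorted(results, key=lambda item: item[2], reverse=True)


# Synthetic data: three well-separated groups in two dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

results = evaluate_all(X)
best_name, best_params, best_score = results[0]
print(best_name, best_params, round(best_score, 3))
```

Because each combination is fitted independently, the loop body is the natural unit to distribute across workers. A single metric is also a simplification: comparing several evaluation metrics side by side gives a more robust basis for choosing a model.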

Conclusions: Hypercluster improves the ease of use, robustness and reproducibility of unsupervised clustering applied to high-throughput biology. Hypercluster is available on pip and bioconda; installation, documentation and example workflows can be found at: .

Keywords: Hyperparameter optimization; Machine learning; Python; Scikit-learn; SnakeMake; Unsupervised clustering.
