Model-Based Clustering for Proportional Data and Feature Selection for Cluster Discrimination

16:00, November 22, Kennedy Seminar Room 2

Mixture models are among the most popular approaches to model-based clustering, and mixtures of Gaussians dominate because of their simple principles and tractable nature. However, because a Gaussian distribution has unbounded support, problems arise when working with data that live in a compact space, such as proportions (0–100%). To cluster this type of data, we propose a Dirichlet Process Beta mixture model that makes no assumptions about the number of clusters. Inference is performed with a standard Variational Bayes (sVB) approach, which is then upgraded to a more robust version, known as Annealed Variational Bayes (AVB), that is less sensitive to poor initialization of the algorithm. Feature selection is also performed, i.e., the detection of the important features that discriminate between clusters. This talk concentrates on comparing the performance of sVB and AVB on synthetic data, and on the “per cluster” feature selection. Applications to real microarray DNA methylation data are currently in progress. This is joint work with Leonardo Bottolo, University of Cambridge.
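
For readers unfamiliar with the model class, the following is a minimal sketch of how synthetic proportional data could be drawn from a truncated Dirichlet Process Beta mixture using the stick-breaking construction. It is an illustration only and is not the speakers' implementation: the function name, the truncation level, the concentration parameter alpha and the Gamma prior on the Beta shape parameters are all assumptions made for this example.

```python
import numpy as np

def sample_dp_beta_mixture(n, alpha=1.0, truncation=20, seed=0):
    """Draw n points in [0, 1] from a truncated DP mixture of Beta densities.

    All defaults are illustrative assumptions, not values from the talk.
    """
    rng = np.random.default_rng(seed)

    # Stick-breaking weights: v_k ~ Beta(1, alpha),
    # pi_k = v_k * prod_{j<k} (1 - v_j).
    v = rng.beta(1.0, alpha, size=truncation)
    v[-1] = 1.0  # close the stick so the truncated weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    weights = v * remaining
    weights /= weights.sum()  # guard against floating-point drift

    # Hypothetical base measure: Beta shape parameters drawn from Gamma(2, 2).
    a = rng.gamma(2.0, 2.0, size=truncation)
    b = rng.gamma(2.0, 2.0, size=truncation)

    # Assign each point to a component, then draw it from that Beta density.
    z = rng.choice(truncation, size=n, p=weights)
    x = rng.beta(a[z], b[z])
    return x, z, weights

x, z, weights = sample_dp_beta_mixture(500)
print("components actually used:", np.unique(z).size)
```

Although the truncation allows up to 20 components, only a handful typically receive data, which is how the Dirichlet Process prior avoids fixing the number of clusters in advance.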