Research Seminar

The influence of noise variables on the selection of the number of clusters in normal mixtures using BIC


Tomoki Tokuda


KU Leuven

Abstract: The Bayesian Information Criterion (BIC) is often used to estimate the number of clusters in mixture models. However, in the presence of noise variables, which do not discriminate between clusters in the data, BIC is empirically known to tend to underestimate the true number of clusters. The nature of this problem is not well understood, because a rigorous quantitative analysis of the issue is lacking. In the present work, we study the influence of the number of noise variables on the expected difference in BIC between models with different numbers of clusters. For multivariate normal mixtures, we derive analytical results for the expected BICs of one- versus two-cluster models and of two- versus three-cluster models. The joint influence of the number of noise variables, the number of relevant variables, the inter-cluster distance per variable, and the sample size is assessed. It will be shown that these results can be used in practice to determine an optimal sample size through a BIC-based pseudo-power study. Furthermore, using the analytical results for a model with a general covariance structure, it will be shown that, contrary to intuition, adding relevant variables may require a larger sample size to select the correct two-cluster model.
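As a rough illustration of why noise variables can work against the larger model, the sketch below counts only the BIC penalty term. It is not the analysis presented in the talk. It assumes the common convention BIC = -2 log L + k log n and, for simplicity, a normal mixture with diagonal covariance matrices; the function names are hypothetical. Under these assumptions, every added variable, relevant or not, increases the extra penalty paid by the model with more clusters, while a pure noise variable is not expected to improve the fit in a systematic way.

```python
import math


def n_free_params(n_clusters, n_vars):
    """Free parameters of a normal mixture with diagonal covariances.

    The diagonal-covariance structure is an assumption made for this sketch.
    """
    weights = n_clusters - 1          # mixing proportions sum to one
    means = n_clusters * n_vars       # one mean per cluster and variable
    variances = n_clusters * n_vars   # one variance per cluster and variable
    return weights + means + variances


def bic_penalty(n_clusters, n_vars, n_samples):
    """Penalty term k * log(n) of BIC = -2 log L + k log n."""
    return n_free_params(n_clusters, n_vars) * math.log(n_samples)


def extra_penalty(n_relevant, n_noise, n_samples, k_small=1, k_large=2):
    """Additional penalty paid by the k_large-cluster model over the k_small one."""
    n_vars = n_relevant + n_noise
    return (bic_penalty(k_large, n_vars, n_samples)
            - bic_penalty(k_small, n_vars, n_samples))


if __name__ == "__main__":
    # Two relevant variables, a growing number of noise variables, n = 200.
    for n_noise in (0, 5, 20):
        print(n_noise, round(extra_penalty(2, n_noise, 200), 2))
```

The two-cluster model is selected only if its gain in -2 log L exceeds this extra penalty, so the printed values can be read as the improvement in fit that the relevant variables alone must supply.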
Date: Tue Dec 20, 12:00 pm - 1:00 pm
Place: room 01.07 (Department of Psychology, Tiensestraat 102, 3000 Leuven)