Calculating BIC

Computational Complexity

· Calculating BIC

As described by Zhong and Ghosh, and Cadez et al. 2003, the runtime of the EM while computing the BIC is O(KNc+KD²), where

o K is the number of clusters

o N is the number of elements

o c is the number of iterations (Zhong and Ghosh, assume by empirical results that the EM algorithm converges within 20 to 30 iterations).

o D is the number of dimensions.

Although most writers assume that D is very small (usually up to 3 dimensions and seldom reaches to a dozen), and therefore exclude it from the runtime complexity calculation, we find it relevant to emphasize its significant contribution to the overall runtime complexity.

Figure 1 describes the runtime (in seconds) as a function of the dataset’s dimension.

Figure 1: Runtime of the BIC calculation as a function of the number of dimensions of the datasets[1], and the corresponding quadratic trendline.

· QC

Horn and Axel (2003), showed that finding the clusters minima can be simplified to analysis of the data points only, thus reducing the time complexity to O(D∙N), where D is the number of dimensions after the SVD reduction and N is the number of data points (the potential at each point is a function of all other points). They also showed that gradient descent process involves calculation in the order of where

o c is the number of iterations required for convergence (for most practical cases, 30-50 iterations were found to be sufficient).

o D is the number of dimensions.

o N is the number of elements

· K-Means

The runtime for basic K-Means is O(c*K*D*N), where

o c is the number of iterations. (c is often small, 5-10).

o K is the number of clusters

o D is the number of dimensions.

o N is the number of elements

Cadez I. V., Heckerman D., Meek C., Smyth P., White S. Model-based Clustering and Visualization of Navigation Patterns on a Web Site. Journal of Data Mining and Knowledge Discovery, 7 (4), pp. 399-424, 2003.

Horn, D. and Axel, I. (2003) Novel clustering algorithm for microarray expression data in a truncated SVD space, Bioinformatics, 19, 1110-1115.

Zhong, S. and Ghosh, J. (2003) A unified framework for model-based clustering, Journal of Machine Learning Research 4, 1001-1037

[1] Dataset taken from the demonstration in ICA toolbox