COMPACT -
Comparative Package for Clustering Assessment (Version 2.0)
By
Roy Varshavsky, Michal Linial,
Outline
COMPACT is a
COMPACT is a five-step wizard that envelops some basic Matlab clustering methods and introduces the Quantum clustering algorithm that was originally proposed by Prof. David Horn and Assaf Gottlieb (see The Method of Quantum Clustering, for more details). COMPACT provides a flexible and customizable interface for clustering data with high dimensionality.
COMPACT allows both textual and graphical display for the clustering results.
How to install
COMPACT is a self-extracting package. In
order to install and run the COMPACT
tool,
follow these three
easy steps:
1.
Download the COMPACT.zip packaged to your local
drive.
2.
Add the COMPACT destination directory to your
Matlab path.
3.
Within Matlab, type 'compact' at
the command prompt.
Steps
In order to run the COMPACT properly, you should follow five steps. These steps are mandatory and cannot be skipped. However, in each one of the steps you can return to the previous step(s) and change your settings.
1.
Input
parameters
COMPACT receives two input parameters, which are Matlab base workspace variables (i.e., must be defined in Matlab workspace before running the COMPACT).
·
Data (Mandatory field) - two-dimensional matrix
of doubles. Represents the elements (objects to be clustered can be either
columns or rows).
·
Real classification (Optional field) - one-dimensional
vector. Represents the real classification of the elements (i.e., the class of
the i-th element appears in the i-th place in the vector).
Therefore, the vector’s length must be equal to the number of elements in the
Data matrix (an error occurs otherwise).
The two input parameters can be either typed
in or selected from the base workspace.
Figure 1: Input parameters
a. Selecting variables from base
workspace
Press the 'Data' or the 'Real Classification' button and double click the required variable from the list in the 'Select Variable' dialog.
Figure 2: Variables selection dialog
As explained above, the 'Real classification' parameter is optional. If this field remains empty, a popup question appears.
b. Skipping Real Classification input
If you leave the real classification empty, you will be prompted and asked whether to leave the real mapping empty (press 'Yes' button) or specify this input ('No' button). In case where real classification isn't defined, graphical displays and clustering evaluations will be modified and limited (see steps 4 and 6, respectively).
Figure 3: Missing real classification field
2.
Determining the matrix
shape and vectors to cluster
COMPACT reads the Data matrix and displays its schematic shape. You will be asked to select the entities to cluster: rows or columns. When rows are selected, each entity is a row with n(columns) dimensions (and vice versa).
For your convenience, a flip option is available. If you choose to flip the matrix, the schematic display will change accordingly and the Data matrix will be transposed.
Figure 4: Matrix shape determination
3.
Preprocessing
Procedures
The preprocessing window consists of two sections:
a. Components' variance graphs – the relative and accumulated variances of the principal components as processed by the SVD (singular values)
method,
are plotted. The variance of component i (V_{i})_{
}is defined as:
^{}.
b. Preprocessing parameters -
§ Use SVD method - yes/no check box.
§ Normalized selected vectors - yes/no check box (available only when SVD option is selected).
§ Relevant dimensions - you can select either the first x dimensions (first field), a range of dimensions (second field: from dimension #i to dimension #j) or a selection of dimensions (e.g., 1, 3-5, 7).
Figure 5: Preprocessing procedures
4.
Points distribution preview and
clustering method selection
The elements of the Data matrix are plotted (3D plot of the first 3 selected dimensions or 2D plot when only 2 dimensions are chosen). Each original class is displayed in a different color. The original classification is taken from the 'Real classification' input parameter (see step 1). If no 'Real classification' is available, all points are plotted in black.
At this stage you are asked to select the clustering method: k-means, fuzzy c-means, competitive neural network or QC (Quantum Clustering algorithm).
Figure 6: Clustering method determination
5.
Parameters for clustering
algorithms
a. Parameter for K-means algorithm (Matlab kmeans function) - number of clusters
Figure 7: K-means input parameter
b. Parameters for Fuzzy C-means algorithm (Matlab fcm function):
§ Exponent for the partition matrix U (default: 2.0)
§ Maximum number of iterations (default: 100)
§ Minimum amount of improvement (default: 1e-5)
§ Info display during iteration (default: Yes).
Figure 8: Fuzzy C-means input parameters
c. Parameters for Competitive neural network algorithm:
§ Learning rate (0 - 1)
§ Number of clusters
§ Number of learning iterations.
Figure 9: Competitive Network Input parameters
d. Parameters for Quantum Clustering algorithm*
§ Sigma value
§ Number of steps
* For more details see The Method of Quantum Clustering.
Figure 10: Quantum clustering Input parameters
6.
COMPACT results
Once the COMPACT finishes its run, the results are displayed in both graphical and textual formats.
a.
Real classification plot (upper left graph): same as
in step 4
b. Algorithm’s classification (upper right graph):
points are plotted by the same rule as in the Real Classification graph, but are
tagged by the algorithm (k-means, Competitive net or QC).
c.
Classification alternative display (lower right
graph): Assuming that the points are ordered by the real classification (i.e.,
the Real Classification array looks like: 1 1...2 2...3 3...), each change in the
background (white-gray-white...) represents a new cluster in the Real
classification. Each one of the points is represented as a bin whose location on
the x-axis equals its location in the Real classification and in the Data
matrices, and its location on the y-axis represents the tag proposed by the
algorithm. Perfect classification is therefore represented as homogenous
rectangles.
d. Purity, Efficiency and Jaccard scores (lower left
graph): These criteria for clustering assessment and are defined as
follows:
where:
§ n_{11}_{ }is the number of pairs that are classified together, both in the real classification and in the algorithm's classification.
§
n_{10} is the number of pairs that are
classified together in the real classification, but not in the algorithm's
classification.
§
n_{01} is the number of pairs that are
classified together in the algorithm's classification, but not in the real
classification.
e.
Textual summary of the clustering results (lower
right textual area).
Figure 11: COMPACT results and clustering scores
f.
Starting from version 1.3, the results can be displayed
also in a log window: the Clusters Results Log dialog is a window that
displays the summary of the clustering results and quality in a textual way (see
Figure 12). The log
window is invoked by pressing the ‘View Log’ button from the COMPACT result step.
This summary can be saved as a text (.txt) or matlab (.mat) files (menu:
File/Save/…)
Figure 12: COMPACT textual results
7.
COMPACT output
Pressing the 'End' button at this stage will terminate the application. A new variable is then added to the Matlab workspace: calcMapping.
The calcMapping variable is a one-dimensional vector that represents the calculated classification of the elements (i.e., the proposed class of the i-th element appears in the i-th place in the vector). Actually it has similar properties to the RealClassification input parameter.
8.
Optional extensions
COMPACT is a set of self
explanatory and documented Matlab functions and as such can be extended. Users
that are familiar with Matlab and wish to change features in the current tool
are welcome to do so. In addition, we designed this tool that adding a new
clustering method is done by changing only one function (see COMPACT
extension).
9. Requirements
Example
iris.mat is a
workspace that includes two variables: (a) irisData: a 4X150 double array and
(b) irisRealClassification: a 1X150 real mapping array.
[1]
To
whom comments should be addressed. E-mail: mailto:royke@cs.huji.ac.il