Clustering evaluation
Purity, Efficiency (also called Precision and Recall, or Specificity and
Sensitivity, respectively) and Jaccard scores are all external indices used to
assess clustering and classification. They are calculated after clustering, and
their generalized versions are defined as follows:
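$$
\text{Purity} = \frac{n_{11}}{n_{11} + n_{01}}, \qquad
\text{Efficiency} = \frac{n_{11}}{n_{11} + n_{10}}, \qquad
\text{Jaccard} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}
$$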
Where:
· n11 is the number of pairs that are classified together, both in the ‘expert’ classification and in the classification obtained by the algorithm.
· n10 is the number of pairs that are classified together in the ‘expert’ classification, but not in the algorithm’s classification.
· n01 is the number of pairs that are classified together in the algorithm’s classification, but not in the ‘expert’ classification.
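The pair counts can be obtained by enumerating every pair of items and checking whether each labeling places the two items in the same group. The following is a minimal Python sketch, not a reference implementation. It assumes the usual pair-counting definitions (Purity as n11 / (n11 + n01), Efficiency as n11 / (n11 + n10), Jaccard as n11 / (n11 + n10 + n01)); the function names, the sample labels and the choice to return 0.0 when a denominator is zero are illustrative assumptions.

```python
from itertools import combinations


def pair_counts(expert, predicted):
    """Classify every item pair by whether each labeling groups it together."""
    if len(expert) != len(predicted):
        raise ValueError("label sequences must have the same length")
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(expert)), 2):
        same_expert = expert[i] == expert[j]
        same_predicted = predicted[i] == predicted[j]
        if same_expert and same_predicted:
            n11 += 1  # together in both classifications
        elif same_expert:
            n10 += 1  # together only in the 'expert' classification
        elif same_predicted:
            n01 += 1  # together only in the algorithm's classification
        else:
            n00 += 1  # together in neither
    return n11, n10, n01, n00


def pair_scores(expert, predicted):
    """Return (purity, efficiency, jaccard) under the assumed definitions."""
    n11, n10, n01, _ = pair_counts(expert, predicted)
    # Assumption: a score with an empty denominator is reported as 0.0.
    purity = n11 / (n11 + n01) if (n11 + n01) else 0.0
    efficiency = n11 / (n11 + n10) if (n11 + n10) else 0.0
    jaccard = n11 / (n11 + n10 + n01) if (n11 + n10 + n01) else 0.0
    return purity, efficiency, jaccard


# Hypothetical toy data: five items, two expert classes, two clusters.
expert = ["a", "a", "a", "b", "b"]
predicted = [0, 0, 1, 1, 1]

print(pair_counts(expert, predicted))
print(pair_scores(expert, predicted))
```

The loop visits all n(n-1)/2 pairs, which is fine for small examples; for large data sets the same counts are usually derived from a contingency table instead.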
The Jaccard score reflects the ‘intersection over union’ between the
algorithm’s clustering assignments and the expected classification. Its values range
from 0 (no match) to 1 (perfect match), and it is a lower bound of both the
Purity and the Efficiency. An important attribute of the Jaccard score is its
normalization: since the formula excludes n00 (the number of pairs that are
classified together in neither the algorithm’s clustering nor the expected
classification), its value does not depend on the size of the cluster.
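The effect of excluding n00 can be illustrated with a small experiment: adding items that stand alone in both classifications increases n00 but leaves the Jaccard score unchanged. This is a hedged Python sketch that assumes the pair-counting form Jaccard = n11 / (n11 + n10 + n01); the helper names and the toy labels are hypothetical.

```python
from itertools import combinations


def counts(expert, predicted):
    """Return (n11, n10, n01, n00) over all item pairs."""
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(expert)), 2):
        e = expert[i] == expert[j]
        p = predicted[i] == predicted[j]
        if e and p:
            n11 += 1
        elif e:
            n10 += 1
        elif p:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def jaccard(expert, predicted):
    """Pair-counting Jaccard score; n00 is deliberately ignored."""
    n11, n10, n01, _ = counts(expert, predicted)
    denom = n11 + n10 + n01
    return n11 / denom if denom else 0.0  # assumption: 0.0 when no pair is grouped


expert = ["a", "a", "a", "b", "b"]
predicted = [0, 0, 1, 1, 1]

# Append 20 items that each form their own group in both classifications.
extra = 20
expert_big = expert + [f"s{k}" for k in range(extra)]
predicted_big = predicted + [100 + k for k in range(extra)]

n00_small = counts(expert, predicted)[3]
n00_big = counts(expert_big, predicted_big)[3]

print(n00_small, jaccard(expert, predicted))
print(n00_big, jaccard(expert_big, predicted_big))
```

Only n00 changes between the two runs, so any score built from n11, n10 and n01 alone is unaffected by the added items.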