Title: | Efficient Computations of Standard Clustering Comparison Measures |
---|---|
Description: | Implements an efficient O(n) algorithm based on bucket-sorting for fast computation of standard clustering comparison measures. Available measures include adjusted Rand index (ARI), normalized information distance (NID), normalized mutual information (NMI), adjusted mutual information (AMI), normalized variation of information (NVI) and entropy, as described in Vinh et al. (2009) <doi:10.1145/1553374.1553511>. Also includes AMI since version 0.1.2, and a modified version of ARI (MARI), as described in Sundqvist et al. <doi:10.1007/s00180-022-01230-7>, and a simple Chi-square distance since version 1.0.0. |
Authors: | Julien Chiquet [aut, cre] , Guillem Rigaill [aut], Martina Sundqvist [aut], Valentin Dervieux [ctb], Florent Bersani [ctb] |
Maintainer: | Julien Chiquet <[email protected]> |
License: | GPL (>=3) |
Version: | 1.0.3 |
Built: | 2024-11-23 04:17:40 UTC |
Source: | https://github.com/jchiquet/aricode |
A function to compute the adjusted mutual information between two classifications
AMI(c1, c2)
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
a scalar with the adjusted mutual information.
ARI, RI, NID, NVI, NMI, clustComp
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
AMI(cl, iris$Species)
A function to compute the adjusted Rand index between two classifications
ARI(c1, c2)
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
a scalar with the adjusted Rand index.
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
ARI(cl, iris$Species)
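For readers who want to see what the index measures, the following is a minimal base-R sketch of the usual pair-counting formula for the adjusted Rand index, computed from a dense contingency table. The function name `naive_ari` is hypothetical, and this is not the package's bucket-sort implementation; it is only meant as a cross-check on small inputs.

```r
# Naive adjusted Rand index from a dense contingency table (base R only).
# `naive_ari` is an illustrative name, not part of aricode.
naive_ari <- function(c1, c2) {
  stopifnot(length(c1) == length(c2))
  tab <- table(factor(c1), factor(c2))          # contingency counts n_ij
  n <- sum(tab)
  n_pairs <- function(x) sum(x * (x - 1) / 2)   # sum of choose(x, 2)

  sum_ij <- n_pairs(tab)             # pairs grouped together in both labellings
  sum_i  <- n_pairs(rowSums(tab))    # pairs grouped together in c1
  sum_j  <- n_pairs(colSums(tab))    # pairs grouped together in c2
  total  <- n * (n - 1) / 2          # all pairs of observations

  expected <- sum_i * sum_j / total  # expected value under random labelling
  maximum  <- (sum_i + sum_j) / 2
  (sum_ij - expected) / (maximum - expected)
}

naive_ari(c(1, 1, 2, 2), c("b", "b", "a", "a"))  # identical partitions up to relabelling
```

On small data such as the iris example above, the result can be compared with `ARI(cl, iris$Species)`; the two should agree up to floating-point error if the package follows the same standard definition.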
A function to compute the Chi-square statistic
Chi2(c1, c2)
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
a scalar with the chi-square statistic.
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
Chi2(cl, iris$Species)
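As a point of comparison, here is a small base-R sketch of the Pearson chi-square statistic of the contingency table between two labellings. It assumes that this is the statistic meant above, which the text does not state explicitly; `naive_chi2` is a hypothetical name.

```r
# Pearson chi-square statistic of the contingency table (base R only).
# Assumption: the usual sum((observed - expected)^2 / expected) definition.
naive_chi2 <- function(c1, c2) {
  stopifnot(length(c1) == length(c2))
  tab <- table(factor(c1), factor(c2))   # factor() drops unused levels
  n <- sum(tab)
  # expected counts if the two labellings were independent
  expected <- outer(rowSums(tab), colSums(tab)) / n
  sum((tab - expected)^2 / expected)
}

naive_chi2(c(1, 1, 2, 2), c(1, 2, 1, 2))  # independent labellings
```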
A function to compute various measures of similarity between two classifications
clustComp(c1, c2)
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
a list with the RI, ARI, NMI, NVI and NID.
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
clustComp(cl, iris$Species)
A function to compute the empirical entropies of two classification vectors and their joint entropy
entropy(c1, c2)
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
a list with the two conditional entropies, the joint entropy and output of sortPairs.
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
entropy(cl, iris$Species)
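To make the quantities concrete, the sketch below computes empirical entropies from a contingency table in base R: the entropy of each labelling, their joint entropy, and the conditional entropies derived from them. The function name, the element names of the returned list and the use of the natural logarithm are assumptions of this sketch, not a description of what the package returns.

```r
# Empirical entropies of two labellings (base R only, natural logarithm).
# `naive_entropy` and the list element names are illustrative choices.
naive_entropy <- function(c1, c2) {
  stopifnot(length(c1) == length(c2))
  H <- function(counts) {
    p <- counts[counts > 0] / sum(counts)   # empirical probabilities
    -sum(p * log(p))
  }
  tab <- table(factor(c1), factor(c2))
  h_joint <- H(tab)                # H(U, V)
  h_u <- H(rowSums(tab))           # H(U)
  h_v <- H(colSums(tab))           # H(V)
  list(UV = h_joint, U = h_u, V = h_v,
       U_given_V = h_joint - h_v,  # H(U | V) = H(U, V) - H(V)
       V_given_U = h_joint - h_u)  # H(V | U) = H(U, V) - H(U)
}

str(naive_entropy(c(1, 1, 2, 2), c(1, 2, 1, 2)))
```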
A function to compute a modified adjusted Rand index between two classifications, based on a multinomial model, as proposed by Sundqvist et al. <doi:10.1007/s00180-022-01230-7>.
MARI(c1, c2)
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
a scalar with the modified ARI.
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
MARI(cl, iris$Species)
A function to compute a modified adjusted Rand index between two classifications, based on a multinomial model, as proposed by Sundqvist et al. <doi:10.1007/s00180-022-01230-7>. "Raw" means that the index is not divided by the (maximum - expected) value.
MARIraw(c1, c2)
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
a scalar with the modified ARI, without the division by the (maximum - expected) value.
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
MARIraw(cl, iris$Species)
A function to compute the normalized information distance (NID) between two classifications
NID(c1, c2)
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
a scalar with the normalized information distance.
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
NID(cl, iris$Species)
A function to compute the normalized mutual information (NMI) between two classifications
NMI(c1, c2, variant = c("max", "min", "sqrt", "sum", "joint"))
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
variant |
a string in ("max", "min", "sqrt", "sum", "joint"): the variant of NMI to compute. Defaults to "max". |
a scalar with the normalized mutual information.
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
NMI(cl, iris$Species)
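The variant names suggest different normalisers for the mutual information. The base-R sketch below shows one conventional reading of each name: the maximum, minimum, geometric mean or arithmetic mean of the two entropies, or the joint entropy. These normalisers are an assumption of this sketch, inferred from the variant names rather than taken from the package source, and `naive_nmi` is a hypothetical name.

```r
# Normalized mutual information with several conventional normalisers.
# Assumption: each variant divides MI by the quantity its name suggests.
naive_nmi <- function(c1, c2,
                      variant = c("max", "min", "sqrt", "sum", "joint")) {
  variant <- match.arg(variant)            # first element is the default
  stopifnot(length(c1) == length(c2))
  H <- function(counts) {
    p <- counts[counts > 0] / sum(counts)
    -sum(p * log(p))
  }
  tab <- table(factor(c1), factor(c2))
  h_u <- H(rowSums(tab))
  h_v <- H(colSums(tab))
  h_uv <- H(tab)
  mi <- h_u + h_v - h_uv                   # mutual information I(U; V)
  normaliser <- switch(variant,
    max   = max(h_u, h_v),
    min   = min(h_u, h_v),
    sqrt  = sqrt(h_u * h_v),
    sum   = (h_u + h_v) / 2,
    joint = h_uv
  )
  mi / normaliser
}

sapply(c("max", "min", "sqrt", "sum", "joint"),
       function(v) naive_nmi(c(1, 1, 2, 2, 3, 3), c(1, 1, 1, 2, 2, 2), variant = v))
```

Note that the sketch does not guard against a labelling with a single cluster, where the entropy (and hence some normalisers) is zero.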
A function to compute the normalized variation of information (NVI) between two classifications
NVI(c1, c2)
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
a scalar with the normalized variation of information.
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
NVI(cl, iris$Species)
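Both information-theoretic distances can be written in terms of the mutual information I(U; V) and the entropies. The base-R sketch below uses the definitions commonly attributed to Vinh et al., NVI = 1 - I(U; V) / H(U, V) and NID = 1 - I(U; V) / max(H(U), H(V)). It assumes the package follows these definitions; `naive_info_distances` is a hypothetical name.

```r
# NVI and NID from a dense contingency table (base R only).
# Assumption: NVI = 1 - MI / H(U,V) and NID = 1 - MI / max(H(U), H(V)).
naive_info_distances <- function(c1, c2) {
  stopifnot(length(c1) == length(c2))
  H <- function(counts) {
    p <- counts[counts > 0] / sum(counts)
    -sum(p * log(p))
  }
  tab <- table(factor(c1), factor(c2))
  h_u <- H(rowSums(tab))
  h_v <- H(colSums(tab))
  h_uv <- H(tab)
  mi <- h_u + h_v - h_uv            # mutual information
  c(NVI = 1 - mi / h_uv,
    NID = 1 - mi / max(h_u, h_v))
}

naive_info_distances(c(1, 1, 2, 2), c(1, 2, 1, 2))  # independent labellings
```

Both values are distances: they are 0 for identical partitions and approach 1 as the labellings share less information.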
A function to compute the Rand index between two classifications
RI(c1, c2)
c1 |
a vector containing the labels of the first classification. Must be a vector of characters, integers, numerics, or a factor, but not a list. |
c2 |
a vector containing the labels of the second classification. |
a scalar with the Rand index.
data(iris)
cl <- cutree(hclust(dist(iris[,-5])), 4)
RI(cl, iris$Species)
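The Rand index is the fraction of pairs of observations on which the two classifications agree, that is, pairs placed together in both or apart in both. The brute-force base-R sketch below follows that definition directly. It needs O(n^2) time and memory, so it is only suitable for checking small inputs; `brute_ri` is a hypothetical name.

```r
# Brute-force Rand index: fraction of observation pairs on which the
# two labellings agree. Quadratic in n, for illustration only.
brute_ri <- function(c1, c2) {
  stopifnot(length(c1) == length(c2), length(c1) >= 2)
  c1 <- as.character(c1)
  c2 <- as.character(c2)
  same1 <- outer(c1, c1, "==")   # TRUE where a pair shares a cluster in c1
  same2 <- outer(c2, c2, "==")   # TRUE where a pair shares a cluster in c2
  upper <- upper.tri(same1)      # each unordered pair counted once
  mean(same1[upper] == same2[upper])
}

brute_ri(c(1, 1, 2, 2), c(1, 2, 1, 2))
```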
A function to sort pairs of integers or factors and identify the pairs
sortPairs(c1, c2, spMat = FALSE)
c1 |
a vector of length n with values between 0 and N1 < n |
c2 |
a vector of length n with values between 0 and N2 < n |
spMat |
logical: whether to return the contingency table in sparse encoding (this costs more than the algorithm itself). Default is FALSE. |
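To illustrate the idea of identifying and counting label pairs, here is a minimal base-R sketch that encodes each (c1, c2) pair as a single integer key and tabulates the keys. It is not the package's bucket-sort routine and allocates one bin per possible pair, so it is only a conceptual aid. The function name `count_pairs` and the column names of its result are choices of this sketch.

```r
# Count the distinct (c1, c2) label pairs by tabulating an integer key.
# Illustrative only: uses a dense vector of K1 * K2 bins.
count_pairs <- function(c1, c2) {
  stopifnot(length(c1) == length(c2), length(c1) > 0)
  i1 <- as.integer(factor(c1)) - 1L    # 0-based codes for c1
  i2 <- as.integer(factor(c2)) - 1L    # 0-based codes for c2
  k1 <- max(i1) + 1L
  k2 <- max(i2) + 1L
  key <- i1 * k2 + i2                  # unique key per (i1, i2) pair
  counts <- tabulate(key + 1L, nbins = k1 * k2)
  keep <- which(counts > 0L)           # only pairs that actually occur
  data.frame(pair_c1 = (keep - 1L) %/% k2,
             pair_c2 = (keep - 1L) %% k2,
             nij     = counts[keep])
}

count_pairs(c("a", "a", "b", "b"), c(1, 2, 2, 2))
```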