From 3e822216aacf5f73bdaecf2cb47c66d061d646f6 Mon Sep 17 00:00:00 2001
From: "Documenter.jl"
Date: Sun, 1 Dec 2024 20:21:15 +0000
Subject: [PATCH] build based on 7a675b9

---
 dev/.documenter-siteinfo.json                 |   2 +-
 dev/affprop.html                              |   2 +-
 dev/algorithms.html                           |   4 +-
 dev/assets/documenter.js                      | 302 ++++++++-------
 dev/clu_quality_data.svg                      | 100 +++--
 dev/clu_quality_hard.svg                      | 292 +++++++-------
 dev/clu_quality_soft.svg                      | 194 +++++-----
 dev/dbscan.html                               |   2 +-
 dev/fuzzycmeans.html                          |  42 +-
 dev/hclust.html                               |   4 +-
 dev/index.html                                |   2 +-
 dev/init.html                                 |  10 +-
 ...means-4ebceade.svg => kmeans-bbe89919.svg} | 362 +++++++++---------
 dev/kmeans.html                               |  14 +-
 dev/kmedoids.html                             |   4 +-
 dev/mcl.html                                  |   2 +-
 dev/validate.html                             |  10 +-
 17 files changed, 681 insertions(+), 667 deletions(-)
 rename dev/{kmeans-4ebceade.svg => kmeans-bbe89919.svg} (69%)

diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json
index a951a3df..c4c823e0 100644
--- a/dev/.documenter-siteinfo.json
+++ b/dev/.documenter-siteinfo.json
@@ -1 +1 @@
-{"documenter":{"julia_version":"1.11.1","generation_timestamp":"2024-11-09T03:14:46","documenter_version":"1.7.0"}}
\ No newline at end of file
+{"documenter":{"julia_version":"1.11.1","generation_timestamp":"2024-12-01T20:21:09","documenter_version":"1.8.0"}}
\ No newline at end of file
diff --git a/dev/affprop.html b/dev/affprop.html
index 517e2d6c..c4a4adba 100644
--- a/dev/affprop.html
+++ b/dev/affprop.html
@@ -1,3 +1,3 @@
 Affinity Propagation · Clustering.jl

Affinity Propagation

Affinity propagation is a clustering algorithm based on message passing between data points. Similar to K-medoids, it looks at the (dis)similarities in the data, picks one exemplar data point for each cluster, and assigns every point in the data set to the cluster with the closest exemplar.

Clustering.affinitypropFunction
affinityprop(S::AbstractMatrix; [maxiter=200], [tol=1e-6], [damp=0.5],
-             [display=:none]) -> AffinityPropResult

Perform affinity propagation clustering based on a similarity matrix S.

$S_{ij}$ ($i ≠ j$) is the similarity (or the negated distance) between the $i$-th and $j$-th points, $S_{ii}$ defines the availability of the $i$-th point as an exemplar.

Arguments

  • damp::Real: the dampening coefficient, $0 ≤ \mathrm{damp} < 1$. Larger values indicate slower (and probably more stable) update. $\mathrm{damp} = 0$ disables dampening.
  • maxiter, tol, display: see common options

References

Brendan J. Frey and Delbert Dueck. Clustering by Passing Messages Between Data Points. Science, vol 315, pages 972-976, 2007.

source
Clustering.AffinityPropResultType
AffinityPropResult <: ClusteringResult

The output of affinity propagation clustering (affinityprop).

Fields

  • exemplars::Vector{Int}: indices of exemplars (cluster centers)
  • assignments::Vector{Int}: cluster assignments for each data point
  • iterations::Int: number of iterations executed
  • converged::Bool: converged or not
source
+ [display=:none]) -> AffinityPropResult

Perform affinity propagation clustering based on a similarity matrix S.

$S_{ij}$ ($i ≠ j$) is the similarity (or the negated distance) between the $i$-th and $j$-th points, $S_{ii}$ defines the availability of the $i$-th point as an exemplar.

Arguments

References

Brendan J. Frey and Delbert Dueck. Clustering by Passing Messages Between Data Points. Science, vol 315, pages 972-976, 2007.

source
Clustering.AffinityPropResultType
AffinityPropResult <: ClusteringResult

The output of affinity propagation clustering (affinityprop).

Fields

  • exemplars::Vector{Int}: indices of exemplars (cluster centers)
  • assignments::Vector{Int}: cluster assignments for each data point
  • iterations::Int: number of iterations executed
  • converged::Bool: converged or not
source
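
A minimal usage sketch for the function documented above, assuming Distances.jl is available for building the similarity matrix. The toy data, the choice of negated squared Euclidean distance as similarity, and the median-based diagonal (a common heuristic for the exemplar preference $S_{ii}$) are illustrative assumptions, not part of the documented API:

```julia
using Clustering, Distances, Statistics, Random

Random.seed!(42)
# toy data (assumption): two groups of 2-dimensional points stored as columns
X = hcat(randn(2, 20), randn(2, 20) .+ 10.0)
n = size(X, 2)

# similarity = negated squared Euclidean distance between points
S = -pairwise(SqEuclidean(), X, dims=2)

# S[i, i] controls how readily point i becomes an exemplar;
# here it is set to the median off-diagonal similarity (a heuristic, not a requirement)
pref = median([S[i, j] for i in 1:n for j in 1:n if i != j])
for i in 1:n
    S[i, i] = pref
end

R = affinityprop(S; maxiter=200, tol=1e-6, damp=0.5)

println("exemplars: ", R.exemplars)
println("converged: ", R.converged, " after ", R.iterations, " iterations")
```

Lower values on the diagonal of S generally lead to fewer exemplars, so the diagonal is the main knob for the number of clusters.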
diff --git a/dev/algorithms.html b/dev/algorithms.html
index b99cb8d0..f4bb8bb5 100644
--- a/dev/algorithms.html
+++ b/dev/algorithms.html
@@ -1,3 +1,3 @@
-Basics · Clustering.jl

Basics

The package implements a variety of clustering algorithms:

Most of the clustering functions in the package have a similar interface, making it easy to switch between different clustering algorithms.

Inputs

A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:

  • Data matrix $X$ of size $d \times n$, the $i$-th column of $X$ (X[:, i]) is a data point (data sample) in $d$-dimensional space.
  • Distance matrix $D$ of size $n \times n$, where $D_{ij}$ is the distance between the $i$-th and $j$-th points, or the cost of assigning them to the same cluster.

Common Options

Many clustering algorithms are iterative procedures. The functions share the basic options for controlling the iterations:

  • maxiter::Integer: maximum number of iterations.
  • tol::Real: minimal allowed change of the objective during convergence. The algorithm is considered to be converged when the change of objective value between consecutive iterations drops below tol.
  • display::Symbol: the level of information to be displayed. It may take one of the following values:
    • :none: nothing is shown
    • :final: only shows a brief summary when the algorithm ends
    • :iter: shows the progress at each iteration

Results

A clustering function would return an object (typically, an instance of some ClusteringResult subtype) that contains both the resulting clustering (e.g. assignments of points to the clusters) and the information about the clustering algorithm (e.g. the number of iterations and whether it converged).

The following generic methods are supported by any subtype of ClusteringResult:

StatsBase.countsMethod
counts(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster sizes.

counts(R)[k] is the number of points assigned to the $k$-th cluster.

source
Clustering.wcountsMethod
wcounts(R::ClusteringResult) -> Vector{Float64}
-wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings, the weight of every data point is assumed to be 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source
Clustering.assignmentsMethod
assignments(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster indices for each point.

assignments(R)[i] is the index of the cluster to which the $i$-th point is assigned.

source
+Basics · Clustering.jl

Basics

The package implements a variety of clustering algorithms:

Most of the clustering functions in the package have a similar interface, making it easy to switch between different clustering algorithms.

Inputs

A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:

  • Data matrix $X$ of size $d \times n$, the $i$-th column of $X$ (X[:, i]) is a data point (data sample) in $d$-dimensional space.
  • Distance matrix $D$ of size $n \times n$, where $D_{ij}$ is the distance between the $i$-th and $j$-th points, or the cost of assigning them to the same cluster.

Common Options

Many clustering algorithms are iterative procedures. The functions share the basic options for controlling the iterations:

  • maxiter::Integer: maximum number of iterations.
  • tol::Real: minimal allowed change of the objective during convergence. The algorithm is considered to be converged when the change of objective value between consecutive iterations drops below tol.
  • display::Symbol: the level of information to be displayed. It may take one of the following values:
    • :none: nothing is shown
    • :final: only shows a brief summary when the algorithm ends
    • :iter: shows the progress at each iteration

Results

A clustering function would return an object (typically, an instance of some ClusteringResult subtype) that contains both the resulting clustering (e.g. assignments of points to the clusters) and the information about the clustering algorithm (e.g. the number of iterations and whether it converged).

The following generic methods are supported by any subtype of ClusteringResult:

StatsBase.countsMethod
counts(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster sizes.

counts(R)[k] is the number of points assigned to the $k$-th cluster.

source
Clustering.wcountsMethod
wcounts(R::ClusteringResult) -> Vector{Float64}
+wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings, the weight of every data point is assumed to be 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source
Clustering.assignmentsMethod
assignments(R::ClusteringResult) -> Vector{Int}

Get the vector of cluster indices for each point.

assignments(R)[i] is the index of the cluster to which the $i$-th point is assigned.

source
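
The common options and the generic accessors described above can be sketched together as follows. This assumes the package's kmeans function as the clustering algorithm; the random data and the option values are arbitrary illustrations:

```julia
using Clustering, Random

Random.seed!(1)
# d × n data matrix: 100 points in 5 dimensions, one point per column
X = rand(5, 100)

# the shared iteration controls: maxiter, tol, display
R = kmeans(X, 3; maxiter=100, tol=1e-6, display=:none)

a = assignments(R)   # a[i] is the cluster index of the i-th point
c = counts(R)        # c[k] is the number of points in the k-th cluster
w = wcounts(R)       # weighted sizes; equal to the counts when no weights are given

println("cluster sizes: ", c)
```

Because every ClusteringResult subtype supports these accessors, code that only uses assignments, counts, and wcounts does not need to change when the algorithm is swapped.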
diff --git a/dev/assets/documenter.js b/dev/assets/documenter.js
index 82252a11..7d68cd80 100644
--- a/dev/assets/documenter.js
+++ b/dev/assets/documenter.js
@@ -612,176 +612,194 @@ function worker_function(documenterSearchIndex, documenterBaseURL, filters) {
   };
 }
 
-// `worker = Threads.@spawn worker_function(documenterSearchIndex)`, but in JavaScript!
-const filters = [
-  ...new Set(documenterSearchIndex["docs"].map((x) => x.category)),
-];
-const worker_str =
-  "(" +
-  worker_function.toString() +
-  ")(" +
-  JSON.stringify(documenterSearchIndex["docs"]) +
-  "," +
-  JSON.stringify(documenterBaseURL) +
-  "," +
-  JSON.stringify(filters) +
-  ")";
-const worker_blob = new Blob([worker_str], { type: "text/javascript" });
-const worker = new Worker(URL.createObjectURL(worker_blob));
-
 /////// SEARCH MAIN ///////
 
-// Whether the worker is currently handling a search. This is a boolean
-// as the worker only ever handles 1 or 0 searches at a time.
-var worker_is_running = false;
-
-// The last search text that was sent to the worker. This is used to determine
-// if the worker should be launched again when it reports back results.
-var last_search_text = "";
-
-// The results of the last search. This, in combination with the state of the filters
-// in the DOM, is used compute the results to display on calls to update_search.
-var unfiltered_results = []; - -// Which filter is currently selected -var selected_filter = ""; - -$(document).on("input", ".documenter-search-input", function (event) { - if (!worker_is_running) { - launch_search(); - } -}); - -function launch_search() { - worker_is_running = true; - last_search_text = $(".documenter-search-input").val(); - worker.postMessage(last_search_text); -} - -worker.onmessage = function (e) { - if (last_search_text !== $(".documenter-search-input").val()) { - launch_search(); - } else { - worker_is_running = false; - } - - unfiltered_results = e.data; - update_search(); -}; +function runSearchMainCode() { + // `worker = Threads.@spawn worker_function(documenterSearchIndex)`, but in JavaScript! + const filters = [ + ...new Set(documenterSearchIndex["docs"].map((x) => x.category)), + ]; + const worker_str = + "(" + + worker_function.toString() + + ")(" + + JSON.stringify(documenterSearchIndex["docs"]) + + "," + + JSON.stringify(documenterBaseURL) + + "," + + JSON.stringify(filters) + + ")"; + const worker_blob = new Blob([worker_str], { type: "text/javascript" }); + const worker = new Worker(URL.createObjectURL(worker_blob)); + + // Whether the worker is currently handling a search. This is a boolean + // as the worker only ever handles 1 or 0 searches at a time. + var worker_is_running = false; + + // The last search text that was sent to the worker. This is used to determine + // if the worker should be launched again when it reports back results. + var last_search_text = ""; + + // The results of the last search. This, in combination with the state of the filters + // in the DOM, is used compute the results to display on calls to update_search. 
+ var unfiltered_results = []; + + // Which filter is currently selected + var selected_filter = ""; + + $(document).on("input", ".documenter-search-input", function (event) { + if (!worker_is_running) { + launch_search(); + } + }); -$(document).on("click", ".search-filter", function () { - if ($(this).hasClass("search-filter-selected")) { - selected_filter = ""; - } else { - selected_filter = $(this).text().toLowerCase(); + function launch_search() { + worker_is_running = true; + last_search_text = $(".documenter-search-input").val(); + worker.postMessage(last_search_text); } - // This updates search results and toggles classes for UI: - update_search(); -}); + worker.onmessage = function (e) { + if (last_search_text !== $(".documenter-search-input").val()) { + launch_search(); + } else { + worker_is_running = false; + } -/** - * Make/Update the search component - */ -function update_search() { - let querystring = $(".documenter-search-input").val(); + unfiltered_results = e.data; + update_search(); + }; - if (querystring.trim()) { - if (selected_filter == "") { - results = unfiltered_results; + $(document).on("click", ".search-filter", function () { + if ($(this).hasClass("search-filter-selected")) { + selected_filter = ""; } else { - results = unfiltered_results.filter((result) => { - return selected_filter == result.category.toLowerCase(); - }); + selected_filter = $(this).text().toLowerCase(); } - let search_result_container = ``; - let modal_filters = make_modal_body_filters(); - let search_divider = `
`; + // This updates search results and toggles classes for UI: + update_search(); + }); - if (results.length) { - let links = []; - let count = 0; - let search_results = ""; - - for (var i = 0, n = results.length; i < n && count < 200; ++i) { - let result = results[i]; - if (result.location && !links.includes(result.location)) { - search_results += result.div; - count++; - links.push(result.location); - } - } + /** + * Make/Update the search component + */ + function update_search() { + let querystring = $(".documenter-search-input").val(); - if (count == 1) { - count_str = "1 result"; - } else if (count == 200) { - count_str = "200+ results"; + if (querystring.trim()) { + if (selected_filter == "") { + results = unfiltered_results; } else { - count_str = count + " results"; + results = unfiltered_results.filter((result) => { + return selected_filter == result.category.toLowerCase(); + }); } - let result_count = `
${count_str}
`; - search_result_container = ` + let search_result_container = ``; + let modal_filters = make_modal_body_filters(); + let search_divider = `
`; + + if (results.length) { + let links = []; + let count = 0; + let search_results = ""; + + for (var i = 0, n = results.length; i < n && count < 200; ++i) { + let result = results[i]; + if (result.location && !links.includes(result.location)) { + search_results += result.div; + count++; + links.push(result.location); + } + } + + if (count == 1) { + count_str = "1 result"; + } else if (count == 200) { + count_str = "200+ results"; + } else { + count_str = count + " results"; + } + let result_count = `
${count_str}
`; + + search_result_container = ` +
+ ${modal_filters} + ${search_divider} + ${result_count} +
+ ${search_results} +
+
+ `; + } else { + search_result_container = `
${modal_filters} ${search_divider} - ${result_count} -
- ${search_results} -
-
+
0 result(s)
+ +
No result found!
`; - } else { - search_result_container = ` -
- ${modal_filters} - ${search_divider} -
0 result(s)
-
-
No result found!
- `; - } + } - if ($(".search-modal-card-body").hasClass("is-justify-content-center")) { - $(".search-modal-card-body").removeClass("is-justify-content-center"); - } + if ($(".search-modal-card-body").hasClass("is-justify-content-center")) { + $(".search-modal-card-body").removeClass("is-justify-content-center"); + } - $(".search-modal-card-body").html(search_result_container); - } else { - if (!$(".search-modal-card-body").hasClass("is-justify-content-center")) { - $(".search-modal-card-body").addClass("is-justify-content-center"); + $(".search-modal-card-body").html(search_result_container); + } else { + if (!$(".search-modal-card-body").hasClass("is-justify-content-center")) { + $(".search-modal-card-body").addClass("is-justify-content-center"); + } + + $(".search-modal-card-body").html(` +
Type something to get started!
+ `); } + } - $(".search-modal-card-body").html(` -
Type something to get started!
- `); + /** + * Make the modal filter html + * + * @returns string + */ + function make_modal_body_filters() { + let str = filters + .map((val) => { + if (selected_filter == val.toLowerCase()) { + return `${val}`; + } else { + return `${val}`; + } + }) + .join(""); + + return ` +
+ Filters: + ${str} +
`; } } -/** - * Make the modal filter html - * - * @returns string - */ -function make_modal_body_filters() { - let str = filters - .map((val) => { - if (selected_filter == val.toLowerCase()) { - return `${val}`; - } else { - return `${val}`; - } - }) - .join(""); - - return ` -
- Filters: - ${str} -
`;
+function waitUntilSearchIndexAvailable() {
+  // It is possible that the documenter.js script runs before the page
+  // has finished loading and documenterSearchIndex gets defined.
+  // So we need to wait until the search index actually loads before setting
+  // up all the search-related stuff.
+  if (typeof documenterSearchIndex !== "undefined") {
+    runSearchMainCode();
+  } else {
+    console.warn("Search Index not available, waiting");
+    setTimeout(waitUntilSearchIndexAvailable, 1000);
+  }
 }
 
+// The actual entry point to the search code
+waitUntilSearchIndexAvailable();
+
 })
 ////////////////////////////////////////////////////////////////////////////////
 require(['jquery'], function($) {
diff --git a/dev/clu_quality_data.svg b/dev/clu_quality_data.svg
index 67a5c2e7..3fbbf7b2 100644
--- a/dev/clu_quality_data.svg
+++ b/dev/clu_quality_data.svg
@@ -1,68 +1,64 @@
(regenerated SVG plot; markup not preserved)
diff --git a/dev/clu_quality_hard.svg b/dev/clu_quality_hard.svg
index 02f7a5b7..21e1198f 100644
--- a/dev/clu_quality_hard.svg
+++ b/dev/clu_quality_hard.svg
diff --git a/dev/clu_quality_soft.svg b/dev/clu_quality_soft.svg
index c876ff29..e169b5dd 100644
--- a/dev/clu_quality_soft.svg
+++ b/dev/clu_quality_soft.svg
@@ -1,116 +1,116 @@
(regenerated SVG plot; markup not preserved)
diff --git a/dev/dbscan.html b/dev/dbscan.html
index 67cadf43..f8e7d183 100644
--- a/dev/dbscan.html
+++ b/dev/dbscan.html
@@ -4,4 +4,4 @@ [min_neighbors=1], [min_cluster_size=1],
[nntree_kwargs...]) -> DbscanResult

Cluster points using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

Arguments

  • points: when metric is specified, the d×n matrix, where each column is a d-dimensional coordinate of a point; when metric=nothing, the n×n matrix of pairwise distances between the points
  • radius::Real: neighborhood radius; points within this distance are considered neighbors

Optional keyword arguments to control the algorithm:

  • metric (defaults to Euclidean()): the points distance metric to use, nothing means points is the n×n precalculated distance matrix
  • min_neighbors::Integer (defaults to 1): the minimal number of neighbors required to assign a point to a cluster "core"
  • min_cluster_size::Integer (defaults to 1): the minimal number of points in a cluster; cluster candidates with fewer points are discarded
  • nntree_kwargs...: parameters (like leafsize) for the KDTree constructor

Example

points = randn(3, 10000)
 # DBSCAN clustering, clusters with fewer than 20 points will be discarded:
-clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)

References:

  • Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", KDD-1996, pp. 226–231.
  • Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu, "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN", ACM Transactions on Database Systems, Vol. 42(3), pp. 1–21, https://doi.org/10.1145/3068335
source
Clustering.DbscanResultType
DbscanResult <: ClusteringResult

The output of dbscan function.

Fields

  • clusters::Vector{DbscanCluster}: clusters, length K
  • seeds::Vector{Int}: indices of the first points of each cluster's core, length K
  • counts::Vector{Int}: cluster sizes (number of assigned points), length K
  • assignments::Vector{Int}: vector of cluster indices that each point was assigned to, length N
source
Clustering.DbscanClusterType
DbscanCluster

DBSCAN cluster, part of DbscanResult returned by dbscan function.

Fields

  • size::Int: number of points in a cluster (core + boundary)
  • core_indices::Vector{Int}: indices of points in the cluster core, a.k.a. seeds (have at least min_neighbors neighbors in the cluster)
  • boundary_indices::Vector{Int}: indices of the cluster points outside of core
source
+clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)

References:

  • Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", KDD-1996, pp. 226–231.
  • Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu, "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN", ACM Transactions on Database Systems, Vol. 42(3), pp. 1–21, https://doi.org/10.1145/3068335
source
Clustering.DbscanResultType
DbscanResult <: ClusteringResult

The output of dbscan function.

Fields

  • clusters::Vector{DbscanCluster}: clusters, length K
  • seeds::Vector{Int}: indices of the first points of each cluster's core, length K
  • counts::Vector{Int}: cluster sizes (number of assigned points), length K
  • assignments::Vector{Int}: vector of cluster indices that each point was assigned to, length N
source
Clustering.DbscanClusterType
DbscanCluster

DBSCAN cluster, part of DbscanResult returned by dbscan function.

Fields

  • size::Int: number of points in a cluster (core + boundary)
  • core_indices::Vector{Int}: indices of points in the cluster core, a.k.a. seeds (have at least min_neighbors neighbors in the cluster)
  • boundary_indices::Vector{Int}: indices of the cluster points outside of core
source
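
The result fields listed above can be inspected as in the following sketch. The two-blob toy data and the radius value are illustrative assumptions. The sketch also assumes that points left out of every cluster (noise) get assignment 0, which the field descriptions above do not state explicitly:

```julia
using Clustering, Random

Random.seed!(7)
# toy data (assumption): two tight 2-dimensional blobs, one point per column
points = hcat(0.1 .* randn(2, 200), 0.1 .* randn(2, 200) .+ 5.0)

result = dbscan(points, 0.2; min_neighbors=3, min_cluster_size=20)

# per-cluster summary built from the DbscanCluster fields
for (k, cl) in enumerate(result.clusters)
    println("cluster $k: size=$(cl.size), ",
            "core=$(length(cl.core_indices)), ",
            "boundary=$(length(cl.boundary_indices))")
end

# assumption: unassigned (noise) points carry the assignment 0
noise = count(==(0), result.assignments)
println("noise points: ", noise)
```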
diff --git a/dev/fuzzycmeans.html b/dev/fuzzycmeans.html
index e60b333d..2fe67635 100644
--- a/dev/fuzzycmeans.html
+++ b/dev/fuzzycmeans.html
@@ -1,8 +1,8 @@
 Fuzzy C-means · Clustering.jl

Fuzzy C-means

Fuzzy C-means is a clustering method that provides cluster membership weights instead of "hard" classification (e.g. K-means).

From a mathematical standpoint, fuzzy C-means solves the following optimization problem:

\[\arg\min_\mathcal{C} \ \sum_{i=1}^n \sum_{j=1}^C w_{ij}^\mu \| \mathbf{x}_i - \mathbf{c}_j \|^2, \ \text{where}\ w_{ij} = \left(\sum_{k=1}^{C} \left(\frac{\left\|\mathbf{x}_i - \mathbf{c}_j \right\|}{\left\|\mathbf{x}_i - \mathbf{c}_k \right\|}\right)^{\frac{2}{\mu-1}}\right)^{-1}\]

Here, $\mathbf{c}_j$ is the center of the $j$-th cluster, $w_{ij}$ is the membership weight of the $i$-th point in the $j$-th cluster, and $\mu > 1$ is a user-defined fuzziness parameter.
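
The membership formula can be checked by hand with a small sketch. The helper name membership_weights is hypothetical (it is not part of Clustering.jl), and the sketch assumes the Euclidean norm and a point that does not coincide with any center:

```julia
using LinearAlgebra

# Membership weights of a single point x with respect to the centers stored
# as columns of C, for fuzziness μ > 1 (hypothetical helper, for illustration only).
function membership_weights(x::AbstractVector, C::AbstractMatrix, μ::Real)
    d = [norm(x - C[:, j]) for j in 1:size(C, 2)]   # distance to every center
    # w_j = ( Σ_k (d_j / d_k)^(2/(μ-1)) )^(-1)
    return [1 / sum((d[j] / d[k])^(2 / (μ - 1)) for k in eachindex(d))
            for j in eachindex(d)]
end

x = [0.0, 0.0]
C = [1.0 2.0;
     0.0 0.0]                 # two centers: (1, 0) and (2, 0)
w = membership_weights(x, C, 2.0)
println(w)                    # the weights of one point sum to 1
```

With distances 1 and 2 and $\mu = 2$, the exponent is 2, so the weights are $1/(1 + 1/4) = 0.8$ and $1/(4 + 1) = 0.2$: the closer center receives the larger membership.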

Clustering.fuzzy_cmeansFunction
fuzzy_cmeans(data::AbstractMatrix, C::Integer, fuzziness::Real;
-             [dist_metric::SemiMetric], [...]) -> FuzzyCMeansResult

Perform Fuzzy C-means clustering over the given data.

Arguments

  • data::AbstractMatrix: $d×n$ data matrix. Each column represents one $d$-dimensional data point.
  • C::Integer: the number of fuzzy clusters, $2 ≤ C < n$.
  • fuzziness::Real: clusters fuzziness ($μ$ in the mathematical formulation), $μ > 1$.

Optional keyword arguments:

  • dist_metric::SemiMetric (defaults to Euclidean): the SemiMetric object that defines the distance between the data points
  • maxiter, tol, display, rng: see common options
source
Clustering.FuzzyCMeansResultType
FuzzyCMeansResult{T<:AbstractFloat}

The output of fuzzy_cmeans function.

Fields

  • centers::Matrix{T}: the $d×C$ matrix with columns being the centers of resulting fuzzy clusters
  • weights::Matrix{Float64}: the $n×C$ matrix of assignment weights ($\mathrm{weights}_{ij}$ is the weight (probability) of assigning $i$-th point to the $j$-th cluster)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source
Clustering.wcountsFunction
wcounts(R::ClusteringResult) -> Vector{Float64}
-wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings, the weight of every data point is assumed to be 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source

Examples

using Clustering
+             [dist_metric::SemiMetric], [...]) -> FuzzyCMeansResult

Perform Fuzzy C-means clustering over the given data.

Arguments

  • data::AbstractMatrix: $d×n$ data matrix. Each column represents one $d$-dimensional data point.
  • C::Integer: the number of fuzzy clusters, $2 ≤ C < n$.
  • fuzziness::Real: clusters fuzziness ($μ$ in the mathematical formulation), $μ > 1$.

Optional keyword arguments:

  • dist_metric::SemiMetric (defaults to Euclidean): the SemiMetric object that defines the distance between the data points
  • maxiter, tol, display, rng: see common options
source
Clustering.FuzzyCMeansResultType
FuzzyCMeansResult{T<:AbstractFloat}

The output of fuzzy_cmeans function.

Fields

  • centers::Matrix{T}: the $d×C$ matrix with columns being the centers of resulting fuzzy clusters
  • weights::Matrix{Float64}: the $n×C$ matrix of assignment weights ($\mathrm{weights}_{ij}$ is the weight (probability) of assigning $i$-th point to the $j$-th cluster)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source
Clustering.wcountsFunction
wcounts(R::ClusteringResult) -> Vector{Float64}
+wcounts(R::FuzzyCMeansResult) -> Vector{Float64}

Get the weighted cluster sizes as the sum of weights of points assigned to each cluster.

For non-weighted clusterings, the weight of every data point is assumed to be 1.0, so the result is equivalent to convert(Vector{Float64}, counts(R)).

source

Examples

using Clustering
 
 # make a random dataset with 1000 points
 # each point is a 5-dimensional vector
@@ -21,23 +21,23 @@
 # get the point memberships over all the clusters
 # memberships is a 1000x3 matrix
 memberships = R.weights
1000×3 Matrix{Float64}:
- 0.331547  0.3372    0.331253
- 0.334966  0.331843  0.333191
- 0.332288  0.333973  0.333739
- 0.335377  0.3303    0.334323
- 0.331942  0.333481  0.334577
- 0.330755  0.336604  0.332641
- 0.336519  0.329899  0.333582
- 0.330629  0.33727   0.332101
- 0.33612   0.329477  0.334403
- 0.33436   0.333144  0.332497
+ 0.33434   0.331047  0.334612
+ 0.334198  0.333619  0.332183
+ 0.333612  0.33434   0.332048
+ 0.334455  0.331583  0.333961
+ 0.33395   0.334411  0.331639
+ 0.333246  0.334795  0.331959
+ 0.33284   0.333236  0.333924
+ 0.331636  0.33434   0.334024
+ 0.333507  0.332945  0.333548
+ 0.332767  0.335326  0.331907
  ⋮                   
- 0.332021  0.333548  0.334431
- 0.334237  0.330794  0.334968
- 0.336422  0.329061  0.334517
- 0.334682  0.33261   0.332707
- 0.334882  0.331571  0.333548
- 0.332936  0.33261   0.334454
- 0.332097  0.334498  0.333406
- 0.328874  0.339566  0.33156
- 0.335696  0.33333   0.330973
+ 0.331098  0.334792  0.33411
+ 0.331947  0.337083  0.33097
+ 0.333607  0.331842  0.334551
+ 0.331902  0.333937  0.334162
+ 0.334562  0.332209  0.333229
+ 0.335227  0.332442  0.332332
+ 0.332186  0.331953  0.33586
+ 0.331836  0.334882  0.333282
+ 0.333181  0.33488   0.331939
diff --git a/dev/hclust.html b/dev/hclust.html
index 7faa16b8..033c94b6 100644
--- a/dev/hclust.html
+++ b/dev/hclust.html
@@ -1,5 +1,5 @@
-Hierarchical Clustering · Clustering.jl

Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The hclust function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the linkage parameter):

Clustering.hclustFunction
hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust

Perform hierarchical clustering using the distance matrix d and the cluster linkage function.

Returns the dendrogram as a Hclust object.

Arguments

  • d::AbstractMatrix: the pairwise distance matrix. $d_{ij}$ is the distance between $i$-th and $j$-th points.
  • linkage::Symbol: cluster linkage function to use. linkage defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:
    • :single (the default): use the minimum distance between any of the cluster members
    • :average: use the mean distance between any of the cluster members
    • :complete: use the maximum distance between any of the members
    • :ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
    • :ward_presquared: same as :ward, but assumes that the distances in d are already squared.
  • uplo::Symbol (optional): specifies whether the upper (:U) or the lower (:L) triangle of d should be used to get the distances. If not specified, the method expects d to be symmetric.
  • branchorder::Symbol (optional): algorithm to order leaves and branches. The valid choices are:
    • :r (the default): ordering based on the node heights and the original elements order (compatible with R's hclust)
    • :barjoseph (or :optimal): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et al., Bioinformatics (2001)
source
Clustering.HclustType
Hclust{T<:Real}

The output of hclust, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's hclust class.

Fields

  • merges::Matrix{Int}: $N×2$ matrix encoding subtree merges:
    • each row specifies the left and right subtrees (referenced by their $id$s) that are merged
    • negative subtree $id$ denotes the leaf node and corresponds to the data point at position $-id$
    • positive $id$ denotes nontrivial subtree (the row merges[id, :] specifies its left and right subtrees)
  • linkage::Symbol: the name of cluster linkage function used to construct the hierarchy (see hclust)
  • heights::Vector{T}: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified linkage
  • order::Vector{Int}: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.

See also: hclust.

source
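
The merges encoding described above can be decoded as in the following sketch. The four-point distance matrix is a made-up example, and cutree is assumed here as the package's function for cutting the dendrogram into flat clusters:

```julia
using Clustering

# made-up example: four points on a line at positions 0, 1, 5, 6
pos = [0.0, 1.0, 5.0, 6.0]
D = [abs(a - b) for a in pos, b in pos]   # symmetric pairwise distance matrix

hc = hclust(D; linkage=:single)

# negative id: leaf (data point -id); positive id: subtree created by an earlier merge
describe(id) = id < 0 ? "point $(-id)" : "result of merge $id"

for i in 1:size(hc.merges, 1)
    l, r = hc.merges[i, :]
    println("merge $i at height $(hc.heights[i]): ", describe(l), " + ", describe(r))
end

# cut the dendrogram into two flat clusters
labels = cutree(hc; k=2)
println(labels)
```

With single linkage, the two close pairs are merged first at height 1, and the final merge happens at height 4, the smallest distance between the two pairs.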

Single-linkage clustering using distance matrix:

using Clustering
+Hierarchical Clustering · Clustering.jl

Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The hclust function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the linkage parameter):

Clustering.hclustFunction
hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust

Perform hierarchical clustering using the distance matrix d and the cluster linkage function.

Returns the dendrogram as a Hclust object.

Arguments

  • d::AbstractMatrix: the pairwise distance matrix. $d_{ij}$ is the distance between $i$-th and $j$-th points.
  • linkage::Symbol: cluster linkage function to use. linkage defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:
    • :single (the default): use the minimum distance between any of the cluster members
    • :average: use the mean distance between any of the cluster members
    • :complete: use the maximum distance between any of the members
    • :ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
    • :ward_presquared: same as :ward, but assumes that the distances in d are already squared.
  • uplo::Symbol (optional): specifies whether the upper (:U) or the lower (:L) triangle of d should be used to get the distances. If not specified, the method expects d to be symmetric.
  • branchorder::Symbol (optional): algorithm to order leaves and branches. The valid choices are:
    • :r (the default): ordering based on the node heights and the original elements order (compatible with R's hclust)
    • :barjoseph (or :optimal): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et al. Bioinformatics (2001)
source
Clustering.HclustType
Hclust{T<:Real}

The output of hclust, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's hclust class.

Fields

  • merges::Matrix{Int}: $N×2$ matrix encoding subtree merges:
    • each row specifies the left and right subtrees (referenced by their $id$s) that are merged
    • negative subtree $id$ denotes the leaf node and corresponds to the data point at position $-id$
    • positive $id$ denotes nontrivial subtree (the row merges[id, :] specifies its left and right subtrees)
  • linkage::Symbol: the name of cluster linkage function used to construct the hierarchy (see hclust)
  • heights::Vector{T}: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified linkage
  • order::Vector{Int}: the data point indices ordered so that there are no intersecting branches on the dendrogram plot. This ordering also puts the points of the same cluster close together.

See also: hclust.

source
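
The merges encoding described above can be unpacked with a short recursive helper. This is only an illustrative sketch: the `leaves` function and the small `merges` matrix are hypothetical and not part of Clustering.jl; they simply follow the id convention from the field list (negative id for a leaf, positive id for an earlier row).

```julia
# Hypothetical merges matrix for 4 data points, using the encoding above.
merges = [-1 -2;   # row 1 merges leaves 1 and 2
          -3 -4;   # row 2 merges leaves 3 and 4
           1  2]   # row 3 merges the subtrees built in rows 1 and 2

# Collect the data point indices under the subtree with the given id.
function leaves(merges::AbstractMatrix{<:Integer}, id::Integer)
    id < 0 && return [-id]                      # negative id: a single leaf
    return vcat(leaves(merges, merges[id, 1]),  # left subtree
                leaves(merges, merges[id, 2]))  # right subtree
end

leaves(merges, 3)   # all four points: [1, 2, 3, 4]
```

For a real result, the same helper could be applied to the merges field of an Hclust object.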

Single-linkage clustering using distance matrix:

using Clustering
 D = rand(1000, 1000);
 D += D'; # symmetric distance matrix (optional)
-result = hclust(D, linkage=:single)
Hclust{Float64}([-43 -563; -89 -340; … ; -965 997; -537 998], [0.0011784747598576617, 0.0020621402489011675, 0.002205510428833657, 0.0023022856901746547, 0.0025748580696793866, 0.0028192853461130873, 0.0034057451218011403, 0.004250944650608712, 0.004586261434821326, 0.004793524374898439  …  0.09962207180551419, 0.09976152056947407, 0.1018549437247126, 0.105220554470696, 0.10529865595088272, 0.11204010438718115, 0.1137945171448681, 0.11591864893770931, 0.1290067998868235, 0.13090603819635727], [537, 965, 45, 986, 145, 87, 5, 102, 749, 747  …  276, 585, 425, 580, 855, 912, 936, 632, 554, 654], :single)

The resulting dendrogram can be converted into disjoint clusters with the cutree function.

Clustering.cutreeFunction
cutree(hclu::Hclust; [k], [h]) -> Vector{Int}

Cut the hclu dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector $z$ ($z_i$ is the index of the cluster for the $i$-th data point).

Arguments

  • k::Integer (optional): the number of desired clusters.
  • h::Real (optional): the height at which the tree is cut.

If both k and h are specified, it's guaranteed that the number of clusters is not less than k and their height is not above h.

See also: hclust

source
+result = hclust(D, linkage=:single)
Hclust{Float64}([-169 -892; -86 -367; … ; -343 997; -35 998], [0.0015428578887612954, 0.002736735390239886, 0.003924001099635643, 0.004121093708125079, 0.004510234313405248, 0.0045289704966777755, 0.0052701819970920605, 0.006836808958871399, 0.006918593296885489, 0.006989519474230876  …  0.09189410141848808, 0.09360172432041025, 0.0943248823487971, 0.09633363016504559, 0.10033366371517949, 0.10878201738842352, 0.11430185157603057, 0.12029869624690526, 0.12870550975811612, 0.13128945516202328], [35, 343, 39, 754, 887, 31, 560, 294, 131, 213  …  932, 210, 585, 459, 561, 782, 582, 291, 155, 358], :single)

The resulting dendrogram can be converted into disjoint clusters with the cutree function.

Clustering.cutreeFunction
cutree(hclu::Hclust; [k], [h]) -> Vector{Int}

Cut the hclu dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector $z$ ($z_i$ is the index of the cluster for the $i$-th data point).

Arguments

  • k::Integer (optional): the number of desired clusters.
  • h::Real (optional): the height at which the tree is cut.

If both k and h are specified, it's guaranteed that the number of clusters is not less than k and their height is not above h.

See also: hclust

source
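
As a minimal usage sketch, the dendrogram can be cut into a fixed number of clusters with the k keyword. The matrix size and the value k=4 are arbitrary choices for illustration, and the data are random, so the actual assignments differ between runs.

```julia
using Clustering

D = rand(50, 50)
D += D'                            # make the distance matrix symmetric

hc = hclust(D, linkage=:single)    # build the dendrogram
z = cutree(hc, k=4)                # z[i] is the cluster index of the i-th point

length(z)            # one assignment per data point
length(unique(z))    # number of clusters produced by the cut
```

Passing h instead of k cuts the tree at a given height rather than at a given cluster count.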
diff --git a/dev/index.html b/dev/index.html index c7d45385..6102ebc5 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Introduction · Clustering.jl
+Introduction · Clustering.jl
diff --git a/dev/init.html b/dev/init.html index e43a682d..f02d1771 100644 --- a/dev/init.html +++ b/dev/init.html @@ -1,6 +1,6 @@ -Initialization · Clustering.jl

Initialization

A clustering algorithm usually requires initialization before it can be started.

Seeding

Seeding is a type of clustering initialization, which provides a few seeds – points from a data set that would serve as the initial cluster centers (one for each cluster).

Each seeding algorithm implemented by Clustering.jl is a subtype of SeedingAlgorithm:

Clustering.initseeds!Function
initseeds!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
-           X::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the X data matrix using the alg seeding algorithm.

source
Clustering.initseeds_by_costs!Function
initseeds_by_costs!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
-                    costs::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the costs matrix using the alg seeding algorithm.

Here, costs[i, j] is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

source

There are several seeding methods described in the literature. Clustering.jl implements three popular ones:

Clustering.KmppAlgType
KmppAlg <: SeedingAlgorithm

Kmeans++ seeding (:kmpp).

Chooses the seeds sequentially. The probability of a point to be chosen is proportional to the minimum cost of assigning it to the existing seeds.

References

D. Arthur and S. Vassilvitskii (2007). k-means++: the advantages of careful seeding. 18th Annual ACM-SIAM symposium on Discrete algorithms, 2007.

source
Clustering.KmCentralityAlgType
KmCentralityAlg <: SeedingAlgorithm

K-medoids initialization based on centrality (:kmcen).

Choose the $k$ points with the highest centrality as seeds.

References

Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. doi:10.1016/j.eswa.2008.01.039

source
Clustering.RandSeedAlgType
RandSeedAlg <: SeedingAlgorithm

Random seeding (:rand).

Chooses an arbitrary subset of $k$ data points as cluster seeds.

source

In practice, we have found that Kmeans++ is the most effective choice.

For convenience, the package defines the two wrapper functions that accept the short name of the seeding algorithm and the number of clusters and take care of allocating iseeds and applying the proper SeedingAlgorithm:

Clustering.initseedsFunction
initseeds(alg::Union{SeedingAlgorithm, Symbol},
-          X::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from a $d×n$ data matrix X using the alg algorithm.

alg could be either an instance of SeedingAlgorithm or a symbolic name of the algorithm.

Returns the vector of k seed indices.

source
Clustering.initseeds_by_costsFunction
initseeds_by_costs(alg::Union{SeedingAlgorithm, Symbol},
-                   costs::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from the $n×n$ costs matrix using algorithm alg.

Here, costs[i, j] is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

Returns the vector of k seed indices.

source
+Initialization · Clustering.jl

Initialization

A clustering algorithm usually requires initialization before it can be started.

Seeding

Seeding is a type of clustering initialization, which provides a few seeds – points from a data set that would serve as the initial cluster centers (one for each cluster).

Each seeding algorithm implemented by Clustering.jl is a subtype of SeedingAlgorithm:

Clustering.initseeds!Function
initseeds!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
+           X::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the X data matrix using the alg seeding algorithm.

source
Clustering.initseeds_by_costs!Function
initseeds_by_costs!(iseeds::AbstractVector{Int}, alg::SeedingAlgorithm,
+                    costs::AbstractMatrix) -> iseeds

Initialize iseeds with the indices of cluster seeds for the costs matrix using the alg seeding algorithm.

Here, costs[i, j] is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

source

There are several seeding methods described in the literature. Clustering.jl implements three popular ones:

Clustering.KmppAlgType
KmppAlg <: SeedingAlgorithm

Kmeans++ seeding (:kmpp).

Chooses the seeds sequentially. The probability of a point to be chosen is proportional to the minimum cost of assigning it to the existing seeds.

References

D. Arthur and S. Vassilvitskii (2007). k-means++: the advantages of careful seeding. 18th Annual ACM-SIAM symposium on Discrete algorithms, 2007.

source
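
The sequential sampling rule can be sketched in a few lines of plain Julia. This is not the package's implementation: `kmpp_sketch` is a hypothetical helper, and it assumes that the first seed is drawn uniformly and that each further seed is drawn with probability proportional to its current minimum cost.

```julia
using Random

# Hypothetical helper: pick k seed indices from an n×n cost matrix,
# k-means++ style.
function kmpp_sketch(rng::AbstractRNG, costs::AbstractMatrix{<:Real}, k::Integer)
    n = size(costs, 1)
    seeds = [rand(rng, 1:n)]          # assumption: first seed drawn uniformly
    mincosts = costs[:, seeds[1]]     # cost of assigning each point to its cheapest seed
    while length(seeds) < k
        # draw index i with probability mincosts[i] / sum(mincosts)
        c = cumsum(mincosts)
        r = rand(rng) * c[end]
        i = something(findfirst(>(r), c), n)   # fall back to n on a rounding edge case
        push!(seeds, i)
        mincosts = min.(mincosts, costs[:, i])
    end
    return seeds
end

# squared Euclidean distances between six points on a line
x = [0.0, 0.1, 5.0, 5.1, 10.0, 10.1]
costs = [(a - b)^2 for a in x, b in x]
seeds = kmpp_sketch(MersenneTwister(1), costs, 3)
```

Points that are already seeds have zero minimum cost, so they are effectively never drawn again, and far-away points are favored.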
Clustering.KmCentralityAlgType
KmCentralityAlg <: SeedingAlgorithm

K-medoids initialization based on centrality (:kmcen).

Choose the $k$ points with the highest centrality as seeds.

References

Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. doi:10.1016/j.eswa.2008.01.039

source
Clustering.RandSeedAlgType
RandSeedAlg <: SeedingAlgorithm

Random seeding (:rand).

Chooses an arbitrary subset of $k$ data points as cluster seeds.

source

In practice, we have found that Kmeans++ is the most effective choice.

For convenience, the package defines the two wrapper functions that accept the short name of the seeding algorithm and the number of clusters and take care of allocating iseeds and applying the proper SeedingAlgorithm:

Clustering.initseedsFunction
initseeds(alg::Union{SeedingAlgorithm, Symbol},
+          X::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from a $d×n$ data matrix X using the alg algorithm.

alg could be either an instance of SeedingAlgorithm or a symbolic name of the algorithm.

Returns the vector of k seed indices.

source
Clustering.initseeds_by_costsFunction
initseeds_by_costs(alg::Union{SeedingAlgorithm, Symbol},
+                   costs::AbstractMatrix, k::Integer) -> Vector{Int}

Select k seeds from the $n×n$ costs matrix using algorithm alg.

Here, costs[i, j] is the cost of assigning points $i$ and $j$ to the same cluster. One may, for example, use the squared Euclidean distance between the points as the cost.

Returns the vector of k seed indices.

source
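
A short usage sketch of the two wrappers follows. The data are random and the sizes (200 points, 5 seeds) are arbitrary illustration values.

```julia
using Clustering

X = rand(3, 200)                    # 200 points in 3 dimensions (d×n layout)

iseeds = initseeds(:kmpp, X, 5)     # indices of 5 seed points
seeds = X[:, iseeds]                # 3×5 matrix of the selected points

# The same selection driven by a precomputed cost matrix,
# here the squared Euclidean distances between all pairs of points.
costs = [sum(abs2, X[:, i] .- X[:, j]) for i in 1:200, j in 1:200]
iseeds2 = initseeds_by_costs(:kmpp, costs, 5)
```

Because seeding is randomized, iseeds and iseeds2 will generally differ even though they use the same algorithm.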
diff --git a/dev/kmeans-4ebceade.svg b/dev/kmeans-bbe89919.svg similarity index 69% rename from dev/kmeans-4ebceade.svg rename to dev/kmeans-bbe89919.svg index e1aa33d5..a1e02fee 100644 --- a/dev/kmeans-4ebceade.svg +++ b/dev/kmeans-bbe89919.svgdiff --git a/dev/kmeans.html b/dev/kmeans.html index 077cbfca..b28c2193 100644 --- a/dev/kmeans.html +++ b/dev/kmeans.html @@ -1,5 +1,5 @@ -K-means · Clustering.jl

K-means

K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to a cluster with the nearest center.

From a mathematical standpoint, K-means is a coordinate descent algorithm that solves the following optimization problem:

\[\text{minimize} \ \sum_{i=1}^n \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \ \text{w.r.t.} \ (\boldsymbol{\mu}, z)\]

Here, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ is an index of the cluster for $i$-th point $\mathbf{x}_i$.

Clustering.kmeansFunction
kmeans(X, k, [...]) -> KmeansResult

K-means clustering of the $d×n$ data matrix X (each column of X is a $d$-dimensional data point) into k clusters.

Arguments

  • init (defaults to :kmpp): how cluster seeds should be initialized, could be one of the following:
    • a Symbol, the name of a seeding algorithm (see Seeding for a list of supported methods);
    • an instance of SeedingAlgorithm;
    • an integer vector of length $k$ that provides the indices of points to use as initial seeds.
  • weights: $n$-element vector of point weights (the cluster centers are the weighted means of cluster members)
  • maxiter, tol, display: see common options
source
Clustering.KmeansResultType
KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult

The output of kmeans and kmeans!.

Type parameters

  • C<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix
  • D<:Real: type of the assignment cost
  • WC<:Real: type of the cluster weight
source

If you already have a set of initial center vectors, kmeans! could be used:

Clustering.kmeans!Function
kmeans!(X, centers; [kwargs...]) -> KmeansResult

Update the current cluster centers ($d×k$ matrix, where $d$ is the dimension and $k$ the number of centroids) using the $d×n$ data matrix X (each column of X is a $d$-dimensional data point).

See kmeans for the description of optional kwargs.

source

Examples

using Clustering
+K-means · Clustering.jl

K-means

K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to a cluster with the nearest center.

From a mathematical standpoint, K-means is a coordinate descent algorithm that solves the following optimization problem:

\[\text{minimize} \ \sum_{i=1}^n \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \ \text{w.r.t.} \ (\boldsymbol{\mu}, z)\]

Here, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ is an index of the cluster for $i$-th point $\mathbf{x}_i$.
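
To make the objective concrete, the sketch below evaluates it for a toy data set in plain Julia. The `objective` helper and the data are hypothetical and only mirror the formula above; they are not part of the package.

```julia
# Toy data: four 2-dimensional points stored as columns (d×n layout).
X = [0.0 1.0 10.0 11.0;
     0.0 0.0  0.0  0.0]
centers = [0.5 10.5;     # d×k matrix of cluster centers μ
           0.0  0.0]
z = [1, 1, 2, 2]         # z[i] is the cluster index of the i-th point

# Sum of squared distances from every point to its assigned center.
objective(X, centers, z) =
    sum(sum(abs2, X[:, i] .- centers[:, z[i]]) for i in axes(X, 2))

objective(X, centers, z)              # 1.0
objective(X, centers, [1, 2, 2, 2])   # much larger: point 2 is sent to the far center
```

K-means alternates between improving z for fixed centers and improving the centers for fixed z, so this value does not increase from one iteration to the next.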

Clustering.kmeansFunction
kmeans(X, k, [...]) -> KmeansResult

K-means clustering of the $d×n$ data matrix X (each column of X is a $d$-dimensional data point) into k clusters.

Arguments

  • init (defaults to :kmpp): how cluster seeds should be initialized, could be one of the following:
    • a Symbol, the name of a seeding algorithm (see Seeding for a list of supported methods);
    • an instance of SeedingAlgorithm;
    • an integer vector of length $k$ that provides the indices of points to use as initial seeds.
  • weights: $n$-element vector of point weights (the cluster centers are the weighted means of cluster members)
  • maxiter, tol, display: see common options
source
Clustering.KmeansResultType
KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult

The output of kmeans and kmeans!.

Type parameters

  • C<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix
  • D<:Real: type of the assignment cost
  • WC<:Real: type of the cluster weight
source

If you already have a set of initial center vectors, kmeans! could be used:

Clustering.kmeans!Function
kmeans!(X, centers; [kwargs...]) -> KmeansResult

Update the current cluster centers ($d×k$ matrix, where $d$ is the dimension and $k$ the number of centroids) using the $d×n$ data matrix X (each column of X is a $d$-dimensional data point).

See kmeans for the description of optional kwargs.

source

Examples

using Clustering
 
 # make a random dataset with 1000 random 5-dimensional points
 X = rand(5, 1000)
@@ -12,11 +12,11 @@
 a = assignments(R) # get the assignments of points to clusters
 c = counts(R) # get the cluster sizes
 M = R.centers # get the cluster centers
5×20 Matrix{Float64}:
- 0.176858  0.830428  0.28105   0.38975   …  0.827002  0.221205  0.209538
- 0.659943  0.325135  0.230565  0.725787     0.546358  0.820589  0.228468
- 0.785533  0.196622  0.282203  0.796344     0.256191  0.648165  0.323288
- 0.275849  0.711221  0.246314  0.791387     0.231751  0.231231  0.774473
- 0.711626  0.673011  0.838729  0.602332     0.330237  0.16094   0.325637

Scatter plot of the K-means clustering results:

using RDatasets, Clustering, Plots
+ 0.613434  0.827851  0.415703  0.761409  …  0.24174   0.416385  0.312106
+ 0.215067  0.702396  0.773295  0.377902     0.478804  0.780879  0.187333
+ 0.222829  0.717061  0.280821  0.232335     0.695157  0.797438  0.205332
+ 0.267597  0.74069   0.770797  0.78535      0.196763  0.772653  0.680894
+ 0.240089  0.670621  0.751962  0.390059     0.171474  0.21502   0.732506

Scatter plot of the K-means clustering results:

using RDatasets, Clustering, Plots
 iris = dataset("datasets", "iris"); # load the data
 
 features = collect(Matrix(iris[:, 1:4])'); # features to use for clustering
@@ -24,4 +24,4 @@
 
 # plot with the point color mapped to the assigned cluster index
 scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
-        color=:lightrainbow, legend=false)
Example block output
+ color=:lightrainbow, legend=false)
Example block output
diff --git a/dev/kmedoids.html b/dev/kmedoids.html index 1e6fdc8d..5f6ffc07 100644 --- a/dev/kmedoids.html +++ b/dev/kmedoids.html @@ -1,3 +1,3 @@ -K-medoids · Clustering.jl

K-medoids

K-medoids is a clustering algorithm that works by finding $k$ data points (called medoids) such that the total distance between each data point and the closest medoid is minimal.

Clustering.kmedoidsFunction
kmedoids(dist::AbstractMatrix, k::Integer; ...) -> KmedoidsResult

Perform K-medoids clustering of $n$ points into k clusters, given the dist matrix ($n×n$, dist[i, j] is the distance between the j-th and i-th points).

Arguments

  • init (defaults to :kmpp): how medoids should be initialized, could be one of the following:
    • a Symbol indicating the name of a seeding algorithm (see Seeding for a list of supported methods).
    • an integer vector of length k that provides the indices of points to use as initial medoids.
  • maxiter, tol, display: see common options

Note

The function implements a K-means style algorithm instead of PAM (Partitioning Around Medoids). The K-means style algorithm converges in fewer iterations, but was shown to produce worse results (10-20% higher total costs; see e.g. Schubert & Rousseeuw (2019)).

source
Clustering.kmedoids!Function
kmedoids!(dist::AbstractMatrix, medoids::Vector{Int};
-          [kwargs...]) -> KmedoidsResult

Update the current cluster medoids using the dist matrix.

The medoids field of the returned KmedoidsResult points to the same array as medoids argument.

See kmedoids for the description of optional kwargs.

source
Clustering.KmedoidsResultType
KmedoidsResult{T} <: ClusteringResult

The output of kmedoids function.

Fields

  • medoids::Vector{Int}: the indices of $k$ medoids
  • assignments::Vector{Int}: the indices of clusters the points are assigned to, so that medoids[assignments[i]] is the index of the medoid for the $i$-th point
  • costs::Vector{T}: assignment costs, i.e. costs[i] is the cost of assigning $i$-th point to its medoid
  • counts::Vector{Int}: cluster sizes
  • totalcost::Float64: total assignment cost (the sum of costs)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source

References

  1. Teitz, M.B. and Bart, P. (1968). Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph. Operations Research, 16(5), 955–961. doi:10.1287/opre.16.5.955
  2. Schubert, E. and Rousseeuw, P.J. (2019). Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP, 171-187. doi:10.1007/978-3-030-32047-8_16
+K-medoids · Clustering.jl

K-medoids

K-medoids is a clustering algorithm that works by finding $k$ data points (called medoids) such that the total distance between each data point and the closest medoid is minimal.

Clustering.kmedoidsFunction
kmedoids(dist::AbstractMatrix, k::Integer; ...) -> KmedoidsResult

Perform K-medoids clustering of $n$ points into k clusters, given the dist matrix ($n×n$, dist[i, j] is the distance between the j-th and i-th points).

Arguments

  • init (defaults to :kmpp): how medoids should be initialized, could be one of the following:
    • a Symbol indicating the name of a seeding algorithm (see Seeding for a list of supported methods).
    • an integer vector of length k that provides the indices of points to use as initial medoids.
  • maxiter, tol, display: see common options

Note

The function implements a K-means style algorithm instead of PAM (Partitioning Around Medoids). The K-means style algorithm converges in fewer iterations, but was shown to produce worse results (10-20% higher total costs; see e.g. Schubert & Rousseeuw (2019)).

source
Clustering.kmedoids!Function
kmedoids!(dist::AbstractMatrix, medoids::Vector{Int};
+          [kwargs...]) -> KmedoidsResult

Update the current cluster medoids using the dist matrix.

The medoids field of the returned KmedoidsResult points to the same array as medoids argument.

See kmedoids for the description of optional kwargs.

source
Clustering.KmedoidsResultType
KmedoidsResult{T} <: ClusteringResult

The output of kmedoids function.

Fields

  • medoids::Vector{Int}: the indices of $k$ medoids
  • assignments::Vector{Int}: the indices of clusters the points are assigned to, so that medoids[assignments[i]] is the index of the medoid for the $i$-th point
  • costs::Vector{T}: assignment costs, i.e. costs[i] is the cost of assigning $i$-th point to its medoid
  • counts::Vector{Int}: cluster sizes
  • totalcost::Float64: total assignment cost (the sum of costs)
  • iterations::Int: the number of executed algorithm iterations
  • converged::Bool: whether the procedure converged
source
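
A minimal usage sketch, assuming Euclidean distances between random 2-dimensional points (the data and the choice of 3 clusters are arbitrary):

```julia
using Clustering

X = rand(2, 60)      # 60 random 2-dimensional points
# n×n matrix of pairwise Euclidean distances
dist = [sqrt(sum(abs2, X[:, i] .- X[:, j])) for i in 1:60, j in 1:60]

R = kmedoids(dist, 3)

R.medoids                      # indices of the 3 medoid points
R.medoids[R.assignments[1]]    # index of the medoid that point 1 is assigned to
R.totalcost ≈ sum(R.costs)     # total cost is the sum of the per-point costs
```

Unlike K-means, only the distance matrix is needed, so the same call works for any dissimilarity measure.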

References

  1. Teitz, M.B. and Bart, P. (1968). Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph. Operations Research, 16(5), 955–961. doi:10.1287/opre.16.5.955
  2. Schubert, E. and Rousseeuw, P.J. (2019). Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP, 171-187. doi:10.1007/978-3-030-32047-8_16
diff --git a/dev/mcl.html b/dev/mcl.html index cea5a55d..7c90a14b 100644 --- a/dev/mcl.html +++ b/dev/mcl.html @@ -1,2 +1,2 @@ -MCL (Markov Cluster Algorithm) · Clustering.jl

MCL (Markov Cluster Algorithm)

Markov Cluster Algorithm works by simulating a stochastic (Markov) flow in a weighted graph, where each node is a data point, and the edge weights are defined by the adjacency matrix. ... When the algorithm converges, it produces the new edge weights that define the new connected components of the graph (i.e. the clusters).

Clustering.mclFunction
mcl(adj::AbstractMatrix; [kwargs...]) -> MCLResult

Perform MCL (Markov Cluster Algorithm) clustering using $n×n$ adjacency (points similarity) matrix adj.

Arguments

Keyword arguments to control the MCL algorithm:

  • add_loops::Bool (enabled by default): whether the edges of weight 1.0 from the node to itself should be appended to the graph
  • expansion::Number (defaults to 2): MCL expansion constant
  • inflation::Number (defaults to 2): MCL inflation constant
  • save_final_matrix::Bool (disabled by default): whether to save the final equilibrium state in the mcl_adj field of the result; could provide useful diagnostic if the method doesn't converge
  • prune_tol::Number: pruning threshold
  • display, maxiter, tol: see common options

References

Stijn van Dongen, "Graph clustering by flow simulation", 2001

Original MCL implementation.

source
Clustering.MCLResultType
MCLResult <: ClusteringResult

The output of mcl function.

Fields

  • mcl_adj::AbstractMatrix: the final MCL adjacency matrix (equilibrium state matrix if the algorithm converged), empty if save_final_matrix option is disabled
  • assignments::Vector{Int}: indices of the points' clusters. assignments[i] is the index of the cluster for the $i$-th point ($0$ if unassigned)
  • counts::Vector{Int}: the $k$-length vector of cluster sizes
  • nunassigned::Int: the number of standalone points not assigned to any cluster
  • iterations::Int: the number of elapsed iterations
  • rel_Δ::Float64: the final relative Δ
  • converged::Bool: whether the method converged
source
+MCL (Markov Cluster Algorithm) · Clustering.jl

MCL (Markov Cluster Algorithm)

Markov Cluster Algorithm works by simulating a stochastic (Markov) flow in a weighted graph, where each node is a data point, and the edge weights are defined by the adjacency matrix. ... When the algorithm converges, it produces the new edge weights that define the new connected components of the graph (i.e. the clusters).

Clustering.mclFunction
mcl(adj::AbstractMatrix; [kwargs...]) -> MCLResult

Perform MCL (Markov Cluster Algorithm) clustering using $n×n$ adjacency (points similarity) matrix adj.

Arguments

Keyword arguments to control the MCL algorithm:

  • add_loops::Bool (enabled by default): whether the edges of weight 1.0 from the node to itself should be appended to the graph
  • expansion::Number (defaults to 2): MCL expansion constant
  • inflation::Number (defaults to 2): MCL inflation constant
  • save_final_matrix::Bool (disabled by default): whether to save the final equilibrium state in the mcl_adj field of the result; could provide useful diagnostic if the method doesn't converge
  • prune_tol::Number: pruning threshold
  • display, maxiter, tol: see common options

References

Stijn van Dongen, "Graph clustering by flow simulation", 2001

Original MCL implementation.

source
Clustering.MCLResultType
MCLResult <: ClusteringResult

The output of mcl function.

Fields

  • mcl_adj::AbstractMatrix: the final MCL adjacency matrix (equilibrium state matrix if the algorithm converged), empty if save_final_matrix option is disabled
  • assignments::Vector{Int}: indices of the points' clusters. assignments[i] is the index of the cluster for the $i$-th point ($0$ if unassigned)
  • counts::Vector{Int}: the $k$-length vector of cluster sizes
  • nunassigned::Int: the number of standalone points not assigned to any cluster
  • iterations::Int: the number of elapsed iterations
  • rel_Δ::Float64: the final relative Δ
  • converged::Bool: whether the method converged
source
diff --git a/dev/validate.html b/dev/validate.html index 51acff60..2a71d2ea 100644 --- a/dev/validate.html +++ b/dev/validate.html @@ -1,18 +1,18 @@ Evaluation & Validation · Clustering.jl

Evaluation & Validation

Clustering.jl package provides a number of methods to compare different clusterings, evaluate clustering quality or validate its correctness.

Clustering comparison

Methods to compare two clusterings and measure their similarity.

Cross tabulation

Cross tabulation, or contingency matrix, is a basis for many clustering quality measures. It shows how similar the two clusterings are at the cluster level.

Clustering.jl extends StatsBase.counts() with methods that accept ClusteringResult arguments:

StatsBase.countsMethod
counts(a::ClusteringResult, b::ClusteringResult) -> Matrix{Int}
 counts(a::ClusteringResult, b::AbstractVector{<:Integer}) -> Matrix{Int}
-counts(a::AbstractVector{<:Integer}, b::ClusteringResult) -> Matrix{Int}

Calculate the cross tabulation (aka contingency matrix) for the two clusterings of the same data points.

Returns the $n_a × n_b$ matrix C, where $n_a$ and $n_b$ are the numbers of clusters in a and b, respectively, and C[i, j] is the size of the intersection of i-th cluster from a and j-th cluster from b.

The clusterings could be specified either as ClusteringResult instances or as vectors of data point assignments.

See also

confusion(a::ClusteringResult, b::ClusteringResult) for 2×2 confusion matrix.

source

Confusion matrix

The confusion matrix for two clusterings is a 2×2 contingency table that counts how frequently pairs of data points are in the same or in different clusters.

Clustering.confusionFunction
confusion([T = Int],
+counts(a::AbstractVector{<:Integer}, b::ClusteringResult) -> Matrix{Int}

Calculate the cross tabulation (aka contingency matrix) for the two clusterings of the same data points.

Returns the $n_a × n_b$ matrix C, where $n_a$ and $n_b$ are the numbers of clusters in a and b, respectively, and C[i, j] is the size of the intersection of i-th cluster from a and j-th cluster from b.

The clusterings could be specified either as ClusteringResult instances or as vectors of data point assignments.

See also

confusion(a::ClusteringResult, b::ClusteringResult) for 2×2 confusion matrix.

source

Confusion matrix

The confusion matrix for two clusterings is a 2×2 contingency table that counts how frequently pairs of data points are in the same or in different clusters.

Clustering.confusionFunction
confusion([T = Int],
           a::Union{ClusteringResult, AbstractVector},
-          b::Union{ClusteringResult, AbstractVector}) -> Matrix{T}

Calculate the confusion matrix of the two clusterings.

Returns the 2×2 confusion matrix C of type T (Int by default) that represents partition co-occurrence or similarity matrix between two clusterings a and b by considering all pairs of samples and counting pairs that are assigned into the same or into different clusters.

If a pair of samples placed in the same group is treated as a positive pair, and a pair placed in different groups as a negative pair, then C₁₁ is the count of true positives, C₁₂ of false negatives, C₂₁ of false positives, and C₂₂ of true negatives:

          Positive  Negative
Positive  C₁₁       C₁₂
Negative  C₂₁       C₂₂

See also

counts(a::ClusteringResult, b::ClusteringResult) for full contingency matrix.

source
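
The pair counting behind this matrix can be written out directly. The sketch below is a plain-Julia illustration, not the package's implementation: `pair_confusion` is a hypothetical helper, and it assumes that rows refer to the first clustering and columns to the second (index 1 for "same cluster", index 2 for "different clusters").

```julia
# Count pairs of points by whether each clustering puts them in the same cluster.
function pair_confusion(a::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    length(a) == length(b) || throw(DimensionMismatch("a and b must have equal length"))
    C = zeros(Int, 2, 2)
    n = length(a)
    for i in 1:n-1, j in i+1:n
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        C[same_a ? 1 : 2, same_b ? 1 : 2] += 1
    end
    return C
end

a = [1, 1, 2, 2]
b = [1, 1, 1, 2]
C = pair_confusion(a, b)   # [1 1; 2 2]
```

The entries always sum to n(n-1)/2, the number of distinct pairs, which is 6 for the four points here. This quadratic loop is only meant to show the definition and would be slow for large data sets.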

Rand index

Rand index is a measure of the similarity between the two data clusterings. From a mathematical standpoint, Rand index is related to the prediction accuracy, but is applicable even when the original class labels are not used.

Clustering.randindexFunction
randindex(a, b) -> NTuple{4, Float64}

Compute the tuple of Rand-related indices between the clusterings a and b.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

Returns a tuple of indices:

  • Hubert & Arabie Adjusted Rand index
  • Rand index (agreement probability)
  • Mirkin's index (disagreement probability)
  • Hubert's index ($P(\mathrm{agree}) - P(\mathrm{disagree})$)

References

Lawrence Hubert and Phipps Arabie (1985). Comparing partitions. Journal of Classification 2 (1): 193-218

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173-187.

Steinley, Douglas (2004). Properties of the Hubert-Arabie Adjusted Rand Index. Psychological Methods, Vol. 9, No. 3: 386-396

source
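
Three of the four indices follow directly from pair agreement counts. The sketch below uses a hypothetical 2×2 pair confusion matrix for 4 points (6 pairs) and plain arithmetic; the variable names are illustrative, and the adjusted Rand index is left out because it needs an additional chance correction.

```julia
# Hypothetical pair confusion matrix [C₁₁ C₁₂; C₂₁ C₂₂] for 4 points (6 pairs).
C = [1 1;
     2 2]

npairs   = sum(C)
agree    = C[1, 1] + C[2, 2]   # pairs treated the same way by both clusterings
disagree = C[1, 2] + C[2, 1]   # pairs treated differently

rand_index   = agree / npairs              # 0.5, agreement probability
mirkin_index = disagree / npairs           # 0.5, disagreement probability
hubert_index = rand_index - mirkin_index   # 0.0, P(agree) - P(disagree)
```

Since every pair either agrees or disagrees, the Rand and Mirkin values always sum to 1.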

Variation of Information

Variation of information (also known as shared information distance) is a measure of the distance between the two clusterings. It is derived from the mutual information, but it is a true metric, i.e. it is symmetric and satisfies the triangle inequality.

Clustering.varinfoFunction
varinfo(a, b) -> Float64

Compute the variation of information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

References

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173–187.

source

V-measure

V-measure can be used to compare the clustering results with the existing class labels of data points or with the alternative clustering. It is defined as the harmonic mean of homogeneity ($h$) and completeness ($c$) of the clustering:

\[V_{\beta} = (1+\beta)\frac{h \cdot c}{\beta \cdot h + c}.\]

Both $h$ and $c$ can be expressed in terms of the mutual information and entropy measures from information theory. Homogeneity ($h$) is maximized when each cluster contains elements of as few different classes as possible. Completeness ($c$) is maximized when all elements of each class are assigned to a single cluster. The $\beta$ parameter ($\beta > 0$) can be used to control the weights of $h$ and $c$ in the final measure: if $\beta > 1$, completeness has more weight, and if $\beta < 1$, homogeneity has more weight.

Clustering.vmeasureFunction
vmeasure(a, b; [β = 1.0]) -> Float64

V-measure between the two clusterings.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

The β parameter defines trade-off between homogeneity and completeness:

  • if $β > 1$, completeness is weighted more strongly,
  • if $β < 1$, homogeneity is weighted more strongly.

References

Andrew Rosenberg and Julia Hirschberg, 2007. V-Measure: A conditional entropy-based external cluster evaluation measure

source

Mutual information

Mutual information quantifies the "amount of information" obtained about one random variable through observing the other random variable. It is used in determining the similarity of two different clusterings of a dataset.

Clustering.mutualinfoFunction
mutualinfo(a, b; normed=true) -> Float64

Compute the mutual information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

If the normed parameter is true, the return value is the normalized mutual information (symmetric uncertainty); see "Data Mining: Practical Machine Learning Tools and Techniques", Witten & Frank 2005.

References

Vinh, Epps, and Bailey, (2009). Information theoretic measures for clusterings comparison. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ‘09.

source

Clustering quality indices

clustering_quality() methods allow computing intrinsic clustering quality indices, i.e. the metrics that depend only on the clustering itself and do not use the external knowledge. These metrics can be used to compare different clustering algorithms or choose the optimal number of clusters.

quality index | quality_index option | clustering type | better quality | cluster centers
Calinski-Harabasz | :calinski_harabasz | hard/fuzzy | higher values | required
Xie-Beni | :xie_beni | hard/fuzzy | lower values | required
Davies-Bouldin | :davies_bouldin | hard | lower values | required
Dunn | :dunn | hard | higher values | not required
silhouettes | :silhouettes | hard | higher values | not required
Clustering.clustering_qualityFunction

For "hard" clustering:

clustering_quality(data, centers, assignments; quality_index, [metric])
+          b::Union{ClusteringResult, AbstractVector}) -> Matrix{T}

Calculate the confusion matrix of the two clusterings.

Returns the 2×2 confusion matrix C of type T (Int by default) that represents partition co-occurrence or similarity matrix between two clusterings a and b by considering all pairs of samples and counting pairs that are assigned into the same or into different clusters.

If a pair of samples assigned to the same cluster is treated as a positive pair, and a pair assigned to different clusters as a negative pair, then C₁₁ is the count of true positives, C₁₂ of false negatives, C₂₁ of false positives, and C₂₂ of true negatives:

         | Positive | Negative
Positive | C₁₁      | C₁₂
Negative | C₂₁      | C₂₂

See also

counts(a::ClusteringResult, b::ClusteringResult) for the full contingency matrix.

source
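The pair counting described above can be sketched in plain Julia. This is a minimal sketch, not the package implementation: `pair_confusion` is a hypothetical helper, and indexing rows by `a` and columns by `b` is an assumption made for the example.

```julia
# Hypothetical helper (not part of Clustering.jl): builds the 2×2 pair
# confusion matrix by visiting every unordered pair of samples once.
function pair_confusion(a::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    length(a) == length(b) || throw(DimensionMismatch("assignment vectors differ in length"))
    n = length(a)
    C = zeros(Int, 2, 2)
    for i in 1:n-1, j in i+1:n
        same_a = a[i] == a[j]   # pair is "positive" in clustering a
        same_b = b[i] == b[j]   # pair is "positive" in clustering b
        # Assumed orientation: rows follow a, columns follow b.
        C[same_a ? 1 : 2, same_b ? 1 : 2] += 1
    end
    return C
end

a = [1, 1, 2, 2]
b = [1, 1, 1, 2]
C = pair_confusion(a, b)   # C == [1 1; 2 2]
```

The entries always sum to n(n-1)/2, the number of unordered sample pairs (6 in this example).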

Rand index

Rand index is a measure of the similarity between the two data clusterings. From a mathematical standpoint, Rand index is related to the prediction accuracy, but is applicable even when the original class labels are not used.

Clustering.randindexFunction
randindex(a, b) -> NTuple{4, Float64}

Compute the tuple of Rand-related indices between the clusterings a and b.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

Returns a tuple of indices:

  • Hubert & Arabie Adjusted Rand index
  • Rand index (agreement probability)
  • Mirkin's index (disagreement probability)
  • Hubert's index ($P(\mathrm{agree}) - P(\mathrm{disagree})$)

References

Lawrence Hubert and Phipps Arabie (1985). Comparing partitions. Journal of Classification 2 (1): 193-218

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173-187.

Steinley, Douglas (2004). Properties of the Hubert-Arabie Adjusted Rand Index. Psychological Methods, Vol. 9, No. 3: 386-396

source
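To make the three unadjusted indices concrete, the following sketch derives them from pair agreement counts. `rand_indices` is a hypothetical name used only here, and the adjusted Rand index is left out because its chance correction is not spelled out in this section.

```julia
# Hypothetical helper: Rand, Mirkin and Hubert indices from pair agreement.
# A pair "agrees" when both clusterings put it together or both split it.
function rand_indices(a::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    n = length(a)
    npairs = n * (n - 1) ÷ 2
    agree = 0
    for i in 1:n-1, j in i+1:n
        agree += (a[i] == a[j]) == (b[i] == b[j])
    end
    ri = agree / npairs        # agreement probability
    mirkin = 1 - ri            # disagreement probability
    hubert = ri - mirkin       # P(agree) - P(disagree)
    return (rand = ri, mirkin = mirkin, hubert = hubert)
end

r = rand_indices([1, 1, 2, 2], [1, 1, 1, 2])
# Three of the six pairs agree, so r.rand == 0.5 and r.hubert == 0.0
```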

Variation of Information

Variation of information (also known as shared information distance) is a measure of the distance between the two clusterings. It is devised from the mutual information, but it is a true metric, i.e. it is symmetric and satisfies the triangle inequality.

Clustering.varinfoFunction
varinfo(a, b) -> Float64

Compute the variation of information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

References

Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173–187.

source
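As a worked illustration, variation of information can be written as $H(a) + H(b) - 2I(a, b)$, where $H$ is the entropy and $I$ the mutual information. The sketch below assumes natural logarithms and cluster labels in 1..k; `entropy` and `variation_of_information` are hypothetical helpers, not package functions.

```julia
# Entropy (in nats) of a collection of probabilities; zero entries contribute 0.
entropy(p) = -sum(x -> x > 0 ? x * log(x) : 0.0, p)

# Hypothetical helper: VI(a, b) = H(a) + H(b) - 2 I(a, b).
function variation_of_information(a, b)
    n = length(a)
    joint = zeros(maximum(a), maximum(b))   # joint label distribution
    for (i, j) in zip(a, b)
        joint[i, j] += 1 / n
    end
    ha = entropy(sum(joint, dims = 2))      # marginal entropy of a
    hb = entropy(sum(joint, dims = 1))      # marginal entropy of b
    mi = ha + hb - entropy(joint)           # mutual information
    return ha + hb - 2mi
end

vi_same = variation_of_information([1, 1, 2, 2], [1, 1, 2, 2])   # identical clusterings
vi_indep = variation_of_information([1, 1, 2, 2], [1, 2, 1, 2])  # independent clusterings
```

Identical clusterings give a distance of zero, while the two independent two-cluster partitions above are at distance 2 log 2.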

V-measure

V-measure can be used to compare the clustering results with the existing class labels of data points or with the alternative clustering. It is defined as the harmonic mean of homogeneity ($h$) and completeness ($c$) of the clustering:

\[V_{\beta} = (1+\beta)\frac{h \cdot c}{\beta \cdot h + c}.\]

Both $h$ and $c$ can be expressed in terms of the mutual information and entropy measures from information theory. Homogeneity ($h$) is maximized when each cluster contains elements of as few different classes as possible. Completeness ($c$) is maximized when all elements of each class are assigned to a single cluster. The $\beta$ parameter ($\beta > 0$) can be used to control the weights of $h$ and $c$ in the final measure: if $\beta > 1$, completeness has more weight, and if $\beta < 1$, homogeneity has more weight.
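The effect of $\beta$ can be checked numerically by plugging values into the formula for $V_{\beta}$. The helper `vbeta` and the values of `h` and `c` below are made up for illustration only.

```julia
# V_β computed directly from homogeneity h and completeness c.
vbeta(h, c; β = 1.0) = (1 + β) * h * c / (β * h + c)

h, c = 0.5, 1.0                  # illustrative values: complete but not homogeneous
v_equal = vbeta(h, c)            # β = 1: plain harmonic mean, 2/3
v_compl = vbeta(h, c; β = 2.0)   # completeness weighted more: 0.75
v_homog = vbeta(h, c; β = 0.5)   # homogeneity weighted more: 0.6
```

Because completeness is the higher of the two scores here, giving it more weight raises $V_{\beta}$, and giving homogeneity more weight lowers it.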

Clustering.vmeasureFunction
vmeasure(a, b; [β = 1.0]) -> Float64

V-measure between the two clusterings.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

The β parameter defines trade-off between homogeneity and completeness:

  • if $β > 1$, completeness is weighted more strongly,
  • if $β < 1$, homogeneity is weighted more strongly.

References

Andrew Rosenberg and Julia Hirschberg, 2007. V-Measure: A conditional entropy-based external cluster evaluation measure

source

Mutual information

Mutual information quantifies the "amount of information" obtained about one random variable through observing the other random variable. It is used in determining the similarity of two different clusterings of a dataset.

Clustering.mutualinfoFunction
mutualinfo(a, b; normed=true) -> Float64

Compute the mutual information between the two clusterings of the same data points.

a and b can be either ClusteringResult instances or assignments vectors (AbstractVector{<:Integer}).

If the normed parameter is true, the return value is the normalized mutual information (symmetric uncertainty); see "Data Mining: Practical Machine Learning Tools and Techniques", Witten & Frank 2005.

References

Vinh, Epps, and Bailey, (2009). Information theoretic measures for clusterings comparison. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ‘09.

source
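The following sketch shows one way to compute mutual information between two assignment vectors from their joint label distribution. It assumes natural logarithms, labels in 1..k, and the symmetric uncertainty normalization $2I / (H(a) + H(b))$; `entropy` and `mutual_information` are hypothetical helpers, not the package implementation.

```julia
# Entropy (in nats) of a collection of probabilities; zero entries contribute 0.
entropy(p) = -sum(x -> x > 0 ? x * log(x) : 0.0, p)

# Hypothetical helper: I(a, b) = H(a) + H(b) - H(a, b), optionally normalized.
function mutual_information(a, b; normed = true)
    n = length(a)
    joint = zeros(maximum(a), maximum(b))
    for (i, j) in zip(a, b)
        joint[i, j] += 1 / n
    end
    ha = entropy(sum(joint, dims = 2))
    hb = entropy(sum(joint, dims = 1))
    mi = ha + hb - entropy(joint)
    # Symmetric uncertainty; undefined (NaN) if both clusterings have one cluster.
    return normed ? 2mi / (ha + hb) : mi
end

mi_same = mutual_information([1, 1, 2, 2], [2, 2, 1, 1])                # same partition, relabeled
mi_raw = mutual_information([1, 1, 2, 2], [2, 2, 1, 1]; normed = false)
mi_indep = mutual_information([1, 1, 2, 2], [1, 2, 1, 2])               # independent partitions
```

Relabeling the clusters does not change the result: the normalized value is 1 for identical partitions and 0 for independent ones.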

Clustering quality indices

clustering_quality() methods allow computing intrinsic clustering quality indices, i.e. the metrics that depend only on the clustering itself and do not use the external knowledge. These metrics can be used to compare different clustering algorithms or choose the optimal number of clusters.

quality index | quality_index option | clustering type | better quality | cluster centers
Calinski-Harabasz | :calinski_harabasz | hard/fuzzy | higher values | required
Xie-Beni | :xie_beni | hard/fuzzy | lower values | required
Davies-Bouldin | :davies_bouldin | hard | lower values | required
Dunn | :dunn | hard | higher values | not required
silhouettes | :silhouettes | hard | higher values | not required
Clustering.clustering_qualityFunction

For "hard" clustering:

clustering_quality(data, centers, assignments; quality_index, [metric])
 clustering_quality(data, clustering; quality_index, [metric])

For fuzzy ("soft") clustering:

clustering_quality(data, centers, weights; quality_index, fuzziness, [metric])
 clustering_quality(data, clustering; quality_index, fuzziness, [metric])

For "hard" clustering without specifying cluster centers:

clustering_quality(data, assignments; quality_index, [metric])
 clustering_quality(data, clustering; quality_index, [metric])

For "hard" clustering without specifying data points and cluster centers:

clustering_quality(assignments, dist_matrix; quality_index)
-clustering_quality(clustering, dist_matrix; quality_index)

Compute the quality index for a given clustering.

Returns a quality index (real value).

Arguments

  • data::AbstractMatrix: $d×n$ data matrix with each column representing one $d$-dimensional data point
  • centers::AbstractMatrix: $d×k$ matrix with cluster centers represented as columns
  • assignments::AbstractVector{Int}: $n$ vector of point assignments (cluster indices)
  • weights::AbstractMatrix: $n×k$ matrix with fuzzy clustering weights, weights[i,j] is the degree of membership of $i$-th data point to $j$-th cluster
  • clustering::Union{ClusteringResult, FuzzyCMeansResult}: the output of the clustering method
  • quality_index::Symbol: quality index to calculate; see below for the supported options
  • dist_matrix::AbstractMatrix: a $n×n$ pairwise distance matrix; dist_matrix[i,j] is the distance between $i$-th and $j$-th points

Keyword arguments

  • quality_index::Symbol: clustering quality index to calculate; see below for the supported options
  • fuzziness::Real: clustering fuzziness > 1
  • metric::SemiMetric=SqEuclidean(): SemiMetric object that defines the metric/distance/similarity function

When calling clustering_quality, one can explicitly specify centers, assignments, and weights, or provide ClusteringResult via clustering, from which the necessary data will be read automatically.

For clustering without known cluster centers, the data points are not required: dist_matrix can be provided explicitly; otherwise it is calculated from the data points using the specified metric.

Supported quality indices

  • :calinski_harabasz: hard or fuzzy Calinski-Harabasz index (↑), the corrected ratio of the inertia between cluster centers to the inertia within clusters
  • :xie_beni: hard or fuzzy Xie-Beni index (↓), the ratio between the inertia within clusters and the minimal distance between the cluster centers
  • :davies_bouldin: Davies-Bouldin index (↓), the similarity between the cluster and the other most similar one, averaged over all clusters
  • :dunn: Dunn index (↑), the ratio of the minimal distance between clusters and the maximal cluster diameter
  • :silhouettes: the average silhouette index (↑), see silhouettes

The arrows ↑ or ↓ specify the direction of increasing clustering quality. Please refer to the documentation for more details on the clustering quality indices.

source

The clustering quality index definitions use the following notation:

  • $x_1, x_2, \ldots, x_n$: data points,
  • $C_1, C_2, \ldots, C_k$: clusters,
  • $c_j$ and $c$: cluster centers and global dataset center,
  • $d$: a similarity (distance) function,
  • $w_{ij}$: weights measuring membership of a point $x_i$ to a cluster $C_j$,
  • $\alpha$: a fuzziness parameter.

Calinski-Harabasz index

Calinski-Harabasz index (option :calinski_harabasz) measures corrected ratio between global inertia of the cluster centers and the summed internal inertias of clusters:

\[\frac{n-k}{k-1}\frac{\sum_{C_j}|C_j|d(c_j,c)}{\sum\limits_{C_j}\sum\limits_{x_i\in C_j} d(x_i,c_j)} \quad \text{and}\quad
+clustering_quality(clustering, dist_matrix; quality_index)

Compute the quality index for a given clustering.

Returns a quality index (real value).

Arguments

  • data::AbstractMatrix: $d×n$ data matrix with each column representing one $d$-dimensional data point
  • centers::AbstractMatrix: $d×k$ matrix with cluster centers represented as columns
  • assignments::AbstractVector{Int}: $n$ vector of point assignments (cluster indices)
  • weights::AbstractMatrix: $n×k$ matrix with fuzzy clustering weights, weights[i,j] is the degree of membership of $i$-th data point to $j$-th cluster
  • clustering::Union{ClusteringResult, FuzzyCMeansResult}: the output of the clustering method
  • quality_index::Symbol: quality index to calculate; see below for the supported options
  • dist_matrix::AbstractMatrix: a $n×n$ pairwise distance matrix; dist_matrix[i,j] is the distance between $i$-th and $j$-th points

Keyword arguments

  • quality_index::Symbol: clustering quality index to calculate; see below for the supported options
  • fuzziness::Real: clustering fuzziness > 1
  • metric::SemiMetric=SqEuclidean(): SemiMetric object that defines the metric/distance/similarity function

When calling clustering_quality, one can explicitly specify centers, assignments, and weights, or provide ClusteringResult via clustering, from which the necessary data will be read automatically.

For clustering without known cluster centers, the data points are not required: dist_matrix can be provided explicitly; otherwise it is calculated from the data points using the specified metric.

Supported quality indices

  • :calinski_harabasz: hard or fuzzy Calinski-Harabasz index (↑), the corrected ratio of the inertia between cluster centers to the inertia within clusters
  • :xie_beni: hard or fuzzy Xie-Beni index (↓), the ratio between the inertia within clusters and the minimal distance between the cluster centers
  • :davies_bouldin: Davies-Bouldin index (↓), the similarity between the cluster and the other most similar one, averaged over all clusters
  • :dunn: Dunn index (↑), the ratio of the minimal distance between clusters and the maximal cluster diameter
  • :silhouettes: the average silhouette index (↑), see silhouettes

The arrows ↑ or ↓ specify the direction of increasing clustering quality. Please refer to the documentation for more details on the clustering quality indices.

source

The clustering quality index definitions use the following notation:

  • $x_1, x_2, \ldots, x_n$: data points,
  • $C_1, C_2, \ldots, C_k$: clusters,
  • $c_j$ and $c$: cluster centers and global dataset center,
  • $d$: a similarity (distance) function,
  • $w_{ij}$: weights measuring membership of a point $x_i$ to a cluster $C_j$,
  • $\alpha$: a fuzziness parameter.

Calinski-Harabasz index

Calinski-Harabasz index (option :calinski_harabasz) measures corrected ratio between global inertia of the cluster centers and the summed internal inertias of clusters:

\[\frac{n-k}{k-1}\frac{\sum_{C_j}|C_j|d(c_j,c)}{\sum\limits_{C_j}\sum\limits_{x_i\in C_j} d(x_i,c_j)} \quad \text{and}\quad \frac{n-k}{k-1} \frac{\sum\limits_{C_j}\left(\sum\limits_{x_i}w_{ij}^\alpha\right) d(c_j,c)}{\sum_{C_j} \sum_{x_i} w_{ij}^\alpha d(x_i,c_j)}\]

for hard and fuzzy (soft) clusterings, respectively. Higher values indicate better quality.
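For the hard case, the formula can be evaluated by hand on a tiny dataset. The sketch below assumes squared Euclidean distance for $d$ (mirroring the documented default metric), takes the global center $c$ to be the mean of all data points, and assumes no cluster is empty; `sqeuclidean` and `calinski_harabasz_hard` are hypothetical helpers.

```julia
# Squared Euclidean distance between two points.
sqeuclidean(x, y) = sum(abs2, x .- y)

# Hypothetical helper: hard Calinski-Harabasz index from the formula above.
function calinski_harabasz_hard(X, centers, assignments; d = sqeuclidean)
    n, k = size(X, 2), size(centers, 2)
    c = vec(sum(X, dims = 2)) ./ n              # global center, assumed to be the data mean
    between = 0.0                               # weighted inertia of the cluster centers
    within = 0.0                                # summed inertia inside the clusters
    for j in 1:k
        members = findall(==(j), assignments)
        between += length(members) * d(centers[:, j], c)
        within += sum(d(X[:, i], centers[:, j]) for i in members)
    end
    return (n - k) / (k - 1) * between / within
end

X = [0.0 1.0 10.0 11.0;
     0.0 0.0  0.0  0.0]                         # four 2-dimensional points as columns
centers = [0.5 10.5;
           0.0  0.0]
ch = calinski_harabasz_hard(X, centers, [1, 1, 2, 2])   # 2 * 100 / 1 == 200.0
```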

Xie-Beni index

Xie-Beni index (option :xie_beni) measures ratio between summed inertia of clusters and the minimum distance between cluster centres:

\[\frac{\sum_{C_j}\sum_{x_i\in C_j}d(x_i,c_j)}{n\min\limits_{c_{j_1}\neq c_{j_2}} d(c_{j_1},c_{j_2}) } \quad \text{and}\quad \frac{\sum_{C_j}\sum_{x_i} w_{ij}^\alpha d(x_i,c_j)}{n\min\limits_{c_{j_1}\neq c_{j_2}} d(c_{j_1},c_{j_2}) }\]

for hard and fuzzy (soft) clusterings, respectively. Lower values indicate better quality.
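The hard variant can likewise be checked on a small example. This is a sketch that assumes squared Euclidean distance for $d$; `sqeuclidean` and `xie_beni_hard` are hypothetical helpers rather than package functions.

```julia
# Squared Euclidean distance between two points.
sqeuclidean(x, y) = sum(abs2, x .- y)

# Hypothetical helper: hard Xie-Beni index from the formula above.
function xie_beni_hard(X, centers, assignments; d = sqeuclidean)
    n, k = size(X, 2), size(centers, 2)
    # Summed distance of every point to the center of its own cluster.
    within = sum(d(X[:, i], centers[:, assignments[i]]) for i in 1:n)
    # Smallest distance between two distinct cluster centers.
    min_sep = minimum(d(centers[:, p], centers[:, q]) for p in 1:k-1 for q in p+1:k)
    return within / (n * min_sep)
end

X = [0.0 1.0 10.0 11.0;
     0.0 0.0  0.0  0.0]
centers = [0.5 10.5;
           0.0  0.0]
xb = xie_beni_hard(X, centers, [1, 1, 2, 2])   # 1 / (4 * 100) == 0.0025
```

Tight clusters with well-separated centers drive the numerator down and the denominator up, which is why lower values are better.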

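Davies-Bouldin index

Davies-Bouldin index (option :davies_bouldin) measures the similarity between each cluster and its most similar other cluster, averaged over all clusters, based on the average within-cluster distances and the distances between cluster centers: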

\[\frac{1}{k}\sum_{C_{j_1}}\max_{c_{j_2}\neq c_{j_1}}\frac{S(C_{j_1})+S(C_{j_2})}{d(c_{j_1},c_{j_2})}\]

where

\[S(C_j) = \frac{1}{|C_j|}\sum_{x_i\in C_j}d(x_i,c_j).\]

Lower values indicate better quality.
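A direct transcription of the two formulas is sketched below. It assumes squared Euclidean distance for $d$, at least two clusters, and no empty clusters; `sqeuclidean` and `davies_bouldin` are hypothetical helpers.

```julia
# Squared Euclidean distance between two points.
sqeuclidean(x, y) = sum(abs2, x .- y)

# Hypothetical helper: index computed from S(C_j) and the center distances.
function davies_bouldin(X, centers, assignments; d = sqeuclidean)
    k = size(centers, 2)
    S = zeros(k)                                # average distance to the own center
    for j in 1:k
        members = findall(==(j), assignments)
        S[j] = sum(d(X[:, i], centers[:, j]) for i in members) / length(members)
    end
    total = 0.0
    for p in 1:k
        # Worst (largest) similarity of cluster p to any other cluster.
        total += maximum((S[p] + S[q]) / d(centers[:, p], centers[:, q]) for q in 1:k if q != p)
    end
    return total / k
end

X = [0.0 1.0 10.0 11.0;
     0.0 0.0  0.0  0.0]
centers = [0.5 10.5;
           0.0  0.0]
db = davies_bouldin(X, centers, [1, 1, 2, 2])   # (0.25 + 0.25) / 100 == 0.005
```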

Dunn index

Dunn index (option :dunn) measures the ratio of the minimum distance between clusters to the maximum cluster diameter:

\[\frac{\min\limits_{ C_{j_1}\neq C_{j_2}} \mathrm{dist}(C_{j_1},C_{j_2})}{\max\limits_{C_j}\mathrm{diam}(C_j)}\]

where

\[\mathrm{dist}(C_{j_1},C_{j_2}) = \min\limits_{x_{i_1}\in C_{j_1},x_{i_2}\in C_{j_2}} d(x_{i_1},x_{i_2}),\quad \mathrm{diam}(C_j) = \max\limits_{x_{i_1},x_{i_2}\in C_j} d(x_{i_1},x_{i_2}).\]

It is a more computationally demanding quality index, but it can be used when the centres are not known. Higher values indicate better quality.
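Since only pairwise distances are involved, the definition can be sketched from a distance matrix alone. `dunn_index` is a hypothetical helper; it assumes at least two clusters and at least one cluster with more than one point.

```julia
# Hypothetical helper: Dunn index from assignments and a pairwise distance matrix.
function dunn_index(assignments, D)
    n = length(assignments)
    min_between = Inf      # smallest distance between points of different clusters
    max_diam = 0.0         # largest distance between points of the same cluster
    for i in 1:n-1, j in i+1:n
        if assignments[i] == assignments[j]
            max_diam = max(max_diam, D[i, j])
        else
            min_between = min(min_between, D[i, j])
        end
    end
    return min_between / max_diam
end

pts = [0.0, 1.0, 10.0, 11.0]
D = [abs(p - q) for p in pts, q in pts]    # 4×4 matrix of pairwise distances
dunn = dunn_index([1, 1, 2, 2], D)         # 9 / 1 == 9.0
```

The loop visits all n(n-1)/2 pairs, which illustrates why this index costs more than the center-based ones.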

Silhouettes

The silhouettes metric quantifies the correctness of point-to-cluster assignment by comparing the distance of the point to its own cluster with its distance to the other clusters.

The Silhouette value for the $i$-th data point is:

\[s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \ \text{where}\]

  • $a_i$ is the average distance from the $i$-th point to the other points in the same cluster $z_i$,
  • $b_i ≝ \min_{k \ne z_i} b_{ik}$, where $b_{ik}$ is the average distance from the $i$-th point to the points in the $k$-th cluster.

Note that $s_i \le 1$, and that $s_i$ is close to $1$ when the $i$-th point lies well within its own cluster. This property allows using average silhouette value mean(silhouettes(assignments, counts, X)) as a measure of clustering quality; it is also available using clustering_quality(...; quality_index = :silhouettes) method. Higher values indicate better separation of clusters w.r.t. point distances.
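The definition of $s_i$ can be transcribed directly from a pairwise distance matrix. The sketch below is not the package implementation: `silhouette_values` is a hypothetical helper, it assumes at least two clusters, and it sets $s_i = 0$ for points in singleton clusters (a convention assumed here because $a_i$ is undefined in that case).

```julia
# Hypothetical helper: per-point silhouette values from a distance matrix.
function silhouette_values(assignments, D)
    n = length(assignments)
    labels = unique(assignments)
    s = zeros(n)
    for i in 1:n
        zi = assignments[i]
        own = [j for j in 1:n if j != i && assignments[j] == zi]
        isempty(own) && continue                # singleton cluster: s_i stays 0 (assumed convention)
        a = sum(D[i, j] for j in own) / length(own)
        b = Inf
        for k in labels
            k == zi && continue
            members = findall(==(k), assignments)
            # Average distance from point i to cluster k; keep the closest cluster.
            b = min(b, sum(D[i, j] for j in members) / length(members))
        end
        s[i] = (b - a) / max(a, b)
    end
    return s
end

pts = [0.0, 1.0, 10.0, 11.0]
D = [abs(p - q) for p in pts, q in pts]
s = silhouette_values([1, 1, 2, 2], D)
mean_s = sum(s) / length(s)                     # average silhouette value
```

For the first point, $a_1 = 1$ and $b_1 = 10.5$, so $s_1 = 9.5 / 10.5 \approx 0.905$; all four values are close to 1 because the two clusters are well separated.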

Clustering.silhouettesFunction
silhouettes(assignments::Union{AbstractVector, ClusteringResult}, point_dists::Matrix) -> Vector{Float64}
 silhouettes(assignments::Union{AbstractVector, ClusteringResult}, points::Matrix;
-            metric::SemiMetric, [batch_size::Integer]) -> Vector{Float64}

Compute silhouette values for individual points w.r.t. given clustering.

Returns the $n$-length vector of silhouette values for each individual point.

Arguments

  • assignments::Union{AbstractVector{Int}, ClusteringResult}: the vector of point assignments (cluster indices)
  • points::AbstractMatrix: if metric is nothing, it is an $n×n$ matrix of pairwise distances between the points; otherwise it is a $d×n$ matrix of $d$-dimensional clustered data points.
  • metric::Union{SemiMetric, Nothing}: an instance of Distances Metric object or nothing, indicating the distance metric used for calculating point distances.
  • batch_size::Union{Integer, Nothing}: if integer is given, calculate silhouettes in batches of batch_size points each, throws DimensionMismatch if batched calculation is not supported by given metric.

References

Peter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Computational and Applied Mathematics. 20: 53–65.

Marco Gaido (2023). Distributed Silhouette Algorithm: Evaluating Clustering on Big Data

source

clustering_quality(..., quality_index=:silhouettes) provides mean silhouette metric for the datapoints. Higher values indicate better quality.

References

Olatz Arbelaitz et al. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition. 46 1: 243-256. doi:10.1016/j.patcog.2012.07.021

Aybükë Oztürk, Stéphane Lallich, Jérôme Darmont. (2018). A Visual Quality Index for Fuzzy C-Means. 14th International Conference on Artificial Intelligence Applications and Innovations (AIAI 2018). 546-555. doi:10.1007/978-3-319-92007-8_46.

Examples

Example data with 3 real clusters.

using Plots, Plots.PlotMeasures, Clustering
+            metric::SemiMetric, [batch_size::Integer]) -> Vector{Float64}

Compute silhouette values for individual points w.r.t. given clustering.

Returns the $n$-length vector of silhouette values for each individual point.

Arguments

  • assignments::Union{AbstractVector{Int}, ClusteringResult}: the vector of point assignments (cluster indices)
  • points::AbstractMatrix: if metric is nothing, it is an $n×n$ matrix of pairwise distances between the points; otherwise it is a $d×n$ matrix of $d$-dimensional clustered data points.
  • metric::Union{SemiMetric, Nothing}: an instance of Distances Metric object or nothing, indicating the distance metric used for calculating point distances.
  • batch_size::Union{Integer, Nothing}: if integer is given, calculate silhouettes in batches of batch_size points each, throws DimensionMismatch if batched calculation is not supported by given metric.

References

Peter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Computational and Applied Mathematics. 20: 53–65.

Marco Gaido (2023). Distributed Silhouette Algorithm: Evaluating Clustering on Big Data

source

clustering_quality(..., quality_index=:silhouettes) provides mean silhouette metric for the datapoints. Higher values indicate better quality.

References

Olatz Arbelaitz et al. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition. 46 1: 243-256. doi:10.1016/j.patcog.2012.07.021

Aybükë Oztürk, Stéphane Lallich, Jérôme Darmont. (2018). A Visual Quality Index for Fuzzy C-Means. 14th International Conference on Artificial Intelligence Applications and Innovations (AIAI 2018). 546-555. doi:10.1007/978-3-319-92007-8_46.

Examples

Example data with 3 real clusters.

using Plots, Plots.PlotMeasures, Clustering
 X_clusters = [(center = [4., 5.], std = 0.4, n = 10),
               (center = [9., -5.], std = 0.4, n = 5),
               (center = [-4., -9.], std = 1, n = 5)]
@@ -58,4 +58,4 @@
     xaxis = "N clusters", yaxis = "Quality",
     plot_title = "\"Soft\" clustering quality indices",
     size = (700, 350), left_margin = 10pt
-)

Other packages

  • ClusteringBenchmarks.jl provides benchmark datasets and implements additional methods for evaluating clustering performance.
+)

Other packages

  • ClusteringBenchmarks.jl provides benchmark datasets and implements additional methods for evaluating clustering performance.