Real-number metric for quantifying the quality of a clustering output

Datseris · August 2, 2023, 8:54am

Hi all,

Given some data, and the output of DBSCAN applied to this data (or any other clustering algorithm really), I need a way to assign a real number that quantifies the “quality” of the DBSCAN clustering.

I am wondering if any of you have any idea regarding this, and whether you can point me to an already existing Julia implementation.

What I thought of doing: let’s use the following example which was the output of DBSCAN applied to 3-dimensional data:

I am thinking: first, compute the 3D histogram of these data (each cluster gets its own histogram). Then, fit a Gaussian to each histogram. Then, calculate the overlap of every Gaussian with every other Gaussian: the smaller the total overlap, the better the cluster quality (i.e., the more the clusters are separated in space).
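To make the overlap idea concrete, here is a rough sketch in plain Python (standard library only, so it is easy to port to Julia). It is not the exact procedure described above: it skips the histogram step and fits the Gaussians directly to the points, it assumes axis-aligned (diagonal-covariance) Gaussians, and it uses the Bhattacharyya coefficient as the overlap measure. All function names are made up for illustration, and every cluster is assumed to have nonzero spread in every dimension.

```python
import math
from itertools import combinations

def fit_diagonal_gaussian(points):
    """Return (means, variances) per dimension for one cluster.
    Assumption: covariances between dimensions are ignored."""
    n = len(points)
    dim = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    variances = [sum((p[d] - means[d]) ** 2 for p in points) / n for d in range(dim)]
    return means, variances

def bhattacharyya_coefficient(g1, g2):
    """Overlap in [0, 1] of two diagonal Gaussians; 1 means identical."""
    (m1, v1), (m2, v2) = g1, g2
    dist = 0.0
    for mu1, var1, mu2, var2 in zip(m1, v1, m2, v2):
        var = 0.5 * (var1 + var2)                      # averaged variance
        dist += 0.125 * (mu1 - mu2) ** 2 / var         # separation of the means
        dist += 0.5 * math.log(var / math.sqrt(var1 * var2))  # mismatch of the widths
    return math.exp(-dist)

def total_overlap(points, labels):
    """Sum of pairwise overlaps over all clusters; smaller is better."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    gaussians = [fit_diagonal_gaussian(c) for c in clusters.values()]
    return sum(bhattacharyya_coefficient(a, b) for a, b in combinations(gaussians, 2))
```

Noise points from DBSCAN would have to be dropped before calling `total_overlap`, and the sum grows with the number of clusters, so comparing clusterings with different cluster counts probably needs some normalization.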

The context of my question is: I have several data dimensions that I could be using for the clustering algorithm, and it is of scientific relevance to learn which of these data dimensions would yield the best clustering. E.g., I have 1000 20-dimensional points, but I cluster using only 3 out of these 20 dimensions. Which 3 give the best clustering quality?

Rudi79 · August 2, 2023, 9:23am

AFAIK this is usually done with the silhouette score (see the Wikipedia article “Silhouette (clustering)”).
It basically compares the inter-cluster distances to the intra-cluster distances.
An implementation is available in Clustering.jl (GitHub: JuliaStats/Clustering.jl, a Julia package for data clustering).
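
For reference, the silhouette computation itself is short. Below is a minimal sketch in plain Python (standard library only, Python 3.8+ for `math.dist`); it is not the Clustering.jl implementation, and the function names are made up for illustration. It assumes Euclidean distances, at least two clusters, and that DBSCAN noise points have already been removed.

```python
import math

def silhouette_scores(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b), where
    a = mean distance to the other points of the same cluster and
    b = smallest mean distance to the points of any other cluster."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

def mean_silhouette(points, labels):
    """Single real number in [-1, 1]; larger means better-separated clusters."""
    s = silhouette_scores(points, labels)
    return sum(s) / len(s)
```

The mean over all points gives the single real number asked for, so the same data clustered on different subsets of dimensions can be ranked by it. This version is quadratic in the number of points, which should be fine for 1000 points.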

The last part of your post suggests that you are also interested in comparing different clusterings.
In that case I suggest having a look at
https://publikationen.bibliothek.kit.edu/1000011477/812079 (Comparing Clusterings - An Overview)

juliohm · August 2, 2023, 9:37am

Notice that many clustering algorithms are based on such clustering scores. DBSCAN for example uses its own definition of score to assign labels to points as core points, edge points, etc. You should pick the clustering algorithm based on the definition of “quality” you expect.

This algorithm is a special case of the general EM algorithm. When you have latent Gaussian variables and apply the EM algorithm, you get K-means clustering. Other clustering procedures can be obtained with this formalism, and you can take a look at packages for EM in Julia; they probably exist.
