I’m trying to use Clustering.jl.
I’d like to obtain the values of silhouettes() from a KmeansResult.
I understand that silhouettes() requires assignments, counts, and dists.
The assignments and counts can be taken from the KmeansResult.
However, I can’t figure out how to build dists.
If you know how, please help me.
Sample code would be very helpful.
So the code used in kmeans.jl looks like this
dmat = pairwise(distance, centers, x) ## with distance = SqEuclidean()
dmat = convert(Array{T}, dmat) ## T <: AbstractFloat
Then, from Wikipedia:
For each datum $i$, let $a(i)$ be the average distance between $i$ and all other data within the same cluster.
We then define the average dissimilarity (difference / variance, according to Wikipedia) of point $i$ to a cluster $c$ as the average of the distance from $i$ to all points in $c$.
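For reference, the two quantities quoted above combine into the silhouette value as sketched below. This is the standard definition; the symbols $C_i$ (the cluster containing $i$), $d(i, j)$ (the distance between points $i$ and $j$) and $b(i)$ (the smallest average dissimilarity of $i$ to any cluster it does not belong to) are notation chosen here, not taken from the thread.

```latex
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j), \qquad
b(i) = \min_{c \neq C_i} \frac{1}{|c|} \sum_{j \in c} d(i, j), \qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

Every term is a distance between two data points, which is why a point-to-point distance matrix is the natural input.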
I’m pretty sure we can cheat here a little and take the distance from a point to a cluster centre as a stand-in for the average distance from the point to all points in that cluster. It may only be a bit inaccurate, mostly for the cluster the point itself belongs to?
for a bit of fun
using Distances, Statistics
function short_silho(x, cents, inds)
    n = size(x, 2)    ## number of points, one per column
    c = size(cents, 2)    ## number of clusters
    icmdist = zeros(n)    ## mean distance to the other points of the same cluster
    xvec = falses(n)
    maxval = typemax(eltype(x))
    for i = 1:c
        xvec .= inds .== i
        icmdist[xvec] = vec(intracluster(x[:, xvec]))
    end
    dists = pairwise(SqEuclidean(), x, cents, dims=2)    ## n x c, point to centre
    for j = 1:n
        dists[j, inds[j]] = maxval    ## mask each point's own cluster
    end
    x2, y = findmin(dists, dims=2)    ## nearest other centre for each point
    inds2 = vec(map(x3 -> x3[2], y))
    (icmdist, vec(x2), inds2)
end
## est_silho = (x2 .- icmdist) ./ max.(icmdist, x2)    ## i.e. (b - a) / max(a, b)
A little off-track, but here’s how to do the intra-cluster average. The silhouettes() function may work with something similar, possibly just an all-against-all n×n distance matrix.
using Distances
function intracluster(x) ## expects column data
    ## row sums of the pairwise matrix, divided by the number of other points
    sum(pairwise(SqEuclidean(), x, x, dims=2), dims=2) ./ (size(x, 2) - 1)
end
Thank you for your reply.
I can probably build a dmat in my code.
However,
silhouettes(results.assignments, results.counts, dmat)
gives this error message:
DimensionMismatch(“The size of a distance matrix ((645, 40)) doesn’t match the length of assignment vector (645).”)
What’s wrong?
Yes, so dmat needs to be a point × point pairwise distance matrix, (645, 645). Source: silhouette.jl
How did you make dmat?
pointdata = [row1; row2; ...]
dmat = pairwise(SqEuclidean(), pointdata', pointdata', dims=2)
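To see why the matrix has to be point × point, here is a minimal sketch that computes silhouette values by hand using only the standard library. The function `naive_silhouettes` and the toy data are hypothetical illustrations, not part of Clustering.jl, and the library's own implementation may differ in details such as how singleton clusters are treated.

```julia
using Statistics

# Hypothetical helper (not part of Clustering.jl): silhouette values computed
# directly from an n×n matrix of point-to-point distances.
function naive_silhouettes(assignments::Vector{Int}, dists::Matrix{Float64})
    n = length(assignments)
    size(dists) == (n, n) || throw(DimensionMismatch("dists must be $n×$n"))
    k = maximum(assignments)
    sil = zeros(n)
    for i in 1:n
        own = assignments[i]
        # a(i): mean distance to the other members of i's own cluster
        mates = [j for j in 1:n if assignments[j] == own && j != i]
        isempty(mates) && continue        # singleton cluster: leave s(i) = 0
        a = mean(dists[i, mates])
        # b(i): smallest mean distance to the members of any other cluster
        b = Inf
        for c in 1:k
            c == own && continue
            members = findall(==(c), assignments)
            isempty(members) && continue
            b = min(b, mean(dists[i, members]))
        end
        isfinite(b) || continue           # only one non-empty cluster: leave s(i) = 0
        sil[i] = (b - a) / max(a, b)
    end
    return sil
end

# Four points on a line forming two obvious clusters.
pts = [0.0, 1.0, 10.0, 11.0]
dists = [abs(p - q) for p in pts, q in pts]   # 4×4 point-to-point matrix
sil = naive_silhouettes([1, 1, 2, 2], dists)
```

Both a(i) and b(i) index `dists` by pairs of points, so a centers × points matrix such as (40, 645) cannot supply the values they need.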
I’m using a data set, which is 4 conditions x 645 sample points.
This is my code.
data_matrix = collect(transpose(convert(Array{Float64,2}, df[2:size(df)[2]])))
4×645 Array{Float64,2}:
-1.01588 -1.3268 -1.84882 1.7744 … -1.47669 -0.982057 -1.01758
-1.21613 -1.79652 -1.7698 2.03358 -1.3536 -1.30077 -1.08561
-1.12921 -1.61318 -1.68198 2.68172 -1.35968 -1.19259 -0.926614
-1.11933 -1.51307 -1.78798 1.96529 -1.22089 -1.35186 -1.15993
using Distances
distance = SqEuclidean()
dmat = pairwise(distance, results.centers, data_matrix, dims=2)
dmat = convert(Array{T} where T <: AbstractFloat, dmat)
40×645 Array{Float64,2}:
2.77737 0.717969 0.171259 … 2.25724 3.2921
33.5094 44.5914 50.2691 35.5992 31.8187
0.184876 1.78681 3.00827 0.40055 0.0972948
24.0947 33.6019 38.6445 25.8204 22.6943
174.776 198.914 210.822 179.279 171.226
0.0184391 0.690746 1.53246 … 0.0325261 0.0728016
43.5018 56.1165 62.1183 45.9888 41.511
0.388352 0.331825 0.787307 0.373377 0.575925
28.4421 19.8494 16.2553 26.6771 30.032
93.3091 110.807 120.429 96.6973 90.7628
81.1399 98.1616 105.893 … 84.4855 78.4354
180.118 204.812 216.627 184.784 176.043
1.3397 0.229281 0.0684447 1.0595 1.71681
⋮ ⋱
121.13 141.542 151.434 125.021 118.017
64.5956 79.7279 87.0569 67.5645 62.2097
227.882 255.192 269.258 … 233.022 223.734
7.2145 3.74843 2.12279 6.54258 8.01459
24.7444 34.4854 39.2543 26.6141 23.2747
217.84 244.651 257.916 222.85 213.902
0.071173 1.23292 2.1302 0.2402 0.0621958
5.76897 2.65864 1.2775 … 5.25106 6.41967
9.15671 4.65328 2.98532 8.18652 10.0723
106.857 125.937 135.134 110.715 103.776
14.7913 8.98791 6.43891 13.5623 15.8976
49.4212 62.6724 69.4475 51.9042 47.365
I believe it wants the distances between each pair of sample points:
dmat = pairwise(distance, data_matrix, dims=2) ## Equivalent to pairwise(distance, data_matrix, data_matrix, dims=2)
dmat = convert(Array{T} where T <: AbstractFloat, dmat)
size(dmat)
> (645, 645)
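Putting the pieces together, an end-to-end sketch might look like the following. It assumes reasonably recent versions of Clustering.jl and Distances.jl (where `pairwise` on matrices takes a `dims` keyword), and it uses random data as a stand-in for the real data set; the seed and the random matrix are illustrative only.

```julia
using Clustering, Distances, Random

Random.seed!(1)                     # illustrative random data, not the thread's data set
data_matrix = randn(4, 645)         # 4 conditions × 645 sample points, one point per column
results = kmeans(data_matrix, 40)   # same number of clusters as in the thread

# Distances between every pair of columns: 645×645, not centers-to-points (40×645).
dmat = pairwise(SqEuclidean(), data_matrix, dims=2)

# One silhouette value per sample point.
sil = silhouettes(results.assignments, results.counts, dmat)
```

The result `sil` has one entry per point, so summaries such as its mean can be compared across different cluster counts.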
I got it! Now I can get silhouettes() results!
I did not understand how these functions work…
Anyway, thank you y4lu!
Your help solved my problem.
I’ll read your previous post again along with the additional descriptions.
I probably still don’t fully understand what you described there…