Partitioning Around Medoids (PAM) in Julia

I’m unable to find any package that implements this algorithm. I’m going to attempt to implement the algorithm myself but I’m not sure if I have the know-how to get it over the finish line, to be perfectly honest. Would anyone else be interested in working on this?

There are a couple of really good resources that discuss the algorithm and they even include pseudocode here:

I fooled around with it for a couple of hours today and came up with this (mess) for the BUILD phase (I’m not sure that it’s right/works):

using Distances

X = rand(3, 5)
D = pairwise(Euclidean(), X, dims=2)

k = 2
N = size(D,1)

# BUILD
m = Int[]

while length(m) < k
    
    td = Float64[]

    for a in 1:N
        medoids = Int[m...]
        push!(medoids,a)
        g = Float64[]
        for i in 1:N
            min_dist = minimum([D[i,medoid] for medoid in medoids])
            push!(g, min_dist)
        end
        push!(td, sum(g))
    end
    push!(m, findfirst(x -> x == minimum(td), td))

end
# SWAP phase

If anyone is interested in working on this, let me know as I’ll be doing the same. Also, if someone successfully implements it before I do, please let me know! :slightly_smiling_face:

2 Likes

if I didn’t have a million other things to be doing right now I definitely would. But in the off-chance you’re still working on this a few days from now, I can probably lend a hand.

1 Like

But in the off-chance you’re still working on this a few days from now

We’ll see :wink:. I started following the pseudo code in the second document instead as I’m finding it a bit easier to understand.

I would recommend the following course of action, which we went through when we develop ParallelKMeans.jl.

  1. Write any version of the algorithm you like. Do not think about optimization or generalization. What you need is a working algorithm, that can run from the beginning to the end.
  2. Since it’s more or less deterministic, find implementation in any other language and build a test set, to verify the correctness of your calculations.
  3. At this stage you can start optimizing the algorithm, performance tips is your best friend here. There are many things that are inefficient now, for example:
    3.1. D initialization is bad for performance, much better to calculate pairwise distances on the fly. It is non allocating and generally faster.
    3.2 Do not generate arrays with Float64[] and push!. Instead, it’s better to allocate vector once with g = Vector{Float64}(undef, N) and then fill it with g[i] = min_dist.
    3.3 Scalars are better than vectors, so if it is possible, try to avoid vectors. For example, if you need a sum of minimum values, it’s faster to update scalar value
g = 0.
for i in 1:N
     min_dist = minimum([D[i,medoid] for medoid in medoids]) # This can be optimized to zero allocations too
     g += min_dist
end
td[a] = g

3.4 Use BecnhmarkToosl.jl, Profiling and/or profile utils ProfileView.jl, ProfileVega.jl to find bottlenecks in your code.

  1. After everything is optimized you can turn your attention to a parallel version of your code. I’ve used manual chunking, but nowadays there are good packages like ThreadsX.jl or Transducers.jl which can make the transition to a parallel version of the code rather simple. But it would make sense to benchmark on relatively large matrices since there is an overhead to using multi-threading, I guess that with less than 10_000 points, a single-threaded version can be faster.

EDIT: Well, first point (regarding D) is not correct, since it looks like it is calculated only once and used many times, unlike k-means where it was recalculated on each iteration. But still, you can use the fact that minimum([D[i,medoid] for medoid in medoids]) is known from the previous step for all values of medoid except the last one. So, instead of searching for minimum over vector, you can compare two values - from the previous step and D[i, a]. This should be very fast.

4 Likes

This is all really excellent advice/information. Thank you very much :slightly_smiling_face:

I think I’ve accomplished this. Here’s the code:

using Distances

X = rand(2, 100)
D = pairwise(Euclidean(), X, dims=2)

k = 4

# BUILD
function build_phase(D,k)
    N = size(D,1)
    total_dists = [sum(D[:,j]) for j in 1:N]

    # initialize medoids with index of object with shortest distance to all others
    medoids = Int[findmin(total_dists)[2]]
    for j in 1:k-1 
        TD = Vector{Float64}(undef,N)
        for a in 1:N
            td = 0.0
            for i in 1:N
                td += minimum(vcat(D[a,i],[D[i,m] for m in medoids]...))
            end
            TD[a] = td
        end
        push!(medoids, findmin(TD)[2])
    end

    return medoids
end

julia> M = build_phase(D,k)
4-element Array{Int64,1}:
 28
 76
 85
 19

# SWAP
function swap_phase(D,k,M)
    # Perform clustering
    assignments = [findmin([D[i,m] for m in M])[2] for i in 1:size(D,1)]

    Mⱼ = Vector{Int}(undef,k)

    # Find minimum sum for each cluster (i.e. find the best medoid)
    for j in 1:k
        distances = sum(D[assignments .== j, assignments .== j][:,i] for i in 1:sum(assignments .== j))
        Mⱼ[j] = findfirst(x -> x == findmin(distances)[2], cumsum(assignments .== j))
    end

    if sort(M) == sort(Mⱼ) 
        return (medoids=Mⱼ,assignments=assignments)
    else
        swap_phase(D,k,Mⱼ)
    end
end

julia> pam_results = swap_phase(D,k,M)
(medoids = [47, 31, 85, 6], assignments = [2, 2, 4, 4, 1, 4, 2, 2, 1, 4  …  1, 1, 3, 3, 2, 3, 1, 1, 4, 3])

Here’s a visual comparison of the results from this algorithm and the version of k-medoids that’s implemented in Clustering.jl:

And, finally, the mean silhouette scores on this run:

julia> pam_score = mean(silhouettes(pam_results.assignments, D))
0.4713220216490007

julia> kmeds_score = mean(silhouettes(clust_results.assignments, D))
0.47234564761469566

k-medoids was marginally better in this case. I guess I should install R and check the output of the pam function available in that language with the output of this to ensure it’s accurate and then, if so, I can start attempting to optimize this :slightly_smiling_face:

1 Like

If anyone wants to contribute, I started a repo:

It would be nice to get this code cleaned up and optimized and either added to Clustering.jl (after making it compatible with that package, of course) or added to the registry as a standalone package.

Also, just to be clear about how horribly inefficient this code is :laughing::

julia> @btime pam($D,$k)
  5.197 ms (184206 allocations: 10.92 MiB)
(medoids = [46, 70, 84, 69], assignments = [2, 2, 4, 1, 3, 1, 3, 3, 1, 1  …  3, 1, 2, 1, 4, 4, 4, 4, 4, 2])

julia> @btime kmedoids($D,$k)
  8.100 μs (55 allocations: 27.31 KiB)
KmedoidsResult{Float64}([27, 37, 11, 95], [2, 2, 4, 3, 1, 4, 1, 1, 3, 3  …  1, 3, 2, 2, 4, 4, 4, 4, 4, 2], [0.3380151599832644, 0.200109974021987, 0.1846786187683679, 0.14954767553527495, 0.16738605596370776, 0.2081816279228162, 0.1228050196903722, 0.08344742216114753, 0.08580220639346338, 0.11769923453023014  
…  0.2936940431576039, 0.2882888866684584, 0.19042900853117145, 0.2626131861910653, 0.0, 0.35390494465568934, 0.20319969883135353, 0.11001061668464789, 0.15650986531407027, 0.05093615048340031], [22, 26, 20, 32], 18.065680820694467, 2, true)

UPDATE: I’ve been doing more comparisons and it seems to be a toss-up. Sometimes this PAM implementation does better, sometimes kmedoids from Clustering.jl does better, at least when judging by mean silhouette scores. I guess the only advantage to PAM is that it’s deterministic, which actually may be a real benefit for my use case.

2 Likes

@mthelm85
If the code is clean enough to share I recommend taking it to Slack and going to the @performance-helpdesk chat and asking for pointers :). I could probably help a little but the people in there can teach you some awesome stuff.

Also great job. One small recommendation I have is - when there isn’t a rigorous/analytic test to make, compare a couple empirical datasets. At the end of the day, exact matches aren’t required, but they can indicate issues early on before someone tries to scale up:).

1 Like

I’ve been comparing results against the pam function from R’s clustering package and the results are almost always the same. When they are not the same (i.e. when the medoids are different) the version I’ve implemented here almost always results in better mean silhouette scores. I’m not going to dig into the R implementation to figure out what’s different - I think I’m going to start optimizing this version and later I’ll do a more rigorous comparison and post results.

1 Like

Thanks to the amazing people on the Slack channel, significant optimizations have been made (though there are plenty more to be had, undoubtedly):

julia> @btime pam($D,$k)
  168.500 μs (546 allocations: 209.33 KiB)
(medoids = [38, 41, 96, 100], assignments = [3, 3, 2, 3, 2, 3, 3, 1, 1, 4  …  4, 2, 3, 1, 4, 3, 1, 3, 2, 4])
1 Like

I’ve added a decent README to the package as well as some more rigorous comparisons of clustering quality that you can check out at the repo:

To sum it up, PAM.jl does a better job of k-medoids clustering than Clustering.jl (at the expense of performance) but it’s not as good as what’s been implemented in R’s cluster library.

I don’t think I’m going to submit this to the registry; at least not yet. I’ll open an issue at Clustering.jl to see if they are interested in adapting the code to work with that package and then go from there.

2 Likes