How to Collapse Dendrogram Branches to n Number of Leaves

Hi, I need some guidance on how to keep dendrograms readable when they are plotted from thousands of observations. I want to use the dendrogram to decide how many clusters I should use for further analysis.

My current strategy is to collapse the branches so that there are only a limited number of leaves, say 10. Ideally, the number under the leaf would be the quantity of observations contained therein. I would then want to print out the indices from the data within that leaf. I have seen similar functionality in Matlab and in the truncate options of R, but I can’t seem to find anything in Julia. Can it be done by a novice such as myself? Here’s a simple working example:

using Clustering, Random, Distances, StatsPlots

# Set seed for reproducibility
Random.seed!(42)

# Generate a random dataset with 1000 observations and 8 attributes
num_observations = 1000
num_attributes = 8
data = rand(num_observations, num_attributes)  # Random values between 0 and 1

# Compute the pairwise distance matrix (Euclidean distance)
distance_matrix = pairwise(Euclidean(), data, dims=1)

# Perform hierarchical clustering
hclust_result = hclust(distance_matrix, linkage=:ward)

plot(hclust_result)

The resulting plot is shown below. Could anyone help me implement my plan, or convince me that there’s a better way? Yes, I’m new to Julia and new to discourse.julialang. Thank you so much for your help. Cheers!

Ok, this seems to work … and here is how I found it:

julia> hclust_result |> typeof
Hclust{Float64}

julia> methodswith(Hclust)
[1] getproperty(hclu::Hclust, prop::Symbol) @ Clustering ~/.julia/packages/Clustering/M6mjF/src/deprecate.jl:25
[2] propertynames(hclu::Hclust) @ Clustering ~/.julia/packages/Clustering/M6mjF/src/deprecate.jl:20
[3] propertynames(hclu::Hclust, private::Bool) @ Clustering ~/.julia/packages/Clustering/M6mjF/src/deprecate.jl:20
[4] cutree(hclu::Hclust; k, h) @ Clustering ~/.julia/packages/Clustering/M6mjF/src/hclust.jl:810

help?> cutree
search: cutree hclust_result ClusteringResult AbstractUnitRange

  cutree(hclu::Hclust; [k], [h]) -> Vector{Int}

...

julia> foo = cutree(hclust_result; k = 5)
1000-element Vector{Int64}:
 1
 2
 1
 1
 3

# Seems to return the cluster ID of each sample ...
# Fun (but inefficient) APL-like one-liner to get the set of indices for each cluster
julia> getindex.(Ref(eachindex(foo)), eachrow(unique(foo) .== foo'))
5-element Vector{Vector{Int64}}:
 [1, 3, 4, 10, 11, 13, 14, 20, 22, 26  …  939, 944, 951, 958, 965, 972, 981, 982, 989, 1000]
 [2, 7, 9, 18, 21, 25, 27, 29, 32, 34  …  975, 977, 979, 980, 983, 985, 987, 996, 997, 998]
...
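
If the one-liner is too cryptic, a plainer way to get both the size and the member indices of each cluster might look like the sketch below. It rebuilds the example from the first post so that it runs on its own; `members` and `k = 10` are just illustrative choices, not anything provided by Clustering.jl.

```julia
using Clustering, Distances, Random

# Rebuild the example data and clustering from the first post
Random.seed!(42)
data = rand(1000, 8)
distance_matrix = pairwise(Euclidean(), data, dims=1)
hclust_result = hclust(distance_matrix, linkage=:ward)

k = 10                                      # illustrative number of clusters
assignments = cutree(hclust_result; k = k)  # cluster ID of each observation

# Collect the observation indices of each cluster ID in a single pass
members = Dict{Int, Vector{Int}}()
for (i, c) in pairs(assignments)
    push!(get!(members, c, Int[]), i)
end

# Print the size of each cluster; members[c] holds the row indices into `data`
for c in sort(collect(keys(members)))
    println("cluster ", c, ": ", length(members[c]), " observations")
end
```

The counts printed here are the numbers one would want under each collapsed leaf, and `members[c]` gives the indices to print for leaf `c`.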

I was aware of cutree, but you helped me understand it better. Thanks! What I’m really after, though, is a plot of the resulting “cut” dendrogram. I’m afraid there is no easy way to do that yet…?

Ok, not sure if there is a builtin way. What would you like the plot to look like, i.e., how would you do it in R?
A quick and dirty way might be to just limit the y-scale, i.e., sort of like cutting at that height: plot(hclust_result; ylim = (4.3, Inf))
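
Rather than reading the 4.3 off the plot, the lower y-limit for a given number of branches could probably be computed from the merge heights stored in the result. The following is only a sketch: it assumes the `heights` field of `Hclust` holds one height per merge (it is sorted here to be safe), and `k` and `cut_height` are names made up for the example.

```julia
using Clustering, Distances, Random

# Rebuild the example clustering from the first post
Random.seed!(42)
data = rand(1000, 8)
hclust_result = hclust(pairwise(Euclidean(), data, dims=1), linkage=:ward)

k = 10                                  # desired number of visible branches (k >= 2)
heights = sort(hclust_result.heights)   # one height per merge, sorted ascending
n = length(heights) + 1                 # number of observations

# After n - k merges there are k clusters; the next merge happens at
# heights[n - k + 1]. Any height between the two separates the tree
# into k branches, so take the midpoint.
cut_height = (heights[n - k] + heights[n - k + 1]) / 2
println("cut height for ", k, " branches: ", cut_height)
```

With StatsPlots loaded, plot(hclust_result; ylim = (cut_height, Inf)) should then leave k branches crossing the bottom edge of the plot, and cutree(hclust_result; h = cut_height) should give the matching cluster assignments.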