Hi, I need some guidance on keeping dendrograms readable when they are plotted from thousands of observations. I want to use the dendrogram to decide how many clusters to use for further analysis.
My current strategy is to collapse the branches so that only a limited number of leaves remain, say 10. Ideally, each leaf would be labeled with the number of observations it contains. I would then want to print the indices of the data rows that belong to that leaf. I have seen similar functionality in MATLAB and in the truncate options of R, but I can’t seem to find anything in Julia. Can it be done by a novice such as myself? Here’s a simple working example:
```julia
using Clustering, Random, Distances, StatsPlots

# Set seed for reproducibility
Random.seed!(42)

# Generate a random dataset with 1000 observations and 8 attributes
num_observations = 1000
num_attributes = 8
data = rand(num_observations, num_attributes) # Random values between 0 and 1

# Compute the pairwise distance matrix (Euclidean distance) between rows
distance_matrix = pairwise(Euclidean(), data, dims=1)

# Perform hierarchical clustering
hclust_result = hclust(distance_matrix, linkage=:ward)
plot(hclust_result)
```
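For the counting and index-printing half of my plan, here is a rough sketch of what I have in mind. It assumes that `cutree` from Clustering.jl is the right tool for cutting the tree into a fixed number of groups; the names `k`, `assignments`, and `members` are my own, and `k = 10` is just the arbitrary leaf count mentioned above. It does not draw a truncated dendrogram, which is the part I still can't figure out.

```julia
using Clustering, Distances, Random

# Same toy data as in the example above
Random.seed!(42)
data = rand(1000, 8)
distance_matrix = pairwise(Euclidean(), data, dims=1)
hclust_result = hclust(distance_matrix, linkage=:ward)

# Cut the tree into k groups (k = 10 is an arbitrary choice)
k = 10
assignments = cutree(hclust_result; k=k)  # one group label per observation

# For each group label, collect the row indices of `data` assigned to it
labels = sort(unique(assignments))
members = [findall(==(c), assignments) for c in labels]

# Print the size of each group and its first few row indices
for (c, idx) in zip(labels, members)
    println("group ", c, ": ", length(idx), " observations, e.g. rows ", idx[1:min(end, 5)])
end
```

Each entry of `members` is the full list of row indices for one collapsed "leaf", so `data[members[1], :]` would pull out the observations in the first group.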
The dendrogram produced by `plot(hclust_result)` is shown below. Could anyone help me implement my plan, or convince me that there’s a better way? Yes, I’m new to Julia and new to discourse.julialang. Thank you so much for your help. Cheers!