Statsplots dendrogram - how to have clustered coloring

I’m new to Julia, but already impressed by it’s syntax and by statsplots. However, some functionality is still missing that I miss from Python/R.

Specifically for this post, I would like to color-code the output of clustering.hclust, as is seen in python/R, so that it looks more like this:

Here is a minimally working example to get a distance matrix and dendrogram with group behavior.

using Clustering
using StatsPlots
using Distances

#Generate a random dataset, but with 10 correlated structures
data_matrix = rand(50,1000)
for i=1:10
    data_matrix[:, 1:i*100] .+= .25
end

#Generate distance matrix, perform hierarchical clustering, plot
dist_mat = pairwise(Euclidean(1e-12), data_matrix, dims=2)
hcl1 = hclust(dist_mat, linkage=:ward)
dg = plot(hcl1, xticks=false, yticks=true)
plot!(size=(800,200))

I am using the GR backend.

Considering I’m just starting to use julia, is this problem worth solving in pure julia? Is there a better way to approach this? e.g., use a different plotting package/backend, or import python/R functions and not waste time?

Now, here is how far I’ve gotten, and the direction of “pure julia” to get this to work. However, accessing the line elements seems really unintuitive and was very difficult to even find.

function dendrocolor(dg)
    for i=1:10000
        try
            dg[1][i].plotattributes[:linecolor] = RGBA{Float64}(0.8888735002725198,0.43564919034818994,0.2781229361419438,1.0)
        catch e
            break
        end
    end
end

This will change the entire dendrogram orange. I’m thinking I could

  • Use this function and call each line separately in the dendrogram
  • leverage hcl1 structure which contains merges and height
  • use the clustering.cutree function to accomplish this

However, I don’t know where to start parsing all that together, and it seems like a lot of work. Is there a better way?

All advice and help is appreciated!
Clay

You want some nodes/lines to have different colors, and the user to specify those colors by themselves. You can do that by passing a group argument to plot when you plot a Hclust object. Try passing a random group vector of e.g. 4 different integers to plot, eg dg = plot(hcl1, group = my_random_vector) and see what happens. The right length I guess should be the same as the number of nodes in the dendrogram. Then the task reduces to finding the right way/order of specifying the group vector.

ok, so if I use the cutree function I can get the cluster integers as a vector that matches the number of branches (1000), and that’s what I would expect should go in the group argument.

cluster_idxs = cutree(hcl1, k=9)
dg = plot(hcl1, xticks=false, yticks=true, group=cluster_idxs)

This error is returned though, so I guess I don’t understand what this argument is looking for.

ERROR: BoundsError: attempt to access 4-element Array{Float64,1} at index [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  192, 193, 194, 195, 196, 197, 198, 199, 200, 260]]
Stacktrace:
 [1] throw_boundserror(::Array{Float64,1}, ::Tuple{Array{Int64,1}}) at .\abstractarray.jl:541
 [2] checkbounds at .\abstractarray.jl:506 [inlined]
 [3] _getindex at .\multidimensional.jl:742 [inlined]
 [4] getindex at .\abstractarray.jl:1060 [inlined]
 [5] filter_data(::Array{Float64,1}, ::Array{Int64,1}) at C:\Users\Clay\.julia\packages\RecipesPipeline\uPBKQ\src\group.jl:46
 [6] filter_data!(::Dict{Symbol,Any}, ::Array{Int64,1}) at C:\Users\Clay\.julia\packages\RecipesPipeline\uPBKQ\src\group.jl:51
 [7] _filter_input_data!(::Dict{Symbol,Any}) at C:\Users\Clay\.julia\packages\RecipesPipeline\uPBKQ\src\group.jl:58
 [8] _finish_userrecipe!(::Plots.Plot{Plots.GRBackend}, ::Array{Dict{Symbol,Any},1}, ::RecipeData) at C:\Users\Clay\.julia\packages\RecipesPipeline\uPBKQ\src\user_recipe.jl:104
 [9] _process_userrecipes!(::Any, ::Any, ::Any) at C:\Users\Clay\.julia\packages\RecipesPipeline\uPBKQ\src\user_recipe.jl:34
 [10] recipe_pipeline!(::Any, ::Any, ::Any) at C:\Users\Clay\.julia\packages\RecipesPipeline\uPBKQ\src\RecipesPipeline.jl:70
 [11] _plot!(::Plots.Plot, ::Any, ::Any) at C:\Users\Clay\.julia\packages\Plots\vsE7b\src\plot.jl:172
 [12] #plot#129 at C:\Users\Clay\.julia\packages\Plots\vsE7b\src\plot.jl:58 [inlined]
 [13] top-level scope at none:1

I also made a Array{Array{Int64,1},1} of arbitrary indices in 9 groups that also didn’t work.

Could you be more explicit in what “group=” expects?

Ah OK, so something in the implementation of the recipe is thwarting the group argument - @kevbonham I think you wrote that recipe - can you easily see how it could be modified to handle group inside?

Oof - sorry I didn’t see this until just now :scream:. Somehow the notification got buried.

It’s been so long I don’t remember the implementation exactly, but I will try taking a look this week. I think it would be awesome to have this functionality, but I don’t think it will be trivial. @alailink would you please open an issue on Stats Plots and tag me (I’m @kescobo on GitHub)? I’m less likely to lose track of it there. Sorry again for taking so long to get back to you

1 Like

thanks for taking an interest in this problem! I opened an issue. Let me know if I can help you develop it at all! I’m new to Julia but I’ve programmed in matlab/python/R.

Also, as a quick side note, graphing in Julia feels so slow. I’m using Atom+Juno+Gre. A large heatmap takes ages to load on my fast computer, and “quick” graphs take seconds where matlab/python would be instant. Any advice?

@kevbonham I really hope it shouldn’t be too involved, I guess the way grouping is implemented for groupedboxplot could be used essentially?

@alailink that’s really strange, most plots should take around 100 ms to produce - which is not instantaneous but should be fast enough. Could be a Juno thing? Atom Juno isn’t really developed anymore

Unless we’re talking about the first plot of course. It’s better than it was, but still slow for someone coming from matlab/python/R

I don’t think so - the plot object is sort of a custom construct. Will discuss more in the issue when I’ve had a chance to look.

Is there resolution how to add colors?!

Unfortunately this issue was surprisingly complicated. It has been moved to a longer-term project as a github PR here: https://github.com/JuliaPlots/StatsPlots.jl/pull/411
and statsmodels issue here: https://github.com/JuliaPlots/StatsPlots.jl/issues/408#issuecomment-742609986

Yeah, it’s a bit of a beast. Definitely will be worth having though. Don’t feel like you need to wait for something that’s working - commit and push away - all of that history will be squashed before it’s merged into the main branch. Assuming you’ve done anything - also totally understandable if you haven’t :smile:.

If some people still have this problem, i’ve got an alternative solution. You can color the background of the different cluster. And this solution also show the height where the dendogram have been cut !

using Clustering, StatsPlots

D = [ 0   9  3  6  11
      9   0  7  5  10
      3   7  0  9   2
      6   5  9  0   8
      11  10 2  8   0]
hcl1 = hclust(D, linkage=:complete, branchorder=:r)

d=7.0
y_hat = Int16.(cutree(hcl1,h=d))

extremums = Vector(undef,length(unique(y_hat)))
for (k,i) in enumerate(unique(y_hat))
    extremums[k] = extrema([findfirst(x -> x .== j,hcl1.order) for j in findall(x -> x ==i,y_hat)]) .+ (-0.5,0.5)
end
plot([Shape([m,M,M,m],[0,0,d,d]) for (m,M) in extremums]; alpha = 0.5)
plot!(hcl1, xticks = (1:length(y_hat),string.(hcl1.order)), xlabel="node", ylabel=string(hcl1.linkage) * " linkage distance")

From a clustering prediction y_hat you obtain this kind of plot

2 Likes