Distance matrix + clustering with custom distance function

vshesh · December 18, 2020, 5:59am

Hi I’m new to julia and trying to figure out the best way to compute some distance matrices and evaluate how good my clustering is…

I have a custom distance function like this:

dist(w::Array{<:Real,1}, p::Int64 = 2) = (x, y) -> sum( w'.(min.(abs.(x-y), abs.((x-y) + repeat([0, 360, 180, 0, 0, 0, 0, 360],6)))).^p)^(1/p)

The reason its complicated is that it’s a distance function where some of the variables are periodic and some of the variables are on the real number line. Otherwise it’s a weighted minkowski norm. Also, the periodicity is 360 in some cases and 180 in others.

I have data in CSV files that I want to load for various classes, and I do so like this:

> data = Dict(splitext(basename(f))[1] => CSV.File(f) for f = Glob.glob("*.csv", "./sample-data/"))

Dict{String,CSV.File{false}} with 12 entries:
  "class1" => CSV.File("./sample-data/class1.csv"):…
  "class2"   => CSV.File("./sample-data/class2.csv"):…
  "class3" => CSV.File("./sample-data/class3.csv"):…
 ... etc

So far so good…

now I want to do things like:

compute a distance matrix between two of the classes. I tried Distances.jl; didn’t see a way to run a custom distance function in the documentation.
compute a distance matrix for all the data (~2000 rows, 48 features; would be a 2000x2000 distance matrix) so I can see whether my distance function accurately recovers the classes (elements of one class are “closer” than elements of other classes)
try clustering (agglomerative or density based) with this distance function and validate that the clusters I find are sensible and do not contain mixed data. meaning that I have a strong idea of what “distance” means in this data set
some kind of “search” to find good weights for the distance function

I will be iterating on the distance function so speed for the distance matrix would be helpful! (I get that nonstandard distance = have to figure this out myself, not clear how though).

I’m getting hung up on simple things like:

julia> dist(ones(48))(data["class1"][2], data["class2"][1])
ERROR: MethodError: no method matching -(::CSV.Row, ::CSV.Row)
Closest candidates are:
  -(::ChainRulesCore.DoesNotExist, ::Any) at /Users/Vishesh/.julia/packages/ChainRulesCore/PUnER/src/differential_arithmetic.jl:25
  -(::ChainRulesCore.Zero, ::Any) at /Users/Vishesh/.julia/packages/ChainRulesCore/PUnER/src/differential_arithmetic.jl:65
  -(::DataValues.DataValue{T1}, ::T2) where {T1, T2} at /Users/Vishesh/.julia/packages/DataValues/N7oeL/src/scalar/operations.jl:65
  ...
Stacktrace:
 [1] (::var"#56#57"{Array{Float64,1},Int64})(::CSV.Row, ::CSV.Row) at ./REPL[291]:1
 [2] top-level scope at REPL[311]:1

what? Also, another test like:

julia> Distances.pairwise(Distances.Euclidean(), data["class1"])
ERROR: MethodError: no method matching pairwise(::Type{Distances.Euclidean}, ::CSV.File{false})
Stacktrace:
 [1] top-level scope at REPL[313]:1
julia> Distances.pairwise(Distances.Euclidean(), data["class1"] |> Tables.matrix) # works
49×49 Array{Float64,2}:
... numbers

When I try my distance function it doesn’t work:

julia> Distances.pairwise(dist(ones(48)), data["class1"] |> Tables.matrix)
ERROR: MethodError: no method matching pairwise(::var"#56#57"{Array{Float64,1},Int64}, Array{Float64, 2}; dims)
Stacktrace:
 [1] top-level scope at REPL[320]:1
julia> Distances.pairwise((x,y) -> 1, data["class1"] |> Tables.matrix)
ERROR: MethodError: no method matching pairwise(::var"#66#67", ::Array{Float64,2})
Closest candidates are:
  pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T; dims) at ...
  pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T, ::AbstractArray{T,2} where T; dims) at ...
 [1] top-level scope at REPL[325]:1

I figured I’m travelling the wrong path here and it would be smarter to ask:

do I need to write my own distance matrix function or is there something fast out there I can use
is my distance function optimized? is there some more optimal way to write it?

Oscar_Smith · December 18, 2020, 6:08am

For distance matrix using Distances.jl, you just want R = pairwise(dist, X, Y, dims=2) where dist is your custom function.

vshesh · December 18, 2020, 6:29am

julia> Distances.pairwise(dist(ones(48)), data["class1"], data["class2"], dims=2)
ERROR: MethodError: no method matching pairwise(::var"#56#57"{Array{Float64,1},Int64}, ::CSV.File{false}, ::CSV.File{false}; dims=2)
Stacktrace:
 [1] top-level scope at REPL[343]:1

julia> Distances.pairwise(dist(ones(48)), data["class1"] |> Tables.matrix, data["class2"] |> Tables.matrix, dims=2)
ERROR: MethodError: no method matching pairwise(::var"#56#57"{Array{Float64,1},Int64}, ::Array{Float64,2}, ::Array{Float64,2}; dims=2)
Closest candidates are:
  pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T, ::AbstractArray{T,2} where T; dims) 
  pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T; dims) 
Stacktrace:
 [1] top-level scope at REPL[344]:1

is what I got when I tried that.

Oscar_Smith · December 18, 2020, 6:46am

you want your data as a Vector of Vectors rather than as a Matrix.

vshesh · December 18, 2020, 7:31am

doesn’t seem to be it?

julia> Distances.pairwise((x,y) -> 1, map(collect, data["class1"]))
ERROR: MethodError: no method matching pairwise(::var"#76#77", ::Array{Array{Any,1},1})
Stacktrace:
 [1] top-level scope at REPL[383]:1

julia> Distances.pairwise((x,y) -> 1, map(collect,data["class1"]), map(collect, data["class1"]))
ERROR: MethodError: no method matching pairwise(::var"#78#79", ::Array{Array{Any,1},1}, ::Array{Array{Any,1},1})
Stacktrace:
 [1] top-level scope at REPL[384]:1

julia> Distances.pairwise((x,y) -> 1, map(collect, data["class1"]), map(collect, data["class1"]), dims=2)
ERROR: MethodError: no method matching pairwise(::var"#80#81", ::Array{Array{Any,1},1}, ::Array{Array{Any,1},1}; dims=2)
Stacktrace:
 [1] top-level scope at REPL[386]:100

vshesh · December 18, 2020, 6:53pm

pairwise takes (::Distances.PreMetric, ::AbstractArray{T,2})
I think the question is how to register a custom metric as a PreMetric so the function compiles, or alternatively how to write a version of pairiwise that dispatches on any function that takes two vectors

Here’s as far as I got, based on what I see in Distances.jl:

struct WeightedPeriodicMinkowski{T, W, P <: Real} <: Distances.UnionMetric 
    periods: T
    weights: W
    p: P
end

# no idea how to do this part - is this right?
Distances.parameters(wpm::WeightedPeriodicMinkowski) = zip(wpm.periods, wpm.weights)

@inline function Distances.eval_op(d::WeightedPeriodicMinkowski, ai, bi, (Ti, wi))
    # Taken from the PeriodicEuclidean function 
    s1 = abs(ai - bi)
    s2 = mod(s1, Ti)
    # taken from the WeightedMinkowski function
    abs(min(s2, Ti - s2))^d.p * wi
end
@inline Distances.eval_end(d::WeightedPeriodicMinkowski, s) = s^(1/d.p)
wpminkowski(a, b, w, T, p) = WeightedPeriodicMinkowski(T, w, p)(a, b)

When I try:

julia> wpminkowski([1,2], [3,4], [1,1], [2,2], 2)
ERROR: MethodError: objects of type WeightedPeriodicMinkowski{Array{Int64,1},Array{Int64,1},Int64} are not callable

So I’m very lost.

vshesh · December 18, 2020, 8:43pm

Here’s what I had to do to get this working (basically copy Distances.jl’s internals and make changes):
Any ideas on how this could be accomplished faster/easier, please tell me! I’m really surprised that there is no simpler way to do this… I got lucky that the metric was a type that I could cobble together from the ones presented in the Distances.jl documentation, but I’m still not sure what I will do if the Distance function I try later is not of this type - write this whole thing again?

struct WeightedPeriodicMinkowski{T, W, P <: Real} <: Distances.UnionMetric 
    weights:: W
    periods:: T
    p:: P
end

Distances.parameters(wpm::WeightedPeriodicMinkowski) = (wpm.periods, wpm.weights)

@inline function Distances.eval_op(d::WeightedPeriodicMinkowski, ai, bi, Ti, wi)
    # Taken from the PeriodicEuclidean function 
    s1 = abs(ai - bi)
    s2 = mod(s1, Ti)
    # taken from the WeightedMinkowski function
    abs(min(s2, Ti - s2))^d.p * wi
end
@inline Distances.eval_end(d::WeightedPeriodicMinkowski, s) = s^(1/d.p)
wpminkowski(a, b, w, T, p) = WeightedPeriodicMinkowski(T, w, p)(a, b)
(w::WeightedPeriodicMinkowski)(a,b) = Distances._evaluate(w, a, b)

Distances.result_type(dist::Distances.UnionMetrics, ::Type{Ta}, ::Type{Tb}, (p1, p2)) where {Ta,Tb} =
    typeof(Distances._evaluate(dist, oneunit(Ta), oneunit(Tb), oneunit(eltype(p1)), oneunit(eltype(p2))))

function Distances._evaluate(dist::Distances.UnionMetrics, a::Number, b::Number, p1::Number, p2::Number)
    Distances.eval_end(dist, Distances.eval_op(dist, a, b, p1, p2))
end 
    
Base.@propagate_inbounds function Distances._evaluate(d::Distances.UnionMetrics, a::AbstractArray, b::AbstractArray, (p1, p2)::Tuple{AbstractArray, AbstractArray})
    @boundscheck if length(a) != length(b)
        throw(DimensionMismatch("first array has length $(length(a)) which does not match the length of the second, $(length(b))."))
    end
    @boundscheck if length(a) != length(p1)
        throw(DimensionMismatch("arrays have length $(length(a)) but parameter 1 has length $(length(p1))."))
    end
    @boundscheck if length(a) != length(p2)
        throw(DimensionMismatch("arrays have length $(length(a)) but parameter 2 has length $(length(p2))."))
    end
    if length(a) == 0
        return zero(result_type(d, a, b))
    end
    @inbounds begin
        s = eval_start(d, a, b)
        if (IndexStyle(a, b, p1, p2) === IndexLinear() && eachindex(a) == eachindex(b) == eachindex(p1)) == eachindex(p2)||
                axes(a) == axes(b) == axes(p) == axes(p2)
            @simd for I in eachindex(a, b, p1, p2)
                ai = a[I]
                bi = b[I]
                p1i = p1[I]
                p2i = p2[I]
                s = eval_reduce(d, s, eval_op(d, ai, bi, p1i, p2i))
            end
        else
            for (ai, bi, p1i, p2i) in zip(a, b, p1, p2)
                s = eval_reduce(d, s, eval_op(d, ai, bi, p1i, p2i))
            end
        end
        return eval_end(d, s)
    end
end

Then I am finally able to write:
Distances.pairwise(WeightedPeriodicMinkowski(repeat([Inf, 360, 180, Inf, Inf, Inf, Inf, 360],6), ones(48), 2), data["class1"] |> Tables.matrix |> transpose) and get results.

Topic		Replies	Views
Dbscan clustering with distance matrix General Usage clustering	6	1986	December 7, 2021
How to upgrade code for a symmetric distance matrix from julia-1.0 to julia-1.6 General Usage	17	979	March 3, 2022
HELP: Implementing K-means from scratch with Julia General Usage question	4	1485	February 6, 2020
Symmetric matrix is not symmetric General Usage question , linearalgebra , distances	2	745	February 17, 2021
Pairwise distances from a single column or vector New to Julia	7	3927	October 2, 2019

Distance matrix + clustering with custom distance function

Related topics