Distance matrix + clustering with custom distance function

Hi I’m new to julia and trying to figure out the best way to compute some distance matrices and evaluate how good my clustering is…

I have a custom distance function like this:

dist(w::Array{<:Real,1}, p::Int64 = 2) = (x, y) -> sum( w'.(min.(abs.(x-y), abs.((x-y) + repeat([0, 360, 180, 0, 0, 0, 0, 360],6)))).^p)^(1/p)

The reason its complicated is that it’s a distance function where some of the variables are periodic and some of the variables are on the real number line. Otherwise it’s a weighted minkowski norm. Also, the periodicity is 360 in some cases and 180 in others.

I have data in CSV files that I want to load for various classes, and I do so like this:

> data = Dict(splitext(basename(f))[1] => CSV.File(f) for f = Glob.glob("*.csv", "./sample-data/"))

Dict{String,CSV.File{false}} with 12 entries:
  "class1" => CSV.File("./sample-data/class1.csv"):…
  "class2"   => CSV.File("./sample-data/class2.csv"):…
  "class3" => CSV.File("./sample-data/class3.csv"):…
 ... etc

So far so good…

now I want to do things like:

  • compute a distance matrix between two of the classes. I tried Distances.jl; didn’t see a way to run a custom distance function in the documentation.
  • compute a distance matrix for all the data (~2000 rows, 48 features; would be a 2000x2000 distance matrix) so I can see whether my distance function accurately recovers the classes (elements of one class are “closer” than elements of other classes)
  • try clustering (agglomerative or density based) with this distance function and validate that the clusters I find are sensible and do not contain mixed data. meaning that I have a strong idea of what “distance” means in this data set
  • some kind of “search” to find good weights for the distance function

I will be iterating on the distance function so speed for the distance matrix would be helpful! (I get that nonstandard distance = have to figure this out myself, not clear how though).

I’m getting hung up on simple things like:

julia> dist(ones(48))(data["class1"][2], data["class2"][1])
ERROR: MethodError: no method matching -(::CSV.Row, ::CSV.Row)
Closest candidates are:
  -(::ChainRulesCore.DoesNotExist, ::Any) at /Users/Vishesh/.julia/packages/ChainRulesCore/PUnER/src/differential_arithmetic.jl:25
  -(::ChainRulesCore.Zero, ::Any) at /Users/Vishesh/.julia/packages/ChainRulesCore/PUnER/src/differential_arithmetic.jl:65
  -(::DataValues.DataValue{T1}, ::T2) where {T1, T2} at /Users/Vishesh/.julia/packages/DataValues/N7oeL/src/scalar/operations.jl:65
  ...
Stacktrace:
 [1] (::var"#56#57"{Array{Float64,1},Int64})(::CSV.Row, ::CSV.Row) at ./REPL[291]:1
 [2] top-level scope at REPL[311]:1

what? Also, another test like:

julia> Distances.pairwise(Distances.Euclidean(), data["class1"])
ERROR: MethodError: no method matching pairwise(::Type{Distances.Euclidean}, ::CSV.File{false})
Stacktrace:
 [1] top-level scope at REPL[313]:1
julia> Distances.pairwise(Distances.Euclidean(), data["class1"] |> Tables.matrix) # works
49×49 Array{Float64,2}:
... numbers

When I try my distance function it doesn’t work:

julia> Distances.pairwise(dist(ones(48)), data["class1"] |> Tables.matrix)
ERROR: MethodError: no method matching pairwise(::var"#56#57"{Array{Float64,1},Int64}, Array{Float64, 2}; dims)
Stacktrace:
 [1] top-level scope at REPL[320]:1
julia> Distances.pairwise((x,y) -> 1, data["class1"] |> Tables.matrix)
ERROR: MethodError: no method matching pairwise(::var"#66#67", ::Array{Float64,2})
Closest candidates are:
  pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T; dims) at ...
  pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T, ::AbstractArray{T,2} where T; dims) at ...
 [1] top-level scope at REPL[325]:1

I figured I’m travelling the wrong path here and it would be smarter to ask:

  • do I need to write my own distance matrix function or is there something fast out there I can use
  • is my distance function optimized? is there some more optimal way to write it?

For distance matrix using Distances.jl, you just want R = pairwise(dist, X, Y, dims=2) where dist is your custom function.

julia> Distances.pairwise(dist(ones(48)), data["class1"], data["class2"], dims=2)
ERROR: MethodError: no method matching pairwise(::var"#56#57"{Array{Float64,1},Int64}, ::CSV.File{false}, ::CSV.File{false}; dims=2)
Stacktrace:
 [1] top-level scope at REPL[343]:1

julia> Distances.pairwise(dist(ones(48)), data["class1"] |> Tables.matrix, data["class2"] |> Tables.matrix, dims=2)
ERROR: MethodError: no method matching pairwise(::var"#56#57"{Array{Float64,1},Int64}, ::Array{Float64,2}, ::Array{Float64,2}; dims=2)
Closest candidates are:
  pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T, ::AbstractArray{T,2} where T; dims) 
  pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T; dims) 
Stacktrace:
 [1] top-level scope at REPL[344]:1

is what I got when I tried that.

you want your data as a Vector of Vectors rather than as a Matrix.

doesn’t seem to be it?

julia> Distances.pairwise((x,y) -> 1, map(collect, data["class1"]))
ERROR: MethodError: no method matching pairwise(::var"#76#77", ::Array{Array{Any,1},1})
Stacktrace:
 [1] top-level scope at REPL[383]:1

julia> Distances.pairwise((x,y) -> 1, map(collect,data["class1"]), map(collect, data["class1"]))
ERROR: MethodError: no method matching pairwise(::var"#78#79", ::Array{Array{Any,1},1}, ::Array{Array{Any,1},1})
Stacktrace:
 [1] top-level scope at REPL[384]:1

julia> Distances.pairwise((x,y) -> 1, map(collect, data["class1"]), map(collect, data["class1"]), dims=2)
ERROR: MethodError: no method matching pairwise(::var"#80#81", ::Array{Array{Any,1},1}, ::Array{Array{Any,1},1}; dims=2)
Stacktrace:
 [1] top-level scope at REPL[386]:100

pairwise takes (::Distances.PreMetric, ::AbstractArray{T,2})
I think the question is how to register a custom metric as a PreMetric so the function compiles, or alternatively how to write a version of pairiwise that dispatches on any function that takes two vectors

Here’s as far as I got, based on what I see in Distances.jl:

struct WeightedPeriodicMinkowski{T, W, P <: Real} <: Distances.UnionMetric 
    periods: T
    weights: W
    p: P
end

# no idea how to do this part - is this right?
Distances.parameters(wpm::WeightedPeriodicMinkowski) = zip(wpm.periods, wpm.weights)

@inline function Distances.eval_op(d::WeightedPeriodicMinkowski, ai, bi, (Ti, wi))
    # Taken from the PeriodicEuclidean function 
    s1 = abs(ai - bi)
    s2 = mod(s1, Ti)
    # taken from the WeightedMinkowski function
    abs(min(s2, Ti - s2))^d.p * wi
end
@inline Distances.eval_end(d::WeightedPeriodicMinkowski, s) = s^(1/d.p)
wpminkowski(a, b, w, T, p) = WeightedPeriodicMinkowski(T, w, p)(a, b)

When I try:

julia> wpminkowski([1,2], [3,4], [1,1], [2,2], 2)
ERROR: MethodError: objects of type WeightedPeriodicMinkowski{Array{Int64,1},Array{Int64,1},Int64} are not callable

So I’m very lost.

Here’s what I had to do to get this working (basically copy Distances.jl’s internals and make changes):
Any ideas on how this could be accomplished faster/easier, please tell me! I’m really surprised that there is no simpler way to do this… I got lucky that the metric was a type that I could cobble together from the ones presented in the Distances.jl documentation, but I’m still not sure what I will do if the Distance function I try later is not of this type - write this whole thing again?

struct WeightedPeriodicMinkowski{T, W, P <: Real} <: Distances.UnionMetric 
    weights:: W
    periods:: T
    p:: P
end

Distances.parameters(wpm::WeightedPeriodicMinkowski) = (wpm.periods, wpm.weights)

@inline function Distances.eval_op(d::WeightedPeriodicMinkowski, ai, bi, Ti, wi)
    # Taken from the PeriodicEuclidean function 
    s1 = abs(ai - bi)
    s2 = mod(s1, Ti)
    # taken from the WeightedMinkowski function
    abs(min(s2, Ti - s2))^d.p * wi
end
@inline Distances.eval_end(d::WeightedPeriodicMinkowski, s) = s^(1/d.p)
wpminkowski(a, b, w, T, p) = WeightedPeriodicMinkowski(T, w, p)(a, b)
(w::WeightedPeriodicMinkowski)(a,b) = Distances._evaluate(w, a, b)

Distances.result_type(dist::Distances.UnionMetrics, ::Type{Ta}, ::Type{Tb}, (p1, p2)) where {Ta,Tb} =
    typeof(Distances._evaluate(dist, oneunit(Ta), oneunit(Tb), oneunit(eltype(p1)), oneunit(eltype(p2))))

function Distances._evaluate(dist::Distances.UnionMetrics, a::Number, b::Number, p1::Number, p2::Number)
    Distances.eval_end(dist, Distances.eval_op(dist, a, b, p1, p2))
end 
    
Base.@propagate_inbounds function Distances._evaluate(d::Distances.UnionMetrics, a::AbstractArray, b::AbstractArray, (p1, p2)::Tuple{AbstractArray, AbstractArray})
    @boundscheck if length(a) != length(b)
        throw(DimensionMismatch("first array has length $(length(a)) which does not match the length of the second, $(length(b))."))
    end
    @boundscheck if length(a) != length(p1)
        throw(DimensionMismatch("arrays have length $(length(a)) but parameter 1 has length $(length(p1))."))
    end
    @boundscheck if length(a) != length(p2)
        throw(DimensionMismatch("arrays have length $(length(a)) but parameter 2 has length $(length(p2))."))
    end
    if length(a) == 0
        return zero(result_type(d, a, b))
    end
    @inbounds begin
        s = eval_start(d, a, b)
        if (IndexStyle(a, b, p1, p2) === IndexLinear() && eachindex(a) == eachindex(b) == eachindex(p1)) == eachindex(p2)||
                axes(a) == axes(b) == axes(p) == axes(p2)
            @simd for I in eachindex(a, b, p1, p2)
                ai = a[I]
                bi = b[I]
                p1i = p1[I]
                p2i = p2[I]
                s = eval_reduce(d, s, eval_op(d, ai, bi, p1i, p2i))
            end
        else
            for (ai, bi, p1i, p2i) in zip(a, b, p1, p2)
                s = eval_reduce(d, s, eval_op(d, ai, bi, p1i, p2i))
            end
        end
        return eval_end(d, s)
    end
end

Then I am finally able to write:
Distances.pairwise(WeightedPeriodicMinkowski(repeat([Inf, 360, 180, Inf, Inf, Inf, Inf, 360],6), ones(48), 2), data["class1"] |> Tables.matrix |> transpose) and get results.

1 Like