Hi, I’m new to Julia and trying to figure out the best way to compute some distance matrices and evaluate how good my clustering is.
I have a custom distance function like this:
dist(w::Array{<:Real,1}, p::Int64 = 2) = (x, y) -> sum(w .* min.(abs.(x - y), abs.((x - y) + repeat([0, 360, 180, 0, 0, 0, 0, 360], 6))) .^ p)^(1/p)
The reason it’s complicated is that some of the variables are periodic and some lie on the real number line. Otherwise it’s a weighted Minkowski norm. Also, the period is 360 in some cases and 180 in others.
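To make the intent of that one-liner easier to read, here is a minimal sketch of the same idea written as small named functions. The names `PERIODS`, `coord_diff` and `periodic_minkowski` are made up for illustration. Note one assumption: this version wraps periodic differences symmetrically (the shorter way around the circle), which is not identical to the expression above, since that one only adds the period in one direction.

```julia
# Period of each of the 48 features; 0 marks a non-periodic (real-line) feature.
# The pattern is copied from the post: 8 entries repeated 6 times.
const PERIODS = repeat([0, 360, 180, 0, 0, 0, 0, 360], 6)

# Difference along one coordinate. For a periodic feature, take the shorter
# way around the circle (assumption: symmetric wrap-around is what is wanted).
function coord_diff(a::Real, b::Real, period::Real)
    period > 0 || return abs(a - b)
    d = mod(abs(a - b), period)
    return min(d, period - d)
end

# Weighted Minkowski distance: (sum_i w[i] * d_i^p)^(1/p),
# accumulated in a loop so no temporary arrays are allocated.
function periodic_minkowski(x, y, w, periods; p::Real = 2)
    s = 0.0
    for i in eachindex(x, y, w, periods)
        s += w[i] * coord_diff(x[i], y[i], periods[i])^p
    end
    return s^(1 / p)
end
```

With this layout, `periodic_minkowski(x, y, ones(48), PERIODS)` plays the role of `dist(ones(48))(x, y)`, provided `x` and `y` are plain numeric vectors of length 48.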
I have data in CSV files that I want to load for various classes, and I do so like this:
julia> data = Dict(splitext(basename(f))[1] => CSV.File(f) for f = Glob.glob("*.csv", "./sample-data/"))
Dict{String,CSV.File{false}} with 12 entries:
"class1" => CSV.File("./sample-data/class1.csv"):…
"class2" => CSV.File("./sample-data/class2.csv"):…
"class3" => CSV.File("./sample-data/class3.csv"):…
... etc
So far so good…
Now I want to do things like:
- compute a distance matrix between two of the classes. I tried Distances.jl but didn’t see a way to run a custom distance function in the documentation.
- compute a distance matrix for all the data (~2000 rows, 48 features; would be a 2000x2000 distance matrix) so I can see whether my distance function accurately recovers the classes (elements of one class are “closer” than elements of other classes)
- try clustering (agglomerative or density-based) with this distance function and validate that the clusters I find are sensible and do not contain mixed data, which would tell me I have a good idea of what “distance” means in this data set
- some kind of “search” to find good weights for the distance function
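One simple way to check whether a distance "recovers the classes" is to compare the mean within-class distance to the mean between-class distance. The sketch below assumes a precomputed distance matrix `D` and a vector `labels` with one class label per observation; the function name `class_separation` is hypothetical and the data is a hand-made toy example, not the CSV data.

```julia
# Mean within-class and between-class distance from a precomputed matrix D,
# where labels[i] is the class of observation i. Each unordered pair is
# counted once (lower triangle only).
function class_separation(D::AbstractMatrix, labels::AbstractVector)
    within, between = 0.0, 0.0
    nw, nb = 0, 0
    n = length(labels)
    for j in 1:n, i in (j + 1):n
        if labels[i] == labels[j]
            within += D[i, j]
            nw += 1
        else
            between += D[i, j]
            nb += 1
        end
    end
    return (within = within / nw, between = between / nb)
end

# Toy example: two tight classes that are far apart from each other.
D = [0.0 1.0 9.0 8.0;
     1.0 0.0 8.0 9.0;
     9.0 8.0 0.0 2.0;
     8.0 9.0 2.0 0.0]
labels = ["class1", "class1", "class2", "class2"]
s = class_separation(D, labels)
```

A `within` value much smaller than `between` suggests the distance separates the classes; the same number could serve as a crude objective when searching over weights.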
I will be iterating on the distance function, so a fast distance matrix computation would be helpful! (I get that a nonstandard distance means I have to figure this out myself, but it’s not clear how.)
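For reference, a hand-rolled distance matrix needs only plain loops in Julia, and loops are typically fast there. This is a minimal sketch, not a library API: `pairwise_matrix` and `euclid` are illustrative names, the data is random, and it assumes one observation per column (a rows-as-observations matrix would need to be transposed first) and a symmetric distance function.

```julia
# Build a symmetric n×n distance matrix from any function dist(x, y).
# X holds one observation per column (features × observations).
function pairwise_matrix(dist, X::AbstractMatrix)
    n = size(X, 2)
    D = zeros(Float64, n, n)
    for j in 1:n
        xj = view(X, :, j)          # views avoid copying each column
        for i in (j + 1):n
            d = dist(view(X, :, i), xj)
            D[i, j] = d
            D[j, i] = d             # assumes dist is symmetric
        end
    end
    return D
end

# Toy usage with a plain Euclidean distance on random data.
euclid(x, y) = sqrt(sum(abs2, x .- y))
X = rand(48, 5)
D = pairwise_matrix(euclid, X)
```

Only the lower triangle is evaluated, so a 2000-observation matrix takes about two million calls to `dist` rather than four million.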
I’m getting hung up on simple things like:
julia> dist(ones(48))(data["class1"][2], data["class2"][1])
ERROR: MethodError: no method matching -(::CSV.Row, ::CSV.Row)
Closest candidates are:
-(::ChainRulesCore.DoesNotExist, ::Any) at /Users/Vishesh/.julia/packages/ChainRulesCore/PUnER/src/differential_arithmetic.jl:25
-(::ChainRulesCore.Zero, ::Any) at /Users/Vishesh/.julia/packages/ChainRulesCore/PUnER/src/differential_arithmetic.jl:65
-(::DataValues.DataValue{T1}, ::T2) where {T1, T2} at /Users/Vishesh/.julia/packages/DataValues/N7oeL/src/scalar/operations.jl:65
...
Stacktrace:
[1] (::var"#56#57"{Array{Float64,1},Int64})(::CSV.Row, ::CSV.Row) at ./REPL[291]:1
[2] top-level scope at REPL[311]:1
What? Here is another test:
julia> Distances.pairwise(Distances.Euclidean(), data["class1"])
ERROR: MethodError: no method matching pairwise(::Type{Distances.Euclidean}, ::CSV.File{false})
Stacktrace:
[1] top-level scope at REPL[313]:1
julia> Distances.pairwise(Distances.Euclidean(), data["class1"] |> Tables.matrix) # works
49×49 Array{Float64,2}:
... numbers
When I try my distance function it doesn’t work:
julia> Distances.pairwise(dist(ones(48)), data["class1"] |> Tables.matrix)
ERROR: MethodError: no method matching pairwise(::var"#56#57"{Array{Float64,1},Int64}, Array{Float64, 2}; dims)
Stacktrace:
[1] top-level scope at REPL[320]:1
julia> Distances.pairwise((x,y) -> 1, data["class1"] |> Tables.matrix)
ERROR: MethodError: no method matching pairwise(::var"#66#67", ::Array{Float64,2})
Closest candidates are:
pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T; dims) at ...
pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T, ::AbstractArray{T,2} where T; dims) at ...
[1] top-level scope at REPL[325]:1
I figured I’m travelling the wrong path here and it would be smarter to ask:
- do I need to write my own distance matrix function, or is there something fast out there I can use?
- is my distance function reasonably efficient, or is there a better way to write it?