Status of Distances.jl


#1

Is the package supposed to work on Julia v1? I see a commit about support for v0.7 but I get an error:

julia> movies[1,:]
9-element Array{Any,1}:
  "Moonlight (2016)"
 0
 0
 0
 1
 0
 0
 0
 0

julia> nemo
9-element Array{Any,1}:
  "Finding Nemo (2003)"
 0
 1
 1
 0
 1
 0
 0
 0

julia> hamming(movies[1,:], nemo)
ERROR: MethodError: no method matching one(::Type{Any})
Closest candidates are:
  one(::Type{Union{Missing, T}}) where T at missing.jl:83
  one(::Missing) at missing.jl:79
  one(::BitArray{2}) at bitarray.jl:392
  ...
Stacktrace:
 [1] one(::Type{Any}) at ./missing.jl:83
 [2] result_type(::Hamming, ::Array{Any,1}, ::Array{Any,1}) at /Users/adrian/.julia/packages/Distances/nLAdT/src/metrics.jl:194
 [3] eval_start(::Hamming, ::Array{Any,1}, ::Array{Any,1}) at /Users/adrian/.julia/packages/Distances/nLAdT/src/metrics.jl:196
 [4] evaluate at /Users/adrian/.julia/packages/Distances/nLAdT/src/metrics.jl:159 [inlined]
 [5] hamming(::Array{Any,1}, ::Array{Any,1}) at /Users/adrian/.julia/packages/Distances/nLAdT/src/metrics.jl:240
 [6] top-level scope at none:0

#2

Distances.jl works fine on 1.0. I think this is simply a bug; you can work around it by defining:

julia> Distances.result_type(::Hamming, ::AbstractArray{T1,N} where {T1,N}, ::AbstractArray{T2,N} where {T2,N})=Int

In reality, Hamming probably wants a separate implementation anyway. Desired handling of NaN and missing is not entirely obvious though.

julia> using Random, BenchmarkTools, Distances
julia> a=bitrand(10^5); b = bitrand(10^5);
julia> @btime evaluate($Hamming(), $a, $b);
  154.316 μs (1 allocation: 16 bytes)
julia> _fhamming(a,b) = count(a.==b);  # note: this counts matching positions; the distance proper is count(a .!= b)
julia> @btime _fhamming($a, $b);
  2.351 μs (2 allocations: 12.41 KiB)
julia> _fhamming(a,b) = count(isequal.(a,b));
julia> @btime _fhamming($a, $b);
  328.005 μs (3 allocations: 16.59 KiB)
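
To make the NaN/missing question concrete, here is a minimal sketch in plain Base Julia (no Distances.jl involved). The helper names `hamming_neq` and `hamming_isequal` are made up for illustration and are not part of any package. The two comparisons give different answers as soon as a NaN shows up, and `!=` returns `missing` rather than a Bool when either side is missing.

```julia
# Hypothetical helpers, for illustration only (not part of Distances.jl).
# Count positions that differ according to `!=` (IEEE semantics: NaN != NaN is true).
hamming_neq(a, b) = count(x != y for (x, y) in zip(a, b))

# Count positions that differ according to `isequal` (NaN and missing equal themselves).
hamming_isequal(a, b) = count(!isequal(x, y) for (x, y) in zip(a, b))

a = [1.0, NaN, 3.0]
b = [1.0, NaN, 4.0]

d_neq = hamming_neq(a, b)          # the NaN position counts as a mismatch
d_isequal = hamming_isequal(a, b)  # the NaN position counts as a match

# With missing values, `!=` propagates missing instead of returning a Bool,
# so only the isequal-based variant can be counted directly.
m1 = [1, missing, 3]
m2 = [1, missing, 4]
neq_is_missing = ismissing(missing != 1)
d_missing = hamming_isequal(m1, m2)
```

Here `d_neq` is 2 and `d_isequal` is 1, so whichever comparison a dedicated implementation picks is a real semantic decision and not just a performance detail.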

#3

Thank you, this pointed me in the right direction. It doesn’t like the element type of the arrays, Array{Any,1} (this used to work fine in 0.6).

Example:

julia> hamming(x,y)
ERROR: MethodError: no method matching one(::Type{Any})
Closest candidates are:
  one(::Type{Union{Missing, T}}) where T at missing.jl:83
  one(::Missing) at missing.jl:79
  one(::BitArray{2}) at bitarray.jl:392
  ...
Stacktrace:
 [1] one(::Type{Any}) at ./missing.jl:83
 [2] result_type(::Hamming, ::Array{Any,1}, ::Array{Any,1}) at /Users/adrian/.julia/dev/Distances/src/metrics.jl:194
 [3] eval_start(::Hamming, ::Array{Any,1}, ::Array{Any,1}) at /Users/adrian/.julia/dev/Distances/src/metrics.jl:196
 [4] evaluate at /Users/adrian/.julia/dev/Distances/src/metrics.jl:159 [inlined]
 [5] hamming(::Array{Any,1}, ::Array{Any,1}) at /Users/adrian/.julia/dev/Distances/src/metrics.jl:240
 [6] top-level scope at none:0

julia> hamming(Int[x...],Int[y...])
4
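
For anyone landing here later, a minimal sketch of the same idea without the splatting, in plain Base Julia. It assumes that x and y above are the genre flags of the two rows (everything after the title); `naive_hamming` is a made-up helper used only to check the count, not something from Distances.jl.

```julia
# Rows as shown in the first post: a title followed by 0/1 genre flags.
row_a = Any["Moonlight (2016)", 0, 0, 0, 1, 0, 0, 0, 0]
row_b = Any["Finding Nemo (2003)", 0, 1, 1, 0, 1, 0, 0, 0]

# Drop the title and narrow the element type from Any to Int.
flags_a = convert(Vector{Int}, row_a[2:end])
flags_b = convert(Vector{Int}, row_b[2:end])

# one(T) is defined for concrete numeric types such as Int ...
one_int = one(eltype(flags_a))

# ... but is expected to throw for Any, as in the stack trace above.
one_any_fails = try
    one(Any)
    false
catch
    true
end

# Hypothetical reference implementation: count the positions that differ.
naive_hamming(a, b) = count(x != y for (x, y) in zip(a, b))
d = naive_hamming(flags_a, flags_b)
```

With Distances loaded, `hamming(flags_a, flags_b)` should accept these vectors because their element type is now Int; the naive count gives 4 for this pair, the same as the result above.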

#4

But really, the above is not a reliable fix.

I’m not sure about the intended semantics for missing and NaN, but the above fix will run into trouble with missing values. The BitArray variant (count over a broadcasted ==) is also bad; the real solution is to implement a specialization for BitArray (because that’s the main use for Hamming distance).
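
A rough sketch of what such a specialization could look like, under stated assumptions: the function names are hypothetical, this is not the code in Distances.jl, and the second variant reads the internal `chunks` field of BitArray, which is an implementation detail and not public API.

```julia
# Hypothetical names, for illustration only.

# Variant 1: public API only. xor the two bit vectors, then count the set bits.
# Allocates a temporary BitVector for the xor result.
bit_hamming(a::BitVector, b::BitVector) = count(a .⊻ b)

# Variant 2: no temporary. Works on the packed UInt64 words directly.
# ASSUMPTION: relies on the internal `chunks` field and on unused trailing
# bits of the last word being zero.
function bit_hamming_chunks(a::BitVector, b::BitVector)
    length(a) == length(b) || throw(DimensionMismatch("lengths must match"))
    d = 0
    @inbounds for i in eachindex(a.chunks)
        d += count_ones(a.chunks[i] ⊻ b.chunks[i])
    end
    return d
end

a = BitVector([1, 0, 1, 1, 0, 0, 1, 0])
b = BitVector([1, 1, 0, 1, 0, 1, 1, 0])

d_public = bit_hamming(a, b)
d_chunks = bit_hamming_chunks(a, b)

# Element-by-element reference count to compare against.
d_reference = count(x != y for (x, y) in zip(a, b))
```

Both variants process 64 positions per machine word instead of comparing element by element, which is presumably where most of the speedup over the generic loop would come from.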

Can you open an issue for this?


#5

Sure!


#6

I’m afraid I don’t understand the internals so feel free to expand on this if necessary:


#7

See https://github.com/JuliaStats/Distances.jl/issues/114#issuecomment-433129061 for why this worked on 0.6 and not on 0.7 (spoiler: an optimization bug in 0.6).