Does anyone see a way to speed things up, apart from the obvious options of switching to Float32 or using LoopVectorization? I feel as though I'm missing an opportunity for a speed-up somewhere.
I think the tullio version should become much faster if you also import LoopVectorization.
Also, := allocates a new array in tullio, so the second line of your calc2 function is redundant, I think.
I think pairwise from Distances internally calls a very efficient BLAS routine, which already runs in parallel by default, so for that specific computation it is hard to beat.
Unless you exploit the structure of your problem a bit further and, for example, avoid constructing Z1r at all (not sure if it's doable, just guessing here).
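One way to avoid materialising Z1r, as a sketch (not from the thread, and `calc_expanded` is a name I made up): use the standard expansion ‖x − y‖² = ‖x‖² + ‖y‖² − 2xᵀy, with the weights r folded into a Gram matrix, so the heavy lifting is a single BLAS-friendly matrix product:

```julia
using LinearAlgebra

# Sketch: weighted squared distances between the rows of Z1, without ever
# building Z1r = Z1' .* sqrt.(r). Uses the expansion
#   ||x - y||²_r = ||x||²_r + ||y||²_r - 2 xᵀ Diagonal(r) y
# so the heavy work is one matrix product.
function calc_expanded(r::Vector{Float64}, Z1::Matrix{Float64})
    G = Z1 * Diagonal(r) * Z1'   # weighted Gram matrix, size(Z1, 1) square
    d = diag(G)                  # d[i] = weighted squared norm of row i
    return d .+ d' .- 2 .* G
end
```

Two caveats: the expanded form can lose accuracy for nearly equal rows (catastrophic cancellation), and if I remember the Distances.jl API correctly, `pairwise(WeightedSqEuclidean(r), Z1', dims=2)` should compute the same quantity without the intermediate array, too.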
As @IlianPihlajamaa correctly points out, there’s a redundant allocation in calc2, and LoopVectorization is needed for best results. My times:
julia> function calc1(r::Array{Float64, 1}, Z1::Array{Float64, 2})
           Z1r = Z1' .* sqrt.(r)
           result = pairwise(SqEuclidean(), Z1r, dims=2)  # must be SqEuclidean()
           return result'
       end;
julia> function calc2(r::Array{Float64, 1}, Z1::Array{Float64, 2})
           @tullio Z1r[j,i] := Z1[i,j] * sqrt(r[j])
           @tullio result[j,i] := (Z1r[μ,i] - Z1r[μ,j])^2
       end;
julia> @btime calc1($r, $Z1);
22.542 μs (4 allocations: 94.80 KiB)
julia> @btime calc2($r, $Z1);
21.333 μs (28 allocations: 95.23 KiB)
julia> using LoopVectorization
julia> function calc2(r::Array{Float64, 1}, Z1::Array{Float64, 2})
           # the macro needs to run after LoopVectorization is loaded
           @tullio Z1r[j,i] := Z1[i,j] * sqrt(r[j])
           @tullio result[j,i] := (Z1r[μ,i] - Z1r[μ,j])^2
       end;
julia> @btime calc2($r, $Z1);
13.167 μs (27 allocations: 95.20 KiB)
julia> calc1(r, Z1) ≈ calc2(r, Z1)
true
julia> norm(calc1(r, Z1) .- calc2(r, Z1))
1.1767515652473147e-13
julia> ncols = 5_000; # try a larger size too
julia> r = rand(ncols);
julia> Z1 = rand(100, ncols);
julia> @btime calc1($r, $Z1);
2.009 ms (5 allocations: 3.89 MiB)
julia> @btime calc2($r, $Z1);
2.230 ms (98 allocations: 3.90 MiB)
Small numerical errors are pretty much unavoidable. I see Distances.jl now accepts a threshold argument, e.g. SqEuclidean(1e-12), which controls when very small distances get re-calculated by a more accurate method.
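To see where those tiny errors come from, here is a minimal self-contained illustration (not from the thread) of the cancellation that hits the expanded formula ‖x‖² + ‖y‖² − 2xᵀy when two points are nearly equal; this is exactly the case the threshold is meant to catch:

```julia
using LinearAlgebra

x = [1.0, 2.0]
y = x .+ 1e-9                       # nearly identical points

direct   = sum(abs2, x .- y)        # about 2e-18, computed accurately

# Expanded form: subtracts numbers near 10 to get a result near 2e-18,
# which typically loses essentially all significant digits.
expanded = sum(abs2, x) + sum(abs2, y) - 2 * dot(x, y)
```

A threshold like SqEuclidean(1e-12) tells the library to fall back to the direct subtract-then-square formula whenever the expanded result is suspiciously small.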
Great answers, all! I had not noticed the redundant calculation. Thanks @IlianPihlajamaa and @mcabbott!
Alas, this is trying to match an R function, so I'm stuck with it. However, I've been exploring a one-line answer that moves the r multiplication inside the distance computation. The reason: there will be another version with a Z2 and a corresponding Z2r, so if I could multiply by r only once, that would save time. So I was playing with this:
function calc3(r::Array{Float64, 1}, Z1::Array{Float64, 2})
    @tullio result[i,j] := (sqrt(r[μ]) * (Z1[i,μ] - Z1[j,u]))^2
end
But the answers come out very wrong, and this version is quite slow. I think I'm making an algebra mistake, but I can't see it. Otherwise, I'm wondering why this is slower.
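On the "multiply by r only once" goal: a small sketch (hypothetical helper name, assuming a second matrix Z2 with the same number of columns as Z1) that hoists the sqrt out so it is computed a single time and reused for both scaled matrices:

```julia
# Hoist the square root so it runs once over r and is reused for both matrices.
function scaled_pair(r::Vector{Float64}, Z1::Matrix{Float64}, Z2::Matrix{Float64})
    sr = sqrt.(r)          # one sqrt pass over r
    Z1r = Z1' .* sr        # same scaling as in calc1
    Z2r = Z2' .* sr
    return Z1r, Z2r
end
```

Both results can then be fed to pairwise as before; the sqrt cost is paid once instead of once per matrix.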
No, it's the squared Euclidean distance I want; I updated the title to reflect that just now. The solutions above all work, except calc3, which is wrong somewhere, but I can't see where.
I think this is the problem with calc3: it independently sums over the free index u, which is both wrong and introduces another nested loop, hence the slowness. You should be able to skip the sqrt too:
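A corrected version might look like the following (written as plain loops here so it runs without Tullio; the Tullio one-liner would be `@tullio result[i,j] := r[μ] * (Z1[i,μ] - Z1[j,μ])^2`, with μ in both slots and `sqrt(r[μ])^2` simplified to `r[μ]`):

```julia
# Corrected calc3 as explicit loops: sum over a single index μ (no stray u),
# and use r[μ] directly, since sqrt(r[μ])^2 == r[μ].
function calc3_fixed(r::Vector{Float64}, Z1::Matrix{Float64})
    n, m = size(Z1)
    result = zeros(n, n)
    for j in 1:n, i in 1:n
        s = 0.0
        for μ in 1:m
            s += r[μ] * (Z1[i, μ] - Z1[j, μ])^2
        end
        result[i, j] = s
    end
    return result
end
```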
The difference is greek mu μ vs latin u. Hard to spot in some fonts!
It's a bit of a footgun with the "automatically sum over free indices" rule. Requiring you to spell out what's reduced over might be safer, if less concise:
julia> using TensorCast
julia> @reduce result[i,j] := sum(μ) (sqrt(r[μ]) * (Z1[i,μ] - Z1[j,u]))^2
ERROR: LoadError: can't find index u on the left