dot(x, inv.(y)) vs. a non-allocating version: why is the former faster?


In looking at some recent updates in Distributions I saw code like dot(w, inv.(x)) and thought better could be done by avoiding the allocation of inv.(x); however, what I came up with was always slower and I'm not sure why.

julia> using LinearAlgebra, BenchmarkTools
julia> x = randn(100_000) .+ 100
julia> y = randn(100_000) .+ 100
julia> f(x, y) = dot(x, inv.(y))
julia> g(x, y) = sum(@inbounds x[i]/y[i] for i in eachindex(x))
julia> @btime f($x, $y)
  74.495 μs (2 allocations: 781.33 KiB)
julia> @btime g($x, $y)
  111.424 μs (2 allocations: 64 bytes)
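For comparison, another allocation-free formulation worth benchmarking (a sketch, not from the original post) is a multi-argument mapreduce, which fuses the division and the summation without building a temporary array:

```julia
using BenchmarkTools

# Allocation-free alternative to dot(x, inv.(y)):
# mapreduce(f, op, itrs...) applies f elementwise and reduces with op,
# so no intermediate inv.(y) array is materialized.
g2(x, y) = mapreduce(/, +, x, y)

x = randn(100_000) .+ 100
y = randn(100_000) .+ 100
@btime g2($x, $y)
```

Note that the multi-argument form of mapreduce requires Julia 1.2 or later, which matches the version used in this thread.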

Any idea why the second one is slower? Purely in terms of flops it should be about the same, shouldn't it?


dot is a BLAS call, so it is threaded (I see 4-core usage), but I don't see that dramatic a difference on my laptop.
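One way to check how much of the gap comes from BLAS threading (a sketch; BLAS is available via LinearAlgebra) is to pin BLAS to a single thread and re-run the benchmark:

```julia
using LinearAlgebra, BenchmarkTools

x = randn(100_000) .+ 100
y = randn(100_000) .+ 100

BLAS.set_num_threads(1)   # force single-threaded BLAS
@btime dot($x, inv.($y))

BLAS.set_num_threads(4)   # restore multithreading (4 cores assumed here)
@btime dot($x, inv.($y))
```

If the single-threaded timing comes close to the generator version, threading in BLAS explains most of the difference.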


Locally I see

> @btime f($x, $y)
  78.900 μs (2 allocations: 781.33 KiB)

> @btime g($x, $y)
  81.599 μs (2 allocations: 64 bytes)

but just writing out the loop can get some speedup:

function h(x, y)
    @assert length(x) == length(y)
    s = zero(one(eltype(x)) / one(eltype(y)))
    @inbounds @simd for i in 1:length(x)
        s += x[i] / y[i]
    end
    return s
end

> @btime h($x, $y)
  40.799 μs (0 allocations: 0 bytes)


Thanks, yes, I had thought of using @simd as well, but it was to replace a one-liner and it seemed a bit overkill. (The factor-2 speedup is quite impressive, though.)

It does seem it's something to do with the laptop and not the algorithm, though, as I can definitely reproduce a ~50-60% time difference, but others (as per your results) don't seem to get the same difference. versioninfo() output below, FWIW.

Julia Version 1.2.0-DEV.376
Commit de48421024 (2019-02-25 22:13 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


If compute is relatively slow compared to memory access, then perhaps multithreading gives you a speedup.


Indeed, this problem is very amenable to parallelization:

function dot_threaded(x, y; N = Threads.nthreads())
    M = length(x)
    @assert M == length(y)
    z = one(eltype(x)) / one(eltype(y))
    acc = zeros(typeof(z), N)
    Threads.@threads for t = 1:N
        s = zero(z)
        @inbounds @simd for i = M*(t-1)÷N+1 : M*t÷N
            s += x[i] / y[i]
        end
        acc[t] = s
    end
    return sum(acc)
end

julia> x, y = 1:2 .|> _ -> randn(100_000) .+ 100;

julia> @btime dot_inv($x, $y) # "f" from above: dot(x, inv.(y))
  74.980 μs (2 allocations: 781.33 KiB)

julia> @btime dot_simd($x, $y) # kristoffer's "h" from above
  57.328 μs (0 allocations: 0 bytes)

julia> @btime dot_threaded($x, $y)
  15.452 μs (5 allocations: 240 bytes)

julia> 57.328 / 15.452

julia> Threads.nthreads()

With that said, I generally don't think it's a good idea to parallelize your innermost loops this way. It tends to lead to complex, bug-prone code and poor overall parallelization. I think parallelization is better applied at as coarse a level as possible.
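For illustration, a coarse-grained version (a hypothetical example, not from the thread) threads over independent outer units of work, say the columns of a matrix, and keeps the inner reduction serial:

```julia
using Base.Threads

# Serial inner kernel: sum of x[i] / y[i], same reduction as above.
function dot_inv_serial(x, y)
    s = zero(one(eltype(x)) / one(eltype(y)))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] / y[i]
    end
    return s
end

# Coarse parallelism: one task per column, each running the serial kernel.
# Float64 output for simplicity; a generic version would infer the type.
function columnwise(X, Y)
    out = Vector{Float64}(undef, size(X, 2))
    Threads.@threads for j in axes(X, 2)
        out[j] = dot_inv_serial(view(X, :, j), view(Y, :, j))
    end
    return out
end
```

Each thread here owns a whole column, so there is no shared accumulator to coordinate and the serial kernel stays simple and SIMD-friendly.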