dot(x, inv.(y)) vs a non-allocating alternative: why is the former faster?

While looking at some recent updates in Distributions I saw code like dot(w, inv.(x)) and thought it could be improved by avoiding the allocation of inv.(x); however, what I came up with was consistently slower and I'm not sure why.

julia> x = randn(100_000) .+ 100
julia> y = randn(100_000) .+ 100
julia> f(x, y) = dot(x, inv.(y))
julia> g(x, y) = sum(@inbounds x[i]/y[i] for i in eachindex(x))
julia> @btime f($x, $y)
  74.495 μs (2 allocations: 781.33 KiB)
100011.86751975697
julia> @btime g($x, $y)
  111.424 μs (2 allocations: 64 bytes)
100011.86751975746

Any idea why the second one is slower? Purely in terms of flops it should be about the same, shouldn't it?

dot is a BLAS call, so it is threaded (I see 4-core usage), but I don't see that dramatic a difference on my laptop.
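
For what it's worth, a quick way to test that hypothesis is to pin BLAS to a single thread and re-run f (a minimal sketch; BLAS.set_num_threads is the stock LinearAlgebra API):

using LinearAlgebra, BenchmarkTools

# Pin BLAS to one thread: if f then lands near g's time,
# the gap was BLAS threading rather than the algorithm.
BLAS.set_num_threads(1)
@btime f($x, $y)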

Locally I see

> @btime f($x, $y)
  78.900 μs (2 allocations: 781.33 KiB)
100006.09856166482

> @btime g($x, $y)
  81.599 μs (2 allocations: 64 bytes)
100006.09856166548

but just writing out the loop can get some speedup (@simd lets the compiler reassociate the floating-point sum, so the reduction can vectorize):

function h(x, y)
    @assert length(x) == length(y)
    # accumulator with the promoted type of x[i] / y[i]
    s = zero(one(eltype(x)) / one(eltype(y)))
    @inbounds @simd for i in 1:length(x)
        s += x[i] / y[i]
    end
    return s
end

> @btime h($x, $y)
  40.799 μs (0 allocations: 0 bytes)
100006.09856166487

Thanks, yes, I had thought of using @simd as well, but it was to replace a one-liner and seemed a bit overkill. (The factor-of-2 speedup is quite impressive, though.)
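
If it must stay a one-liner, a non-allocating sketch would be the following (g2 is a hypothetical name; this assumes a Julia version where mapreduce accepts two arrays):

# computes sum(x[i] / y[i]) without materializing a temporary array
g2(x, y) = mapreduce(/, +, x, y)

It won't necessarily vectorize as well as the explicit @simd loop, though.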

It does seem to be something to do with the laptop rather than the algorithm, though, as I can definitely reproduce a ~50-60% time difference, but others (as per your results) don't see the same gap. versioninfo() output below, FWIW.

Julia Version 1.2.0-DEV.376
Commit de48421024 (2019-02-25 22:13 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

If compute is relatively slow compared to memory bandwidth, then perhaps multithreading gives you a speedup.
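
One rough way to check whether the loop is memory-bound is to time a kernel that does almost no arithmetic per element over the same data (a sketch, reusing the x from above):

# nearly pure memory traffic: if this lands near half of h's time
# (h reads two arrays), the division loop is bandwidth-limited
@btime sum($x)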


Indeed, this problem is very amenable to parallelization:

function dot_threaded(x, y; N = Threads.nthreads())
    M = length(x)
    @assert M == length(y)
    z = one(eltype(x)) / one(eltype(y))
    acc = zeros(typeof(z), N)  # one partial sum per thread
    Threads.@threads for t = 1:N
        s = zero(z)
        # each thread reduces a contiguous chunk of roughly M/N elements
        @inbounds @simd for i = M*(t-1)÷N+1 : M*t÷N
            s += x[i] / y[i]
        end
        acc[t] = s
    end
    return sum(acc)
end

Performance:

julia> x, y = 1:2 .|> _ -> randn(100_000) .+ 100;

julia> @btime dot_inv($x, $y) # "f" from above: dot(x, inv.(y))
  74.980 μs (2 allocations: 781.33 KiB)
100007.82225468596

julia> @btime dot_simd($x, $y) # kristoffer's "h" from above
  57.328 μs (0 allocations: 0 bytes)
100007.822254686

julia> @btime dot_threaded($x, $y)
  15.452 μs (5 allocations: 240 bytes)
100007.82225468596

julia> 57.328 / 15.452
3.710069893864872

julia> Threads.nthreads()
4

With that said, I generally don't think it's a good idea to parallelize your innermost loops this way. It tends to lead to complex, bug-prone code and poor overall parallelization. I think parallelism is better applied at as coarse a level as possible.
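
As a sketch of what that coarser-grained structure can look like (dots_coarse is a hypothetical helper; assumes many independent (x, y) pairs, each handled by the serial h from above):

function dots_coarse(xs, ys)
    @assert length(xs) == length(ys)
    out = Vector{Float64}(undef, length(xs))  # assumes Float64 results
    # thread over whole problems; each task runs the serial @simd kernel h
    Threads.@threads for k in eachindex(xs)
        out[k] = h(xs[k], ys[k])
    end
    return out
end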
