While looking at some recent updates in Distributions I saw code like this: `dot(w, inv.(x))`

and thought that better could be done by avoiding the allocation of `inv.(x)`.

However, what I came up with was always slower, and I'm not sure why.

```
julia> x = randn(100_000) .+ 100
julia> y = randn(100_000) .+ 100
julia> f(x, y) = dot(x, inv.(y))
julia> g(x, y) = sum(@inbounds x[i]/y[i] for i in eachindex(x))
julia> @btime f($x, $y)
74.495 μs (2 allocations: 781.33 KiB)
100011.86751975697
julia> @btime g($x, $y)
111.424 μs (2 allocations: 64 bytes)
100011.86751975746
```

Any idea why the second one is slower? Purely in terms of flops it should be about the same, shouldn't it?

`dot` is a BLAS call, so it is threaded (I see 4-core usage), but I don't see that dramatic a difference on my laptop.
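One way to check how much of the gap comes from BLAS threading is to pin BLAS to a single thread and re-time `f` (a quick sketch; `BLAS.set_num_threads` lives in the `LinearAlgebra` standard library):

```
using LinearAlgebra

x = randn(100_000) .+ 100
y = randn(100_000) .+ 100
f(x, y) = dot(x, inv.(y))

BLAS.set_num_threads(1)   # force single-threaded BLAS
f(x, y)                   # warm up, then re-time with @btime as before
```

If the single-threaded timing of `f` lands near `g`'s, threading explains most of the difference.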

Locally I see

```
> @btime f($x, $y)
78.900 μs (2 allocations: 781.33 KiB)
100006.09856166482
> @btime g($x, $y)
81.599 μs (2 allocations: 64 bytes)
100006.09856166548
```

but just writing out the loop can get some speedup:

```
function h(x, y)
    @assert length(x) == length(y)
    s = zero(one(eltype(x)) / one(eltype(y)))
    @inbounds @simd for i in 1:length(x)
        s += x[i] / y[i]
    end
    return s
end
> @btime h($x, $y)
40.799 μs (0 allocations: 0 bytes)
100006.09856166487
```


Thanks, yes, I had thought of using @simd as well, but it was to replace a one-liner and seemed a bit overkill. (The factor-2 speedup is quite impressive, though.)

It does seem to be something to do with the laptop rather than the algorithm, though, as I can definitely reproduce a ~50–60% time difference, but others (as per your results) don't seem to get the same difference. Versioninfo below, fwiw.

```
Julia Version 1.2.0-DEV.376
Commit de48421024 (2019-02-25 22:13 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
```

If compute speed is slow relative to memory bandwidth, then perhaps multithreading gives you a speedup.
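As a back-of-the-envelope check (my arithmetic, not from the thread): a single pass over the two `100_000`-element `Float64` arrays reads about 1.6 MB, so the ~40 μs timing of `h` above implies on the order of 40 GB/s of memory traffic, which is plausibly near a laptop's bandwidth limit:

```
# Rough implied-bandwidth estimate for one pass over both input arrays
n = 100_000
bytes_read = 2 * n * sizeof(Float64)    # two Float64 arrays → 1_600_000 bytes
t = 40.8e-6                             # seconds, the timing of h above
bandwidth_gbs = bytes_read / t / 1e9    # ≈ 39 GB/s
```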


Indeed, this problem is very amenable to parallelization:

```
function dot_threaded(x, y; N = Threads.nthreads())
    M = length(x)
    @assert M == length(y)
    z = one(eltype(x)) / one(eltype(y))
    acc = zeros(typeof(z), N)
    Threads.@threads for t = 1:N
        s = zero(z)
        @inbounds @simd for i = M*(t-1)÷N+1 : M*t÷N
            s += x[i] / y[i]
        end
        acc[t] = s
    end
    return sum(acc)
end
```

Performance:

```
julia> x, y = 1:2 .|> _ -> randn(100_000) .+ 100;
julia> @btime dot_inv($x, $y) # "f" from above: dot(x, inv.(y))
74.980 μs (2 allocations: 781.33 KiB)
100007.82225468596
julia> @btime dot_simd($x, $y) # kristoffer's "h" from above
57.328 μs (0 allocations: 0 bytes)
100007.822254686
julia> @btime dot_threaded($x, $y)
15.452 μs (5 allocations: 240 bytes)
100007.82225468596
julia> 57.328 / 15.452
3.710069893864872
julia> Threads.nthreads()
4
```

With that said, I generally don’t think it’s a good idea to parallelize your inner-most loops this way. It tends to lead to complex bug-fraught code, and poor overall parallelization. I think parallelization is better applied on as coarse of a level as possible.
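To illustrate what coarser-grained parallelism can look like (a hypothetical sketch, not code from the thread): keep the inner reduction serial and thread over independent outer work items instead:

```
# Hypothetical sketch: parallelize over independent outer tasks,
# keeping the inner dot-like kernel serial (and SIMD-friendly).
function dot_inv_serial(x, y)
    s = zero(one(eltype(x)) / one(eltype(y)))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] / y[i]
    end
    return s
end

function process_all(pairs)
    out = Vector{Float64}(undef, length(pairs))
    Threads.@threads for k in eachindex(pairs)
        x, y = pairs[k]
        out[k] = dot_inv_serial(x, y)   # serial kernel per work item
    end
    return out
end
```

Each thread then owns a whole work item, so there is no chunking arithmetic or per-thread accumulator array to get wrong.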
