I have a very simple piece of code I'd like to optimize, and I'm not sure I'm getting the expected results. For the sake of simplicity I am optimizing a loop that computes the dot product. I have implemented two functions: a serial dot product, `sdot`, and a parallel one, `pdot`, which attempts to use an idiomatic reduction:
```julia
function sdot(n, x, y)
    a = 0  # This should be a = 0. for type stability
    @inbounds @fastmath @simd for i = 1:n
        a += x[i]*y[i]
    end
    a
end
```
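The type-stable variant referred to in the comment (and in the EDIT below) just starts the accumulator from a float zero, so it isn't promoted from `Int` to `Float64` inside the loop. The name `sdot_stable` is mine; see the gist for the exact code:

```julia
# Type-stable variant: with a = 0 the accumulator starts as an Int and is
# promoted to Float64 on the first iteration; zero(eltype(x)) avoids that.
function sdot_stable(n, x, y)
    a = zero(eltype(x))
    @inbounds @fastmath @simd for i = 1:n
        a += x[i]*y[i]
    end
    a
end
```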
```julia
function pdot(n, x, y)
    @fastmath @parallel (+) for i = 1:n
        x[i]*y[i]
    end
end
```
I have timed running these functions 3 times for `x` and `y` vectors of size 10,000,000 (full code here). I also compare them to the built-in dot product. These are my results:
| Function | `@time` |
|---|---|
| `sdot` | 0.878811 seconds (90.00 M allocations: 1.341 GiB, 4.44% gc time) |
| `sdot` (type stable) | 0.034074 seconds (3 allocations: 48 bytes) |
| `pdot` | 1.068616 seconds (1.63 k allocations: 139.344 KiB) |
| `dot` | 0.036030 seconds (3 allocations: 48 bytes) |
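For completeness, the timing loop is roughly of this shape (a sketch; the exact script is in the linked gist, and the `-p 4` worker count is just an example):

```julia
# Rough shape of the benchmark. Start Julia with worker processes so that
# @parallel has something to distribute over, e.g. `julia -p 4`.
n = 10_000_000
x = rand(n)
y = rand(n)

sdot(n, x, y); pdot(n, x, y); dot(x, y)  # warm up to exclude compilation

for trial = 1:3
    @time sdot(n, x, y)
    @time pdot(n, x, y)
    @time dot(x, y)
end
```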
`dot` is clearly the fastest, which is alright given that it relies on `ddot` from BLAS. What puzzles me is that my parallel implementation is always slower than my serial implementation. I understand that there is some overhead in `@parallel`, but I wouldn't imagine it would be so high. I have varied `n` over several orders of magnitude, but `pdot` always loses. Interestingly, I have coded similar versions in Fortran, and the `OMP REDUCE` version is always the fastest. Those codes can be found here.
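For what it's worth, the closer Julia analogue to the OMP version would be a thread-based reduction rather than `@parallel`, which distributes work across separate worker processes. A rough sketch (not in my gist; requires starting Julia with `JULIA_NUM_THREADS` set):

```julia
# Shared-memory threaded dot product, closer in spirit to OMP REDUCE.
# One accumulator slot per thread avoids a data race on a single variable
# (though adjacent slots may still suffer some false sharing).
function tdot(n, x, y)
    acc = zeros(eltype(x), Threads.nthreads())
    Threads.@threads for i = 1:n
        @inbounds acc[Threads.threadid()] += x[i]*y[i]
    end
    sum(acc)
end
```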
A gist to reproduce my results can be found here. Any help is appreciated!
EDIT: I have added a type-stable `sdot` as per the comments.