I have a very simple code I’d like to optimize, and I’m not sure I am getting the expected results. For the sake of simplicity I am optimizing a loop which computes the dot product. I have implemented two functions, a serial dot product, `sdot`, and a parallel one, `pdot`, which attempts to use idiomatic reduction:

```julia
function sdot(n, x, y)
    a = 0 # This should be a = 0. for type stability
    @inbounds @fastmath @simd for i = 1:n
        a += x[i]*y[i]
    end
    a
end

function pdot(n, x, y)
    @fastmath @parallel (+) for i = 1:n
        x[i]*y[i]
    end
end
```

I have timed running these functions 3 times for `x` and `y` vectors of size `10000000` (full code here). I also compare them to the built-in dot product. These are my results:

| Function | `@time` |
|---|---|
| `sdot` | 0.878811 seconds (90.00 M allocations: 1.341 GiB, 4.44% gc time) |
| `sdot` (type stable) | 0.034074 seconds (3 allocations: 48 bytes) |
| `pdot` | 1.068616 seconds (1.63 k allocations: 139.344 KiB) |
| `dot` | 0.036030 seconds (3 allocations: 48 bytes) |

`dot` is clearly the fastest, which is alright given that it relies on `ddot` from BLAS. What puzzles me is that my parallel implementation is always slower than my serial implementation. I understand that there is some overhead in `@parallel`, but I wouldn’t imagine it would be so high. I have changed the values of `n` through several different orders of magnitude, but `pdot` always loses. Interestingly, I have coded similar versions in Fortran, and the `OMP REDUCE` version is always fastest. Those codes can be found here.

A gist to reproduce my results can be found here. Any help is appreciated!

EDIT: I have added a type-stable `sdot` as per the comments.
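
For reference, here is a minimal sketch of what the type-stable variant could look like, assuming the fix suggested in the comments is to initialize the accumulator with `zero(eltype(x))` instead of the integer literal `0` (the name `sdot_stable` is just for illustration):

```julia
# Type-stable serial dot product: the accumulator starts out with the
# element type of x (e.g. Float64), so `a` never changes type from Int
# to Float64 inside the loop, avoiding per-iteration boxing/allocations.
function sdot_stable(n, x, y)
    a = zero(eltype(x))
    @inbounds @fastmath @simd for i = 1:n
        a += x[i]*y[i]
    end
    a
end
```

With this change the loop body works on a single concrete type throughout, which is what brings the timing in line with the built-in `dot` in the table above.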