# Optimize dot(::Vector{Float64}, ::Vector{ForwardDiff.Dual})

I identified a bottleneck in my code — calculating dot products of a vector of floats and some `ForwardDiff.Dual`s. I have not been able to improve on the generic `dot` (which, of course, is not surprising in Julia, as the built-ins are usually efficient).

Here is an MWE in case someone has a suggestion on how to make it faster:

``````using ForwardDiff, BenchmarkTools

_dual(x) = ForwardDiff.Dual(x, x + 1, x + 2, x + 3, x + 4)
o = ones(69)
Do = _dual.(o)

function mydot(a, b)
@assert !Base.has_offset_axes(a, b)
len = length(a)
@assert len == length(b)
z = a[1] * b[1]
for i in 2:len
z += a[i] * b[i]
end
z
end

@btime dot(o, Do)
@btime mydot(o, Do) # doh
``````

Vector lengths reflect typical sizes of my problem. I am using 1.4-rc1, but can use `master` or anything else if it helps.

Just trying something… The code below assumes that the vector length is a multiple of 4 but it is easy to add some extra code to the end if that is not the case.

``````function mydot(a, b)
@assert !Base.has_offset_axes(a, b)
len = length(a)
@assert len == length(b)
z1 = zero(a[1] * b[1])
z2 = z1
z3 = z1
z4 = z1
i = 1
@inbounds while i < len
z1 += a[i] * b[i]
z2 += a[i+1] * b[i+1]
z3 += a[i+2] * b[i+2]
z4 += a[i+3] * b[i+3]
i += 4
end
return z1 + z2 + z3 + z4
end
``````
``````julia> o = ones(4*16);

julia> Do = _dual.(o);

julia> @btime mydot(o, Do) # doh
112.132 ns (1 allocation: 48 bytes)
Dual{Nothing}(64.0,128.0,192.0,256.0,320.0)

julia> @btime dot(o, Do) # doh
144.006 ns (1 allocation: 48 bytes)
Dual{Nothing}(64.0,128.0,192.0,256.0,320.0)
``````
1 Like

The 4 is for SIMD?

This is very neat:

``````using ForwardDiff, BenchmarkTools, LinearAlgebra

_dual(x) = ForwardDiff.Dual(x, x + 1, x + 2, x + 3, x + 4)

function dot4(a, b, len_trunc)
z1 = z2 = z3 = z4 = zero(a[1] * b[1])
i = 1
@inbounds while i < len_trunc
z1 += a[i] * b[i]
z2 += a[i+1] * b[i+1]
z3 += a[i+2] * b[i+2]
z4 += a[i+3] * b[i+3]
i += 4
end
z1 + z2 + z3 + z4
end

function dot_tail(z, a, b, i, len)
@inbounds while i < len
i += 1
z += a[i] * b[i]
end
z
end

function mydot(a, b)
@assert !Base.has_offset_axes(a, b)
len = length(a)
@assert len == length(b)
len_trunc = len - len % 4
z = dot4(a, b, len_trunc)
dot_tail(z, a, b, len_trunc, len)
end

o = randn(69)
Do = _dual.(o)

@btime dot(o, Do)               # 97ns
@btime mydot(o, Do)             # 69ns
``````

It is basically to allow 4 accumulators to run “in parallel” since they have no dependencies on each other. Similar to what the `@simd` macro allows

1 Like