Optimize dot(::Vector{Float64}, ::Vector{ForwardDiff.Dual})

I identified a bottleneck in my code — calculating dot products of a vector of floats and some ForwardDiff.Duals. I have not been able to improve on the generic dot (which, of course, is not surprising in Julia, as the built-ins are usually efficient).

Here is an MWE in case someone has a suggestion on how to make it faster:

using ForwardDiff, BenchmarkTools

_dual(x) = ForwardDiff.Dual(x, x + 1, x + 2, x + 3, x + 4)
o = ones(69)
Do = _dual.(o)

function mydot(a, b)
    @assert !Base.has_offset_axes(a, b)
    len = length(a)
    @assert len == length(b)
    z = a[1] * b[1]
    for i in 2:len
        z += a[i] * b[i]
    end
    z
end

@btime dot(o, Do)
@btime mydot(o, Do) # doh

Vector lengths reflect typical sizes of my problem. I am using 1.4-rc1, but can use master or anything else if it helps.

Just trying something… The code below assumes that the vector length is a multiple of 4 but it is easy to add some extra code to the end if that is not the case.

function mydot(a, b)
    @assert !Base.has_offset_axes(a, b)
    len = length(a)
    @assert len == length(b)
    z1 = zero(a[1] * b[1])
    z2 = z1
    z3 = z1
    z4 = z1
    i = 1
    @inbounds while i < len 
        z1 += a[i] * b[i]
        z2 += a[i+1] * b[i+1]
        z3 += a[i+2] * b[i+2]
        z4 += a[i+3] * b[i+3]
        i += 4
    end
    return z1 + z2 + z3 + z4
end
julia> o = ones(4*16);

julia> Do = _dual.(o);

julia> @btime mydot(o, Do) # doh
  112.132 ns (1 allocation: 48 bytes)
Dual{Nothing}(64.0,128.0,192.0,256.0,320.0)

julia> @btime dot(o, Do) # doh
  144.006 ns (1 allocation: 48 bytes)
Dual{Nothing}(64.0,128.0,192.0,256.0,320.0)
1 Like

The 4 is for SIMD?

This is very neat:

using ForwardDiff, BenchmarkTools, LinearAlgebra

_dual(x) = ForwardDiff.Dual(x, x + 1, x + 2, x + 3, x + 4)

function dot4(a, b, len_trunc)
    z1 = z2 = z3 = z4 = zero(a[1] * b[1])
    i = 1
    @inbounds while i < len_trunc
        z1 += a[i] * b[i]
        z2 += a[i+1] * b[i+1]
        z3 += a[i+2] * b[i+2]
        z4 += a[i+3] * b[i+3]
        i += 4
    end
    z1 + z2 + z3 + z4
end

function dot_tail(z, a, b, i, len)
    @inbounds while i < len
        i += 1
        z += a[i] * b[i]
    end
    z
end

function mydot(a, b)
    @assert !Base.has_offset_axes(a, b)
    len = length(a)
    @assert len == length(b)
    len_trunc = len - len % 4
    z = dot4(a, b, len_trunc)
    dot_tail(z, a, b, len_trunc, len)
end

o = randn(69)
Do = _dual.(o)

@btime dot(o, Do)               # 97ns
@btime mydot(o, Do)             # 69ns

It is basically to allow 4 accumulators to run “in parallel” since they have no dependencies on each other. Similar to what the @simd macro allows

1 Like