Optimize dot(::Vector{Float64}, ::Vector{ForwardDiff.Dual})

Tamas_Papp · January 27, 2020, 1:03pm

I identified a bottleneck in my code — calculating dot products of a vector of floats and some ForwardDiff.Duals. I have not been able to improve on the generic dot (which, of course, is not surprising in Julia, as the built-ins are usually efficient).

Here is an MWE in case someone has a suggestion on how to make it faster:

using ForwardDiff, BenchmarkTools

_dual(x) = ForwardDiff.Dual(x, x + 1, x + 2, x + 3, x + 4)
o = ones(69)
Do = _dual.(o)

function mydot(a, b)
    @assert !Base.has_offset_axes(a, b)
    len = length(a)
    @assert len == length(b)
    z = a[1] * b[1]
    for i in 2:len
        z += a[i] * b[i]
    end
    z
end

@btime dot(o, Do)
@btime mydot(o, Do) # doh

Vector lengths reflect typical sizes of my problem. I am using 1.4-rc1, but can use master or anything else if it helps.

kristoffer.carlsson · January 27, 2020, 1:30pm

Just trying something… The code below assumes that the vector length is a multiple of 4 but it is easy to add some extra code to the end if that is not the case.

function mydot(a, b)
    @assert !Base.has_offset_axes(a, b)
    len = length(a)
    @assert len == length(b)
    z1 = zero(a[1] * b[1])
    z2 = z1
    z3 = z1
    z4 = z1
    i = 1
    @inbounds while i < len 
        z1 += a[i] * b[i]
        z2 += a[i+1] * b[i+1]
        z3 += a[i+2] * b[i+2]
        z4 += a[i+3] * b[i+3]
        i += 4
    end
    return z1 + z2 + z3 + z4
end

julia> o = ones(4*16);

julia> Do = _dual.(o);

julia> @btime mydot(o, Do) # doh
  112.132 ns (1 allocation: 48 bytes)
Dual{Nothing}(64.0,128.0,192.0,256.0,320.0)

julia> @btime dot(o, Do) # doh
  144.006 ns (1 allocation: 48 bytes)
Dual{Nothing}(64.0,128.0,192.0,256.0,320.0)

Tamas_Papp · January 27, 2020, 2:51pm

The 4 is for SIMD?

This is very neat:

using ForwardDiff, BenchmarkTools, LinearAlgebra

_dual(x) = ForwardDiff.Dual(x, x + 1, x + 2, x + 3, x + 4)

function dot4(a, b, len_trunc)
    z1 = z2 = z3 = z4 = zero(a[1] * b[1])
    i = 1
    @inbounds while i < len_trunc
        z1 += a[i] * b[i]
        z2 += a[i+1] * b[i+1]
        z3 += a[i+2] * b[i+2]
        z4 += a[i+3] * b[i+3]
        i += 4
    end
    z1 + z2 + z3 + z4
end

function dot_tail(z, a, b, i, len)
    @inbounds while i < len
        i += 1
        z += a[i] * b[i]
    end
    z
end

function mydot(a, b)
    @assert !Base.has_offset_axes(a, b)
    len = length(a)
    @assert len == length(b)
    len_trunc = len - len % 4
    z = dot4(a, b, len_trunc)
    dot_tail(z, a, b, len_trunc, len)
end

o = randn(69)
Do = _dual.(o)

@btime dot(o, Do)               # 97ns
@btime mydot(o, Do)             # 69ns

kristoffer.carlsson · January 27, 2020, 2:56pm

It is basically to allow 4 accumulators to run “in parallel” since they have no dependencies on each other. Similar to what the @simd macro allows

Topic		Replies	Views
Calculating "triple" dot products Performance	14	307	January 22, 2025
Innefficient paralellization? Need some help optimizing a simple dot product Performance question , parallel	32	4745	March 28, 2018
Naive dot product faster in Fortran than in Juila Performance	12	1381	July 24, 2021
How to make Julia slow? Performance question	11	1988	September 16, 2019
Question: how to get dot products from size(x, n) matrix and length(n) vector Statistics	3	349	June 23, 2022

Optimize dot(::Vector{Float64}, ::Vector{ForwardDiff.Dual})

Related topics