# Efficient implementation of a convolution within a auto-differentiable objective function

Hi all,

I have an optimization problem for which I am using JuMP with Ipopt at the moment. The objective function involves 7 convolutions of two vectors. These convolutions are at the moment the bottleneck in performance, so I’d like to improve that situation and I hope, you guys can help me.

The first vector, let’s call it signal, has a length of 30000 elements, and the kernel has a length of about 500. For 6 of the 7 convolutions, the kernel is the same (but changes from iteration to iteration, as this is one of the decision variables for the underlying optimization problem).
Benchmarking some implementations of the convolution, I noticed that for the given sizes of signal and kernel, the DSP.conv method (which I guess uses FFT) is by far the fastest, but unfortunately does not work with ForwardDiff. FastConv.jl does work, but is much slower than DSP.conv. I tried playing around with LoopVectorizations.jl, but this also fails as it does not work with Dual Numbers.

Here is a MWE. Note that I included the relevant functions of FastConv.jl directly, as it is not available in the registry:

using ForwardDiff
using DSP
using Base.Cartesian
using BenchmarkTools

# direct version (do not check if threshold is satisfied)
@generated function fastconv(E::Array{T,N}, k::Array{T,N}) where {T,N}
quote

retsize = [size(E)...] + [size(k)...] .- 1
retsize = tuple(retsize...)
ret = zeros(T, retsize)

convn!(ret,E,k)

return ret

end
end

# in place helper operation to speedup memory allocations
@generated function convn!(out::Array{T}, E::Array{T,N}, k::Array{T,N}) where {T,N}
quote
@inbounds begin
@nloops $N x E begin @nloops$N i k begin
(@nref $N out d->(x_d + i_d - 1)) += (@nref$N E x) * (@nref $N k i) end end end return out end end function objectiveWithDirectConvolution(x::T...) where {T<:Real} sig = convert(Vector{T},rand(30000)) kern = convert(Vector{T},rand(500)) out = fastconv(sig,kern) out = sum(out) return out end function objectiveWithDSPConvolution(x::T...) where {T<:Real} sig = convert(Vector{T},rand(30000)) kern = convert(Vector{T},rand(500)) out = DSP.conv(sig,kern) return sum(out) end print("Evaluation, DirectConvolution: ") @btime objectiveWithDirectConvolution(1.,1.) print("Evaluation, DSP.conv: ") @btime objectiveWithDSPConvolution(1.,1.) print("Gradient, DirectConvolution: ") @btime ForwardDiff.gradient(x -> objectiveWithDirectConvolution(x...),[5. 1.]) ForwardDiff.gradient(x -> objectiveWithDSPConvolution(x...),[5. 1.])  The output is as follows: Evaluation, DirectConvolution: 1.854 ms (12 allocations: 477.14 KiB) Evaluation, DSP.conv: 285.000 μs (68 allocations: 576.98 KiB) Gradient, DirectConvolution: 18.745 ms (26 allocations: 1.63 MiB) ERROR: StackOverflowError: Stacktrace:  conv(u::Vector{ForwardDiff.Dual{ForwardDiff.Tag{var"#155#156", Float64}, Float64, 2}}, v::Vector{ForwardDiff.Dual{ForwardDiff.Tag{var"#155#156", Float64}, Float64, 2}}) (repeats 79984 times) @ DSP C:\Users\greifenstein\.julia\packages\DSP\wENjI\src\dspbase.jl:710  Note, that the gradient evaluation takes nearly 10 times as long as the evaluation of the function itself and DSP.conv based implementation is ~6 times faster than the direct method. Here, I found a hint that it might be possible to handle the ForwardDiff.value and .partials of a Dual number separately, which would allow to treat them as ordinary Float64 arrays? In this case, would it be possible just to convolve the values and partials independently and recombine them to a Dual? Any hints are highly appreciated! Cheers Max I didn’t run your code because I don’t understand it. It seems like objectiveWithDSPConvolution doesn’t actually use x? If you have a high-dimensional x, I think the answer is “don’t use ForwardDiff.” Use a different AD package to compute the gradient (or come up with an analytical solution), then register both the function and the gradient: People might also be able to help if you can provide a reproducible example. Hi @odow , thanks for your quick reply. Sorry, maybe the MWE was a little confusing as it indeed does not use x but only random numbers. It was merely to demonstrate, that DSP.conv cannot be used inside a function that will be used with ForwardDiff (thus the error shown in the output). I intentionally left out all details concerning the model to boil it down to the sub-problem I am currently working on, which is simply to make it run faster by optimizing the convolution. Here is an updated version of the MWE from above which makes a little more sense and is closer to what the real problem is. Unfortunately, the real problem cannot be differentiated analytically - and as far as I understood, Zygote cannot handle DSP.conv as well. New MWE: using ForwardDiff using DSP using BenchmarkTools using Base.Cartesian # direct version (do not check if threshold is satisfied) @generated function fastconv(E::Array{T,N}, k::Array{T,N}) where {T,N} quote retsize = [size(E)...] + [size(k)...] .- 1 retsize = tuple(retsize...) ret = zeros(T, retsize) convn!(ret,E,k) return ret end end # in place helper operation to speedup memory allocations @generated function convn!(out::Array{T}, E::Array{T,N}, k::Array{T,N}) where {T,N} quote @inbounds begin @nloops$N x E begin
@nloops $N i k begin (@nref$N out d->(x_d + i_d - 1)) += (@nref $N E x) * (@nref$N k i)
end
end
end
return out
end
end

function gaussiankernel(wid)
x = collect(-10:0.01:10)
# get gaussian
u = 0.6006
g = @. exp(-((x)./(u.*wid)) .^2);
# normalize
g = g./sum(g);
return g
end

function objectiveWithDirectConvolution(x::T...) where {T<:Real}

# six different input signals
sig11 = convert(Vector{T},rand(30000))
sig12 = convert(Vector{T},rand(30000))
sig13 = convert(Vector{T},rand(30000))
sig21 = convert(Vector{T},rand(30000))
sig22 = convert(Vector{T},rand(30000))
sig23 = convert(Vector{T},rand(30000))
# two different convolution kernels
kern1 = convert(Vector{T},gaussiankernel(x))
kern2 = convert(Vector{T},gaussiankernel(x))

# final signal is composed by summing up convolved parts
χ = fastconv(sig11,kern1)+
fastconv(sig12,kern1)+
fastconv(sig13,kern1)+
fastconv(sig21,kern2)+
fastconv(sig22,kern2)+
fastconv(sig23,kern2)

# take some random numbers as an experiment
experiment = convert(Vector{T},rand(length(χ)))

# minimize sum of squares
out = sum(χ.^2 .- experiment.^2)

return out

end

function objectiveWithDSPConvolution(x::T...) where {T<:Real}

# six different input signals
sig11 = convert(Vector{T},rand(30000))
sig12 = convert(Vector{T},rand(30000))
sig13 = convert(Vector{T},rand(30000))
sig21 = convert(Vector{T},rand(30000))
sig22 = convert(Vector{T},rand(30000))
sig23 = convert(Vector{T},rand(30000))
# two different convolution kernels
kern1 = convert(Vector{T},gaussiankernel(x))
kern2 = convert(Vector{T},gaussiankernel(x))

# final signal is composed by summing up convolved parts
χ = DSP.conv(sig11,kern1)+
DSP.conv(sig12,kern1)+
DSP.conv(sig13,kern1)+
DSP.conv(sig21,kern2)+
DSP.conv(sig22,kern2)+
DSP.conv(sig23,kern2)

# take some random numbers as an experiment
experiment = convert(Vector{T},rand(length(χ)))

# minimize sum of squares
out = sum(χ.^2 .- experiment.^2)

return out

end

print("Evaluation, DirectConvolution: ")
@btime objectiveWithDirectConvolution(1.,1.)
print("Evaluation, DSP.conv: ")
@btime objectiveWithDSPConvolution(1.,1.)



Output:

Evaluation, DirectConvolution:   45.959 ms (84 allocations: 3.67 MiB)
Evaluation, DSP.conv:   4.364 ms (445 allocations: 11.93 MiB)
Gradient, DirectConvolution:   463.989 ms (113 allocations: 12.55 MiB)
ERROR: StackOverflowError:
Stacktrace:
 conv(u::Vector{ForwardDiff.Dual{ForwardDiff.Tag{var"#47#48", Float64}, Float64, 2}}, v::Vector{ForwardDiff.Dual{ForwardDiff.Tag{var"#47#48", Float64}, Float64, 2}}) (repeats 79984 times)
@ DSP C:\Users\greifenstein\.julia\packages\DSP\wENjI\src\dspbase.jl:710


The result obviously is the same - cannot use DSP.conv with ForwardDiff But maybe the MWE now better shows, what the purpose of the actual code is.
The question really is:
Can I somehow speed this up? I managed to use LoopVectorizations.jl with @turbo and @avxt macros to get a single convolution with a direct implementation (for these signal and kernel sizes) to be as fast (or slightly faster) than DSP.conv(). However, this also breaks when the inputs are of type Dual.
In other words:
Is there any way to treat the parts of a dual number independently, do some calculcations with them, and store it back in the way ForwardDiff/JuMP expects it?

Cheers
Max

• You shouldn’t be randomly generating your signal every evaluation?
• You shouldn’t need all the converts.

I might have made a typo, but you should be able to do something like:

using ForwardDiff

const N = 30_000
const M = 2001

function my_conv!(ret, E::AbstractVector, k::AbstractVector)
@inbounds for (i, ei) in enumerate(E), (j, kj) in enumerate(k)
ret[i + j - 1] += ei * kj
end
return
end

function gaussian_kernel(wid)
g = [exp(-(factor * xi)^2) for xi in range(-10, 10; length = M)]
total = sum(g)
g ./= total
return g
end

const signals = rand(N, 2, 3)
const experiment = rand(M + N - 1)

function my_objective(x::T...) where {T<:Real}
kern = gaussian_kernel.(x)
χ = zeros(T, length(experiment))
for i in 1:2, j in 1:3
my_conv!(χ, @view(signals[:, i, j]), kern)
end
return sum((χ[i] - experiment[i])^2 for i in 1:length(experiment))
end

my_objective(x::Vector{T}) where {T<:Real} = my_objective(x...)

@time my_objective([5.0, 3.0])


So yes, there is a 5x penalty, but that seems reasonable for ForwardDiff. It essentially has to do the calculation twice, once for each x element, and the cost of an evaluation is approximately 2x, because it has to propagate the primal and the dual.

This is just a limitation of FowardDiff.

Hi @odow ,
thanks again for your quick help!
Ok, it’s true that regenerating the signals is not required, but for the problem I am facing (which is just a performance issue), it doesn’t really matter. Obviously, for the real optimization problem, there are no random numbers involved but at every iteration, 6 different arrays are extracted from a library of spectra for the convolution.
The converts…well…I had trouble sometimes with type issues (because of the library spectra which are Float32), and did not remove them for the minimal example. Thanks for pointing that out.

I did run the code you provided - and while the function evaluation is very comparable with ~50 ms on my machine using btime, your code runs more than twice as fast (200 ms vs 472 ms) for evaluating the gradient - which is already much better, thank you!

However, I think I still did not make my original point clear enough. I do accept and understand, that getting the gradient with ForwardDiff will take longer than just evaluating the function at a single point for the exact reasons that you mention.

The problem is, due to FFT/DSP.conv() not handling Dual numbers, I have to fall back to the direct convolution implementation, which is painfully slow. Even considering a 5x penalty, if there was a way to use FFT based convolution with dual numbers, it should take in the ballpark of 15 ms. Not 200 ms.

So it is not the ForwardDiff part I want to optimize - I need an efficient way to convolve Dual Numbers.

In the later application, we will have to fit our model to tens of thousands of experimental spectra and the convolution is by far the bottleneck. Getting this to run faster is really the difference between waiting for a couple of days or weeks Cheers
Max

I’d keep going, and come back to this once you have concrete evidence that the gradient is the bottleneck.

In the worst case, it seems like you could derive the analytic gradient with a little effort.

Mh, I thought the whole point of providing a minimal working example to demonstrate the problem was to isolate the identified issue in a more complex code. Obviously, this leads to information loss - such as which other parts might be slowing down the computation. Here is the entire objective function that I have to minimize:

function jumpResidual(x::T...) where {T<:Real}
#function jumpResidual(x::Vector,a::T) where {T}
global globlibs,globexp,ω,Ie,loc,params,evalfloat,evaldual
print("Allocation: ")
@time begin
# DeVectorization of fitting parameters
devectorizeParams!(params,x,loc)
#params.T = params.T*10
#params.X["N2"] = params.X["N2"]/1000

# Library parameters
available_species = unique(vcat([collect(keys(lib)) for lib in globlibs]...))
γ = convert(T,globlibs[available_species].γ)
ω = globlibs[collect(keys(globlibs))].ω
ωres = convert(T,globlibs[available_species].ωres)   # wavenumber resolution used in library
num_pr = length(globlibs)   # number of pumping regions

# get the temperature index
# if interpolation is enabled, get the two neighboring ones
if CARSOptions["Interpolate"]
num_T = 1
idx0 = findlower(globlibs[available_species].T,params.T)
idx1 = idx0+1
idxs = [idx0 idx1]
else
num_T = 1
idxs = findnearest(globlibs[available_species].T,params.T)
end

# PREALLOCATIONS (optimize with memoizing?)
χM = zeros(T,num_pr,length(ω),num_T)
χR = zeros(T,num_pr,length(ω),num_T)
χI = zeros(T,num_pr,length(ω),num_T)
I = zeros(T,length(ω),num_T)

# depending on the input type, some operations are implemented differently
# the automatic differentiation in ForwardDiff uses the Dual type (which is <:Real)
# but for function evaluations, the type may be Float64/Float32. This allows to use
# optimizations from LoopVectorization.jl such as @avxt or @tturbo which makes it much faster.
dual = occursin("Dual",string(typeof(I)))
#χNR = convert(T,0)
P = globlibs[available_species].P
N = NA*273.15/params.T*(P/molvol);          # number density based on temperature and pressure
end
print("NR Signal:")
@time begin
# add the correct non-resonant signal contributions as a dictionary to params
χNR = Dict(name => convert(T,0) for name in available_species)
for species in available_species
# find species in one of the libraries
χNR[species]=globlibs[species].χNR
continue
χNR[species]=globlibs[species].χNR
continue
else
error("Couldn't find χNR for species \$(species). Aborting.")
end
end
# store in params struct
params.χNR = χNR

χB = params.χB/NA*molvol;
XB = 1-sum(values(params.X))
χNR = convert(T,N*(XB*χB + sum([params.X[species]*params.χNR[species] for species in available_species])))
end
# library access
print("Get Tinterp: ")
@time begin
T0 = globlibs[available_species].T[idxs]
T1 = globlibs[available_species].T[idxs]
Tx = params.T

w0 = (T1-Tx)/(T1-T0)
w1 = (Tx-T0)/(T1-T0)
end
#for i in 1:2
print("Library Access:")
@time for pr in 1:size(χM,1)
for species in available_species
X = convert(T,params.X[species]);
if dual
@. χM[pr,:] += (w0*globlibs[pr][species].χM[idxs,:]+w1*globlibs[pr][species].χM[idxs,:]) * X^2
@. χR[pr,:] += (w0*globlibs[pr][species].χR[idxs,:]+w1*globlibs[pr][species].χR[idxs,:]) * X
@. χI[pr,:] += (w0*globlibs[pr][species].χI[idxs,:]+w1*globlibs[pr][species].χI[idxs,:]) * X
else
@avxt for j in 1:size(χM,2); χM[pr,j] += (w0*globlibs[pr][species].χM[idxs,j]+w1*globlibs[pr][species].χM[idxs,j]) * X^2;end
@avxt for j in 1:size(χM,2); χR[pr,j] += (w0*globlibs[pr][species].χR[idxs,j]+w1*globlibs[pr][species].χR[idxs,j]) * X;end
@avxt for j in 1:size(χM,2); χI[pr,j] += (w0*globlibs[pr][species].χI[idxs,j]+w1*globlibs[pr][species].χI[idxs,j]) * X;end
end
end
end
end
#end
i=1
# convolution to final laser linewidths

print("Convolution: ")
@time for pr in 1:size(χM,1)
# convolve each pumping region with the "leftover" gaussian kernel
kern = gaussiankernelT(ωres,sqrt(params.LineWidth[pr]^2-γ^2));
χM[pr,:,i]=fastconv(χM[pr,:,i],kern,dual)
χI[pr,:,i]=fastconv(χI[pr,:,i],kern,dual)
χR[pr,:,i]=fastconv(χR[pr,:,i],kern,dual)
end

if dual
print("DUAL Summation: ")
evaldual += 1
@time if size(χM,1) == 1  # single pump case: reduced from dual pump case with 1 == 2
@. I[:,i] = 2*(χM[1,:,i] + 2*χNR*χR[1,:,i] + χNR^2) +
2*(χR[1,:,i]^2 + 2*χNR*χR[1,:,i] + χNR^2 + χI[1,:,i]^2)

else                # dual pump case. divide amplitudes by 2? or is the non resonant background added twice and everything is fine?
@. I[:,i] = χM[1,:,i] + 2*χNR*χR[1,:,i] + χNR^2 +
χM[2,:,i] + 2*χNR*χR[2,:,i] + χNR^2 +
2*(χR[1,:,i]*χR[2,:,i] + χNR*χR[1,:,i] + χNR*χR[2,:,i] + χNR^2 + χI[1,:,i]*χI[2,:,i])

end

else
print("AVXT Summation: ")
evalfloat += 1
if size(χM,1) == 1  # single pump case: reduced from dual pump case with 1 == 2
@time @avxt for j in 1:size(I,1)
I[j,i] = 2*(χM[1,j,i] + 2*χNR*χR[1,j,i] + χNR^2) +
2*(χR[1,j,i]^2 + 2*χNR*χR[1,j,i] + χNR^2 + χI[1,j,i]^2)
end
else
@time @avxt for j in 1:size(I,1)
I[j,i] = χM[1,j,i] + 2*χNR*χR[1,j,i] + χNR^2 +
χM[2,j,i] + 2*χNR*χR[2,j,i] + χNR^2 +
2*(χR[1,j,i]*χR[2,j,i] + χNR*χR[1,j,i] + χNR*χR[2,j,i] + χNR^2 + χI[1,j,i]*χI[2,j,i])
end
end
end
#end

# if interpolation is used, do it now. better to just interpolate one final spectrum rather than up to 6.
#=
print("Temp. Interpolation: ")
@time if CARSOptions["Interpolate"]
T0 = globlibs[available_species].T[idxs]
T1 = globlibs[available_species].T[idxs]
Tx = params.T
# overwrite first element in I to save memory
I[:,1] = I[:,1] + (Tx-T0)/(T1-T0) * (I[:,2]-I[:,1])
end
#println([params.T, params.X["N2"]])
=#
# do the last convolution
print("Final convolution: ")
@time I[:,1] = fastconv(I[:,1],gaussiankernelT(ωres,convert(T,params.Instrumental)),dual)

print("Rest: ")
@time begin
@. I[:,1] = sqrt(I[:,1])

# interpolate on final grid
intf = linear_interpolation(ω, I[:,1])
Ie = intf[globexp[:,1]]

# normalize to sum
Ie = Ie ./ sum(Ie)

# calculate the residual
res = globexp[:,2] - Ie
val = sum(res.^2)
#println(params.T)
#println(val)
end
return val
end


This function is heavy work in progress and there are useless parts in it - I know that. These are artifacts from attempting to use different kinds of solvers, which may not need a gradient at all, such as BlackBoxOptim or a Genetic Algorithm.
The code is at the moment wrapped in begin...end blocks to time them individually.

I do already treat Dual numbers and Floats differently (and yes, this can be done more elegantly by having two functions accepting different types, this was just to try it out and see if there is some benefit), which leads to some performance gain for the cases where I simply have to evaluate the function using Floats.

If I run this code, that is evaluating the function and getting the gradient with respect to 4 decision variables:

T = rand(300.:2000.)
X1 = rand(0.:0.001:1.)
X2 = rand(0.:0.001:1.)
L = rand(0.8:0.001:1.2)
println("####### FUNCEVAL #######")
@time jumpResidual(T,X1,X2,L)


This is the output after running it multiple times to get rid of compilation time:

####### FUNCEVAL #######
Allocation:   0.002627 seconds (308 allocations: 1.643 MiB)
NR Signal:  0.000106 seconds (30 allocations: 1.328 KiB)
Get Tinterp:   0.000031 seconds (10 allocations: 480 bytes)
Library Access:  0.002455 seconds (231 allocations: 9.156 KiB)
Convolution:   0.006463 seconds (513 allocations: 6.154 MiB)
AVXT Summation:   0.000353 seconds (45 allocations: 1.688 KiB)
Final convolution:   0.001642 seconds (84 allocations: 1.129 MiB)
Rest:   0.005513 seconds (57 allocations: 1.943 MiB)
0.025435 seconds (1.50 k allocations: 10.895 MiB)
Allocation:   0.006881 seconds (352 allocations: 8.054 MiB)
NR Signal:  0.000121 seconds (30 allocations: 2.234 KiB)
Get Tinterp:   0.000040 seconds (10 allocations: 608 bytes)
Library Access:  0.013246 seconds (188 allocations: 8.252 MiB)
Convolution:   0.304534 seconds (135 allocations: 20.833 MiB)
DUAL Summation:   0.006778 seconds (116 allocations: 11.453 MiB)
Final convolution:   0.061118 seconds (23 allocations: 3.518 MiB)
Rest:   0.005144 seconds (57 allocations: 7.131 MiB)
0.403787 seconds (1.13 k allocations: 59.259 MiB)


So yes, the gradient is the bottleneck - and within the gradient, it is the convolution, which is 50 times slower than for the single evaluation, because I have to fall back to direct convolution instead of using DSP.conv()/LoopVectorization/FFT because none of them can handle Dual Numbers.

To see how often the function is evaluated with Dual Number vs. Float64, I keep track of the number of calls with each input type during an optimization using the globals evaldual, evalfloat and it turns out, that even though Ipopt returns the following:

Number of objective function evaluations             = 107
Number of objective gradient evaluations             = 27
Number of equality constraint evaluations            = 0
Number of inequality constraint evaluations          = 0
Number of equality constraint Jacobian evaluations   = 0
Number of inequality constraint Jacobian evaluations = 0
Number of Lagrangian Hessian evaluations             = 0
Total seconds in IPOPT                               = 21.346

EXIT: Optimal Solution Found.

println([evaldual evalfloat])
[84 85]


Meaning that every other call is using Dual Numbers as input type.

And again, it is not the gradient calculation itself, that slows me down. It is that I cannot use an efficient way to calculate the convolution of two Dual number vectors during the calculation of the gradient. I totally accept that there is a penalty on calculating the gradient - but the difference in runtime just arises from the aforementioned limitation of Dual Number convolution.

Providing an analytical gradient - well, I did think about it and for some of the parameters, it is actually not that hard. But for example T, which is the temperature dependency of the spectra, this get’s really tricky. In part, this is due to the strategy of accessing a library of precomputed spectra for different molecules, which are all just f(T) but have a vastly different physical model behind them and thus a very different gradient with respect to T. And the whole point of using this library is to get rid of the really computationally expensive stuff. Essentially, for every molecule we have to solve a complex Lorentzian on a much finer grid resolution for every transition in the area of interest. For CO2, this is in the order of 10,000 transitions, each on an array with 400,000 Elements. Depending on the model parameters, these individual transitions all depend on temperature - but not in the same way. So no, getting the gradient analytically is really not an option.

Ok, I think I found a way:

using DSP
using ForwardDiff

function gaussian_kernel(wid)
g = [exp(-(xi/(0.6006*wid))^2) for xi in range(-10, 10; length = M)]
total = sum(g)
g ./= total
return g
end

const N = 30000
const M = 500
const signal = rand(N)
const experiment = DSP.conv(signal,gaussian_kernel(5.))

function my_objective_DSP(x::T...) where {T<:Real}
kern = gaussian_kernel.(x)
# DSP version
χ=DSP.conv(signal, kern)
# do something with the leftover xs
χ = sum([χ .+ x[i] for i = 1:length(x)])
return sum((χ .- experiment).^2)
end

function my_conv!(ret, E::AbstractVector, k::AbstractVector)
@inbounds for (i, ei) in enumerate(E), (j, kj) in enumerate(k)
ret[i + j - 1] += ei * kj
end
return
end

function my_objective_direct(x::T...) where {T<:Real}
kern = gaussian_kernel.(x)
# Direct version
χ=zeros(T,length(experiment))
my_conv!(χ, signal, kern)
# do something with the leftover xs
χ = sum([χ .+ x[i] for i = 1:length(x)])
return sum((χ .- experiment).^2)
end

function dualConv(signal::Vector{T}, a::Vector{T}) where {T<:Float64}
# if the input is Float64, just use DSP.conv. No need for fancy stuff.
return DSP.conv(signal,a)
end

function dualConv(signal,a::AbstractVector{<:ForwardDiff.Dual{T,V,K}}) where {T,V,K}
# convolve a vector of type V with
Xre   = reinterpret(reshape, Float64, a)
# convolve the signal with values and partials
# this assumes d/dt(f⋆g) = df/dt ⋆ g
values = DSP.conv(signal, Xre[1,:]) # values
partials = [DSP.conv(signal, Xre[i,:]) for i in 2:size(Xre,1)] # partials
# reconstruct the partials
ptl = [ForwardDiff.Partials{K,V}(NTuple{K,V}([partials[j][i] for j in 1:length(partials)])) for i in 1:length(values)]
# return a dual type with the same type as the input - as if nothing happened.
return [ForwardDiff.Dual{T,V,K}(values[i], ptl[i]) for i in 1:length(values)]
end

function my_objective_smart(x::T...) where {T<:Real}
global χVs, χPs, χVd, χPd
kern = gaussian_kernel.(x)
χ = dualConv(signal, kern)
# do something with the leftover xs
χ = sum([χ .+ x[i] for i = 1:length(x)])
return sum((χ .- experiment).^2)
end

my_objective_direct(x::Vector{T}) where {T<:Real} = my_objective_direct(x...)
my_objective_DSP(x::Vector{T}) where {T<:Real} = my_objective_DSP(x...)
my_objective_smart(x::Vector{T}) where {T<:Real} = my_objective_smart(x...)

args = [3.0,5.,10.,50.]

a1 = my_objective_direct(args)
a2 = my_objective_DSP(args)
a3 = my_objective_smart(args)
println("######## EVALUATE FUNCTIONS ########")
print("Results for evaluation are equivalent: ")
println([a1,a2,a3,a1 ≈ a2,a1≈a3])
print("Direct convolution: ")
@time my_objective_direct(args)
print("DSP convolution: ")
@time my_objective_DSP(args)
print("Smart convolution: ")
@time my_objective_smart(args)

print("Direct convolution: ")
print("Smart convolution: ")
println([g1,g2])
g1≈g2



In this approach, dualConv makes use of two ideas:

Xre   = reinterpret(reshape, Float64, a)


This allows treating the values and partials just as if they were ordinary Float64s for the convolution.
Furthermore assuming

\frac{d}{dt}(f \star g) = \frac{d}{dt}f \star g = f \star \frac{d}{dt}g

allows to convolve the signal (which is Float64 in this case) with the value and partials of the Dual number individually using DSP.conv().
One is then left with the task to reconstruct an array of Duals that ForwardDiff is happy with:

ptl = [ForwardDiff.Partials{K,V}(NTuple{K,V}([partials[j][i] for j in 1:length(partials)])) for i in 1:length(values)]
# return a dual type with the same type as the input - as if nothing happened.
return [ForwardDiff.Dual{T,V,K}(values[i], ptl[i]) for i in 1:length(values)]


Et voila, the result is what I hoped to achieve:

a1 = my_objective_direct(args)
a2 = my_objective_DSP(args)
a3 = my_objective_smart(args)
println("######## EVALUATE FUNCTIONS ########")
print("Results for evaluation are equivalent: ")
println([a1,a2,a3,a1 ≈ a2,a1≈a3])
print("Direct convolution: ")
@time my_objective_direct(args)
print("DSP convolution: ")
@time my_objective_DSP(args)
print("Smart convolution: ")
@time my_objective_smart(args)

print("Direct convolution: ")
print("Smart convolution: ")
println([g1,g2])
g1≈g2

######## EVALUATE FUNCTIONS ########
Results for evaluation are equivalent: [1.4721889828856158e8, 1.4721889828856158e8, 1.4721889828856158e8, 1.0, 1.0]
Direct convolution:   0.002104 seconds (57 allocations: 2.112 MiB)
DSP convolution:   0.000881 seconds (119 allocations: 2.210 MiB)
Smart convolution:   0.001075 seconds (119 allocations: 2.210 MiB)
Direct convolution:   0.038000 seconds (65 allocations: 10.551 MiB, 26.31% gc time)
Smart convolution:   0.005152 seconds (30.89 k allocations: 15.947 MiB)
Are gradients equal? [[4.237768261388187e6, 4.237918798675122e6, 4.237918798675122e6, 4.237918798675122e6], [4.237768261388187e6, 4.237918798675122e6, 4.237918798675122e6, 4.237918798675122e6]]
true



With 4 variables (only one of which is used for convolution at the moment, the rest is dummy added to χ just to ensure that the function handles multiple partials correctly), I get a speed up of ~7.6.
This strategy might help in many circumstances where a) Dual numbers are not supported directly and b) one can handle the transport of partial derivatives through the function one wants to implement.

Now I just have to make it work for convolving two Duals Cheers
Max

1 Like

Very nice work. We use this trick internally in JuMP to implement forward-over-reverse AD, but I didn’t suggest it because it’s not exactly straight-forward to get working.

If you’re interested, here are the relevant parts in JuMP. They might give you ideas:

Thank you, @odow , I’ll certainly have a look at it! In the meantime, I managed to implement the convolution of two duals. It is not optimal, because it has very many allocations, but it works and I thought to share it here if someone ever needs it:

function dualConv(signal::AbstractVector{<:ForwardDiff.Dual{T,V,K}},a::AbstractVector{<:ForwardDiff.Dual{T,V,K}}) where {T,V,K}
# function to convolve to Dual Numbers
Sre   = reinterpret(reshape, V, signal)
Xre   = reinterpret(reshape, V, a)
values = DSP.conv(Sre[1,:], Xre[1,:]) # values
# these are a lot of convolutions
# at the moment, these are two for each variable that is used for the gradient
partials = [DSP.conv(Sre[1,:], Xre[i,:]) + DSP.conv(Xre[1,:], Sre[i,:]) for i in 2:size(Xre,1)] # partials
ptl = [ForwardDiff.Partials{K,V}(NTuple{K,V}([partials[j][i] for j in 1:length(partials)])) for i in 1:length(values)]
return [ForwardDiff.Dual{T,V,K}(values[i], ptl[i]) for i in 1:length(values)]
end


And here a full (somewhat) minimal example, that shows that the evaluation of the function as well as the gradients are the same. It is not well written, as the amount of arguments has to match the number of convolutions that are performed in the objective functions, but it does show that it works:

using DSP
using ForwardDiff
using BenchmarkTools

function gaussian_kernel(wid::T) where {T}
g = convert(Vector{T},[exp(-(xi/(0.6006*wid))^2) for xi in range(-10, 10; length = M)])
total = sum(g)
g ./= total
return g
end

TY = Float64
const N = 30000
const M = 500
signal = convert(Vector{TY},rand(N))
experiment = convert(Vector{TY},DSP.conv(signal,gaussian_kernel(5.)))
experiment2 = convert(Vector{TY},DSP.conv(experiment,gaussian_kernel(5.)))
experiment3 = convert(Vector{TY},DSP.conv(experiment2,gaussian_kernel(5.)))

function my_objective_DSP(x::T...) where {T<:Real}
kern = gaussian_kernel.(x)
# DSP version
χ=DSP.conv(signal, kern)
χ2=DSP.conv(χ, kern)
χ3=DSP.conv(χ2, kern)
return sum((χ3 .- experiment3).^2)
end

function my_conv!(ret, E::AbstractVector, k::AbstractVector)
@inbounds for (i, ei) in enumerate(E), (j, kj) in enumerate(k)
ret[i + j - 1] += ei * kj
end
return
end

function my_objective_direct(x::T...) where {T<:Real}
kern = gaussian_kernel.(x)
# Direct version
χ=zeros(T,length(experiment))
χ2=zeros(T,length(experiment2))
χ3=zeros(T,length(experiment3))
my_conv!(χ, signal, kern)
my_conv!(χ2, χ, kern)
my_conv!(χ3, χ2, kern)
return sum((χ3 .- experiment3).^2)
end

function my_objective_smart(x::T...) where {T<:Real}
kern = gaussian_kernel.(x)
χ = dualConv(signal, kern)
χ2 = dualConv(χ,kern)
χ3 = dualConv(χ2,kern)
return sum((χ3 .- experiment3).^2)
end

function dualConv(signal::Vector{T}, a::Vector{T}) where {T<:AbstractFloat}
# if the input is Float64, just use DSP.conv. No need for fancy stuff.
return DSP.conv(signal,a)
end

function dualConv(signal::Vector{V},a::AbstractVector{<:ForwardDiff.Dual{T,V,K}}) where {T,V,K}
# convolve a vector of type V with a Dual{T,V,K}
#Sre   = reinterpret(reshape, V, signal)
Xre   = reinterpret(reshape, V, a)
# convolve the signal with values and partials
# this assumes d/dt(f⋆g) = df/dt ⋆ g
values = DSP.conv(signal, Xre[1,:]) # values
partials = [DSP.conv(signal, Xre[i,:]) for i in 2:size(Xre,1)] # partials
ptl = [ForwardDiff.Partials{K,V}(NTuple{K,V}([partials[j][i] for j in 1:length(partials)])) for i in 1:length(values)]
# return a dual type with the same type as the input - as if nothing happened.
return [ForwardDiff.Dual{T,V,K}(values[i], ptl[i]) for i in 1:length(values)]
end

function dualConv(signal::AbstractVector{<:ForwardDiff.Dual{T,V,K}},a::AbstractVector{<:ForwardDiff.Dual{T,V,K}}) where {T,V,K}
# function to convolve to Dual Numbers
Sre   = reinterpret(reshape, V, signal)
Xre   = reinterpret(reshape, V, a)
values = DSP.conv(Sre[1,:], Xre[1,:]) # values
# these are a lot of convolutions
# at the moment, these are two for each variable that is used for the gradient
partials = [DSP.conv(Sre[1,:], Xre[i,:]) + DSP.conv(Xre[1,:], Sre[i,:]) for i in 2:size(Xre,1)] # partials
ptl = [ForwardDiff.Partials{K,V}(NTuple{K,V}([partials[j][i] for j in 1:length(partials)])) for i in 1:length(values)]
return [ForwardDiff.Dual{T,V,K}(values[i], ptl[i]) for i in 1:length(values)]
end

my_objective_direct(x::Vector{T}) where {T<:Real} = my_objective_direct(x...)
my_objective_DSP(x::Vector{T}) where {T<:Real} = my_objective_DSP(x...)
my_objective_smart(x::Vector{T}) where {T<:Real} = my_objective_smart(x...)

args = convert(Vector{TY},[5.0,10.,13.])

a1 = my_objective_direct(args)
a2 = my_objective_DSP(args)
a3 = my_objective_smart(args)
println("######## EVALUATE FUNCTIONS ########")
print("Are the results equal? ")
println(a1 ≈ a2 && a1≈a3)
print("Direct convolution: ")
@time my_objective_direct(args)
print("DSP convolution: ")
@time my_objective_DSP(args)
print("Smart convolution: ")
@time my_objective_smart(args)

print("Direct convolution: ")
print("Smart convolution: ")
println(g1≈g2)
println(g1)
println(g2)

######## EVALUATE FUNCTIONS ########
Are the results equal? true
Direct convolution:   0.006577 seconds (29 allocations: 985.562 KiB)
DSP convolution:   0.002343 seconds (212 allocations: 1.256 MiB)
Smart convolution:   0.002047 seconds (212 allocations: 1.256 MiB)