Hi all,

I have an optimization problem for which I am currently using JuMP with Ipopt. The objective function involves 7 convolutions of two vectors. These convolutions are the performance bottleneck at the moment, so I'd like to speed them up, and I hope you can help me.

The first vector, let's call it the signal, has 30000 elements, and the kernel has about 500. For 6 of the 7 convolutions the kernel is the same, but it changes from iteration to iteration, since it is one of the decision variables of the underlying optimization problem.

Benchmarking some implementations of the convolution, I noticed that for the given sizes of signal and kernel, the DSP.conv method (which I guess uses FFT) is by far the fastest, but unfortunately it does not work with ForwardDiff. FastConv.jl does work, but is much slower than DSP.conv. I tried playing around with LoopVectorization.jl, but this also fails, as it does not work with dual numbers.

Here is an MWE. Note that I included the relevant functions of FastConv.jl directly, as it is not available in the registry:

```
using ForwardDiff
using DSP
using Base.Cartesian
using BenchmarkTools

# direct version (do not check if threshold is satisfied)
@generated function fastconv(E::Array{T,N}, k::Array{T,N}) where {T,N}
    quote
        retsize = [size(E)...] .+ [size(k)...] .- 1
        retsize = tuple(retsize...)
        ret = zeros(T, retsize)
        convn!(ret, E, k)
        return ret
    end
end

# in-place helper operation to speed up memory allocations
@generated function convn!(out::Array{T}, E::Array{T,N}, k::Array{T,N}) where {T,N}
    quote
        @inbounds begin
            @nloops $N x E begin
                @nloops $N i k begin
                    (@nref $N out d -> (x_d + i_d - 1)) += (@nref $N E x) * (@nref $N k i)
                end
            end
        end
        return out
    end
end

function objectiveWithDirectConvolution(x::T...) where {T<:Real}
    sig = convert(Vector{T}, rand(30000))
    kern = convert(Vector{T}, rand(500))
    return sum(fastconv(sig, kern))
end

function objectiveWithDSPConvolution(x::T...) where {T<:Real}
    sig = convert(Vector{T}, rand(30000))
    kern = convert(Vector{T}, rand(500))
    return sum(DSP.conv(sig, kern))
end

print("Evaluation, DirectConvolution: ")
@btime objectiveWithDirectConvolution(1., 1.)
print("Evaluation, DSP.conv: ")
@btime objectiveWithDSPConvolution(1., 1.)
print("Gradient, DirectConvolution: ")
@btime ForwardDiff.gradient(x -> objectiveWithDirectConvolution(x...), [5. 1.])
ForwardDiff.gradient(x -> objectiveWithDSPConvolution(x...), [5. 1.])
```

The output is as follows:

```
Evaluation, DirectConvolution: 1.854 ms (12 allocations: 477.14 KiB)
Evaluation, DSP.conv: 285.000 μs (68 allocations: 576.98 KiB)
Gradient, DirectConvolution: 18.745 ms (26 allocations: 1.63 MiB)
ERROR: StackOverflowError:
Stacktrace:
[1] conv(u::Vector{ForwardDiff.Dual{ForwardDiff.Tag{var"#155#156", Float64}, Float64, 2}}, v::Vector{ForwardDiff.Dual{ForwardDiff.Tag{var"#155#156", Float64}, Float64, 2}}) (repeats 79984 times)
@ DSP C:\Users\greifenstein\.julia\packages\DSP\wENjI\src\dspbase.jl:710
```

Note that the gradient evaluation takes nearly 10 times as long as the function evaluation itself, and that the DSP.conv-based implementation is ~6 times faster than the direct method.

Here I found a hint that it might be possible to handle the ForwardDiff.value and .partials of a Dual number separately, which would allow treating them as ordinary Float64 arrays. In that case, would it be possible to simply convolve the values and partials independently and recombine them into a Dual?
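To make the question concrete, here is a rough sketch of what I have in mind. The helper `dualconv` is hypothetical (not an existing API); I'm assuming the product rule carries over because convolution is bilinear, so d(u ⋆ v) = du ⋆ v + u ⋆ dv, and each of those pieces is a plain Float64 convolution that DSP.conv can handle:

```julia
using DSP, ForwardDiff

# Hypothetical sketch: convolve values and partials separately with DSP.conv
# (which can then take its fast path on Float64 arrays) and reassemble Duals.
function dualconv(u::Vector{ForwardDiff.Dual{T,V,N}},
                  v::Vector{ForwardDiff.Dual{T,V,N}}) where {T,V,N}
    uval = ForwardDiff.value.(u)   # plain value vectors
    vval = ForwardDiff.value.(v)
    cval = DSP.conv(uval, vval)
    # product rule, one pair of convolutions per partial direction i:
    # ∂_i(u ⋆ v) = (∂_i u) ⋆ v + u ⋆ (∂_i v)
    cparts = ntuple(N) do i
        DSP.conv(ForwardDiff.partials.(u, i), vval) .+
        DSP.conv(uval, ForwardDiff.partials.(v, i))
    end
    # reassemble a Dual per output element
    return [ForwardDiff.Dual{T}(cval[j],
                ForwardDiff.Partials(ntuple(i -> cparts[i][j], N)))
            for j in eachindex(cval)]
end
```

This would cost 1 + 2N FFT-based convolutions (N = number of partials) instead of one direct Dual convolution, which for my sizes might still be a win. Is something like this sound, or is there a better way (e.g. a custom rule) to do it?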

Any hints are highly appreciated!

Cheers

Max