I wrote a naive convolution function for a personal project and tested it against numpy's convolve in Python (which does not use an FFT). On my computer the Julia script is about 7x slower than the numpy version. Could anyone help me understand why it's slow, or how to accelerate the code?
Here is my attempt:
using PyCall
np = pyimport("numpy")
function naive_convol_full!(w, u, v)
    if length(u) < length(v)
        naive_convol_full!(w, v, u)
    else
        n = length(u)
        m = length(v)
        for i = 0:n+m-2
            k = max(i + 1 - m, 0)
            l = min(n - 1, i)
            w[i+1] = zero(eltype(u))
            for j = k:l
                w[i+1] += u[j+1] * v[i+1-j]
            end
        end
    end
end
A = rand(10000); B = rand(10000); D = zeros(Float64, length(A) + length(B) - 1);
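For reference, I compare the two with timing calls along these lines (a sketch of my harness, not the exact script):

@time naive_convol_full!(D, A, B)         # the Julia version above
@time C = np.convolve(A, B, "full");      # numpy's direct (non-FFT) convolution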
Try adding @inbounds or @simd to the hot loops. Currently your implementation checks that every index is valid, for example, and that slows things down.
Also, it looks like your implementation was copied from a language with 0-based indexing.
function naive_convol_full!(w, u, v)
    if length(u) < length(v)
        return naive_convol_full!(w, v, u)
    end
    n = length(u)
    m = length(v)
    @inbounds @fastmath for i = 1:n+m-1
        w[i] = zero(eltype(u))
        for j = max(i-m, 0):(min(n, i) - 1)
            w[i] += u[j+1] * v[i-j]
        end
    end
end
This takes @juliohm’s suggestion and also changes i+1 to i, which makes things a little clearer in my opinion.
If you drill down into the NumPy source code, I think its convolution ends up at multiarraymodule.c:_pyarray_correlate. It looks like it just does a sequence of dot calls, where dot is probably an optimized SIMD-based function, and possibly uses threads.
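Roughly, that strategy looks like this in Julia (my own sketch of the idea, not a translation of the C code; dot_convol_full! is a name I made up):

using LinearAlgebra

# Sketch: full convolution as a sequence of dot calls, one per output
# element, mirroring the structure attributed to _pyarray_correlate above.
function dot_convol_full!(w, u, v)
    n, m = length(u), length(v)
    vr = reverse(v)                # convolving with v = correlating with reverse(v)
    for i in 1:n+m-1
        lo = max(i - m + 1, 1)     # overlap window inside u
        hi = min(i, n)
        @views w[i] = dot(u[lo:hi], vr[m-i+lo : m-i+hi])
    end
    return w
end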
Thanks, I didn’t know this trick! It doesn’t make much of a difference, though…
function naive_convol_full!(w, u, v)
    if length(u) < length(v)
        naive_convol_full!(w, v, u)
    else
        n = length(u)
        m = length(v)
        @simd for i = 0:n+m-2
            k = max(i + 1 - m, 0)
            l = min(n - 1, i)
            @inbounds w[i+1] = zero(eltype(u))
            @simd for j = k:l
                @inbounds w[i+1] += u[j+1] * v[i+1-j]
            end
        end
    end
end
Sorry, not copied, I wrote it myself. Thanks for the code though!
Just ran it; I gained a few milliseconds, but not enough to make a difference.
julia> function naive_convol_full!(w, u, v)
           if length(u) < length(v)
               return naive_convol_full!(w, v, u)
           end
           n = length(u)
           m = length(v)
           @inbounds @fastmath for i = 1:n+m-1
               w[i] = zero(eltype(u))
               for j = max(i-m, 0):(min(n, i) - 1)
                   w[i] += u[j+1] * v[i-j]
               end
           end
       end
naive_convol_full! (generic function with 1 method)
julia> @time naive_convol_full!(D, A,B)
0.142548 seconds (35.08 k allocations: 1.943 MiB, 11.77% compilation time)
julia> @time naive_convol_full!(D, A,B)
0.126336 seconds
Thanks, I couldn’t figure out how to use @turbo from LoopVectorization, but I got a 3x speedup by replacing the innermost loop with a dot product (following your suggestion). That’ll have to do for now! The numpy call is still about 2.2x faster than the Julia one, though.
Thank you all for your help — I’m still curious whether anyone can figure out a better way!
using LinearAlgebra, PyCall
np = pyimport("numpy")

function naive_convol_full!(w, u, v)
    if length(u) < length(v)
        return naive_convol_full!(w, v, u)
    end
    n = length(u)
    m = length(v)
    @inbounds for i = 1:n+m-1
        w[i] = zero(eltype(u))
        @simd for j = max(i-m, 0):(min(n, i) - 1)
            w[i] += u[j+1] * v[i-j]
        end
    end
end

function matmul_convol_full!(w, u, v)
    if length(u) < length(v)
        return matmul_convol_full!(w, v, u)
    end
    n = length(u)
    m = length(v)
    @inbounds for i = 1:n+m-1
        k = max(i - m, 0)
        l = min(n, i) - 1
        @views w[i] = dot(u[k+1:l+1], v[i-k:-1:i-l])
    end
end
A = rand(10000); B = rand(10000); D = zeros(Float64, length(A) + length(B) - 1);
julia> @time naive_convol_full!(D, A, B)
0.128569 seconds
julia> @time matmul_convol_full!(D, A, B)
0.041894 seconds
julia> @time C=np.convolve(A, B, "full");
0.018964 seconds (41 allocations: 158.094 KiB)
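As a sanity check, all of these should agree with numpy up to floating-point roundoff; something along these lines verifies it (C converts to a Julia Vector via PyCall):

naive_convol_full!(D, A, B)
C = np.convolve(A, B, "full")
@assert isapprox(C, D)

matmul_convol_full!(D, A, B)
@assert isapprox(C, D)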
The naive loop is faster than numpy; just make sure it gets SIMD’ed:
function naive_convol_full!(w, u, v)
    if length(u) < length(v)
        return naive_convol_full!(w, v, u)
    end
    n = length(u)
    m = length(v)
    for i = 1:n+m-1
        s = zero(eltype(u))
        @simd for j = max(i-m, 0):min(n, i)-1
            s += u[j+1] * v[i-j]
        end
        w[i] = s
    end
end
using BenchmarkTools
@btime naive_convol_full!(D, A, B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=1
15.383 ms (0 allocations: 0 bytes)
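The local accumulator s is what matters here: the compiler can keep the partial sums in registers and vectorize the reduction, whereas accumulating directly into w[i] through memory tends to defeat @simd.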
And if you use @tturbo from LoopVectorization, you get it even faster:
...
@tturbo for j = max(i-m,0):min(n,i)-1
s += u[j+1] * v[i-j]
end
...
@btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=1
9.398 ms (0 allocations: 0 bytes)
using LoopVectorization

function naive_convol_full!(w, u, v)
    if length(u) < length(v)
        return naive_convol_full!(w, v, u)
    end
    n = length(u)
    m = length(v)
    @inbounds for i = 1:n+m-1
        s = zero(eltype(u))
        @tturbo for j = max(i-m, 0):min(n, i)-1
            s += u[j+1] * v[i-j]
        end
        w[i] = s
    end
end
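Note that @tturbo threads the loop using Julia’s task threads, so Julia needs to be started with more than one thread (e.g. julia -t auto) to see the multithreaded speedup.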
I suspected as much, thanks for confirming. I’ve tried to disable this feature of numpy but didn’t succeed. Thanks again, I don’t have any further questions!
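In case it helps anyone else: the usual knob is to pin the BLAS thread count before numpy is loaded. Whether it has any effect depends on the backend numpy was built against (Debian’s reference BLAS may not be threaded at all), so this is only a guess:

# Assumption: any threading in numpy's dot comes from an OpenMP-based BLAS.
# These must be set before the BLAS library is loaded.
ENV["OMP_NUM_THREADS"] = "1"
ENV["OPENBLAS_NUM_THREADS"] = "1"
using PyCall
np = pyimport("numpy")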
Yep, I found that one, but I couldn’t figure out which BLAS library it’s using on my system (up-to-date Debian stable with Python 3.9.2). I’ll give it more thought.
In [3]: numpy.__config__.show()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
accelerate_info:
  NOT AVAILABLE
blas_info:
    libraries = ['blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    include_dirs = ['/usr/local/include', '/usr/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    include_dirs = ['/usr/local/include', '/usr/include']
    language = c
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_info:
    libraries = ['lapack', 'lapack']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'lapack', 'blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/local/include', '/usr/include']