Performance of naive convolution against Python Numpy

I wrote a naive convolution function for some personal purpose and tested it against Python numpy’s function (which does not use FFT).
On my computer the Julia script is about 7x slower than Python numpy’s version. Could anyone help me understand why it’s slow or how to accelerate the code?
Here is my attempt:

using PyCall
np = pyimport("numpy")

function naive_convol_full!(w, u, v)
	if length(u)<length(v)
		naive_convol_full!(w, v, u)
	else
		n=length(u)
		m=length(v)
		for i=0:n+m-2
			k=max(i+1-m, 0)
			l=min(n-1,i) 

			w[i+1]=zero(eltype(u))
			for j=k:l
				w[i+1]+=u[j+1]*v[i+1-j]
			end
		end
	end
end
A = rand(10000); B=rand(10000); D = zeros(Float64, (length(A)+ length(B)-1));

and the REPL output:

julia> @time naive_convol_full!(D, A, B)
  0.126681 seconds
julia> @time C=np.convolve(A, B, "full");
  0.018948 seconds (41 allocations: 158.094 KiB)
julia> C≈D
true

Try adding @inbounds or @simd to the hot loops. Currently your implementation checks whether the indices are valid, for example, and that slows things down.
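For instance, a minimal sketch of where those macros typically go (a generic loop, not your exact function; @inbounds is only safe once you know the indices cannot go out of bounds):

function dotsum(u, v)
    s = zero(eltype(u))
    # @inbounds removes the bounds checks; @simd lets the compiler reorder
    # the reduction so it can be vectorized
    @inbounds @simd for i in eachindex(u, v)
        s += u[i] * v[i]
    end
    return s
end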

1 Like

Also, it looks like your current implementation was copied from a language with 0-based indexing.

function naive_convol_full!(w, u, v)
	if length(u)<length(v)
		return naive_convol_full!(w, v, u)
	end
	n=length(u)
	m=length(v)
	@inbounds @fastmath for i=1:n+m-1
		w[i]=zero(eltype(u))
		for j=max(i-m, 0):(min(n,i) - 1)
			w[i]+=u[j+1]*v[i-j]
		end
	end
end

This takes @juliohm’s suggestion and also changes i+1 to i, which makes things a little clearer in my opinion.

1 Like

Why would the numpy version be this double for loop and not FFT-based (for example)?

1 Like

If you drill down in the Numpy source code, I think its convolution ends up at multiarraymodule.c:_pyarray_correlate. It looks like it just does a sequence of dot calls, which probably is some optimized SIMD-based function, and possibly uses threads.

To get similar SIMD utilization, you might want to try LoopVectorization.jl (GitHub: JuliaSIMD/LoopVectorization.jl, "Macro(s) for vectorizing loops").
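As a rough illustration of that structure in Julia (my own sketch, not NumPy's actual code), each output sample becomes one dot product over views:

using LinearAlgebra

function dot_convol_full!(w, u, v)
    # assumes length(u) >= length(v); each w[i] is a single dot product
    n, m = length(u), length(v)
    @inbounds for i = 1:n+m-1
        k = max(i - m, 0)
        l = min(n, i) - 1
        @views w[i] = dot(u[k+1:l+1], v[i-k:-1:i-l])
    end
    return w
end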

2 Likes

Thanks, I didn’t know this trick! It doesn’t make much of a difference though…

function naive_convol_full!(w, u, v)
	if length(u)<length(v)
		naive_convol_full!(w, v, u)
	else
		n=length(u)
		m=length(v)
		@simd for i=0:n+m-2
			k=max(i+1-m, 0)
			l=min(n-1,i) 

			@inbounds w[i+1]=zero(eltype(u))
			@simd for j=k:l
				@inbounds w[i+1]+=u[j+1]*v[i+1-j]
			end
		end
	end
end

REPL

julia> @time naive_convol_full!(D, A,B)
  0.129330 seconds

Sorry, not copied, I wrote it myself :sweat_smile: Thanks for the code though!
I just ran it; I gained a few milliseconds, but not enough to make a difference:

julia> function naive_convol_full!(w, u, v)
               if length(u)<length(v)
                       return naive_convol_full!(w, v, u)
               end
               n=length(u)
               m=length(v)
               @inbounds @fastmath for i=1:n+m-1
                       w[i]=zero(eltype(u))
                       for j=max(i-m, 0):(min(n,i) - 1)
                               w[i]+=u[j+1]*v[i-j]
                       end
               end
       end
naive_convol_full! (generic function with 1 method)

julia> @time naive_convol_full!(D, A,B)
  0.142548 seconds (35.08 k allocations: 1.943 MiB, 11.77% compilation time)

julia> @time naive_convol_full!(D, A,B)
  0.126336 seconds

I don’t actually know, but that’s what the docs say!

I think @stevengj’s answer goes in that direction.

Thanks, I couldn’t figure out how to use @turbo from LoopVectorization, but I got a 3x speedup by replacing the innermost loop with a dot product (following your suggestion). That’ll have to do for now! The numpy call is still about 2.2x faster than the Julia one, though.
Thank you all for your help — I’m still curious if anyone can figure out a better way!

using LinearAlgebra, PyCall
np = pyimport("numpy")

function naive_convol_full!(w, u, v)
	if length(u)<length(v)
		return naive_convol_full!(w, v, u)
	end
	n=length(u)
	m=length(v)
	@inbounds for i=1:n+m-1
		w[i]=zero(eltype(u))
		@simd for j=max(i-m, 0):(min(n,i) - 1)
			w[i]+=u[j+1]*v[i-j]
		end
	end
end

function matmul_convol_full!(w, u, v)
	if length(u)<length(v)
		matmul_convol_full!(w, v, u)
	else
		n=length(u)
		m=length(v)
		@inbounds for i=1:n+m-1
			k=max(i-m, 0)
			l=(min(n,i) - 1) 

			@views w[i]= dot(u[k+1:l+1],  v[i-k:-1:i-l])
		end
	end
end
A = rand(10000); B=rand(10000); D = zeros(Float64, (length(A)+ length(B)-1));
julia> @time naive_convol_full!(D, A, B)
  0.128569 seconds
julia> @time matmul_convol_full!(D, A, B)
  0.041894 seconds
julia> @time C=np.convolve(A, B, "full");
  0.018964 seconds (41 allocations: 158.094 KiB)

romainvieme, without actually checking your code, may I point you to a discussion which followed this post:

1 Like

Looks relevant indeed :smile: I didn’t bump into that in my preliminary research! I’ll be sure to have a look. Thanks!

The naive loop is faster than numpy, just make sure it gets SIMD’ed:

function naive_convol_full!(w, u, v)
    if length(u) < length(v)
        return naive_convol_full!(w, v, u)
    end
    n = length(u)
    m = length(v)
    for i = 1:n+m-1
        s = zero(eltype(u))
        @simd for j = max(i-m,0):min(n,i)-1
            s += u[j+1] * v[i-j]
        end
        w[i] = s
    end
end

@btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=1
  15.383 ms (0 allocations: 0 bytes)

And if you use @tturbo from LoopVectorization, you get it even faster:

    ...
        @tturbo for j = max(i-m,0):min(n,i)-1
            s += u[j+1] * v[i-j]
        end
    ...

@btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=1
  9.398 ms (0 allocations: 0 bytes)
11 Likes

Great! Thanks! That seems to work on my machine:

function naive_convol_full!(w, u, v)
    if length(u) < length(v)
        return naive_convol_full!(w, v, u)
    end
    n = length(u)
    m = length(v)
    @inbounds for i = 1:n+m-1
        s = zero(eltype(u))
        @tturbo for j = max(i-m,0):min(n,i)-1
            s += u[j+1] * v[i-j]
        end
        w[i] = s
    end
end
julia> @btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=100;
  19.402 ms (0 allocations: 0 bytes)

julia> @btime C=np.convolve(A, B, "full") setup=(A=rand(10000); B=rand(10000)) evals=100;
  18.031 ms (41 allocations: 158.09 KiB)

So the difference is 1.5 ms. Works great!

Doing further homework, it does not seem to scale nicely:

julia> @btime C=np.convolve(A, B, "full") setup=(A=rand(100000); B=rand(10000)) evals=100;
  182.196 ms (41 allocations: 861.22 KiB)
julia> @btime naive_convol_full!(D,A,B) setup=(A=rand(100000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=100;
  202.405 ms (0 allocations: 0 bytes)
julia> @btime naive_convol_full!(D,A,B) setup=(A=rand(100000); B=rand(100000); D=zeros(length(A)+length(B)-1)) evals=100;
  3.355 s (0 allocations: 0 bytes)
julia> @btime C=np.convolve(A, B, "full") setup=(A=rand(100000); B=rand(100000)) evals=100;
  724.136 ms (41 allocations: 1.53 MiB)

So I still don’t understand everything! On another laptop, however, the Julia code is clearly faster. I guess it really depends on the hardware then…

Edit: The other laptop

julia> @btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=100;
  15.199 ms (0 allocations: 0 bytes)

julia> @btime C=np.convolve(A, B, "full") setup=(A=rand(10000); B=rand(10000)) evals=100;
  111.462 ms (41 allocations: 158.09 KiB)

By any chance did you benchmark the Python version on your laptop?

Notice that Numpy uses multithreading, so you should be comparing with the multithreaded version in Julia too:

  9.295 ms (0 allocations: 0 bytes)       # Julia (10000)
  17.367 ms (41 allocations: 158.09 KiB)  # numpy (10000)

  873.935 ms (1 allocation: 32 bytes)     # Julia (100000)
  1.213 s (41 allocations: 1.53 MiB)      # numpy (100000)
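For reference, the @tturbo version above already uses threads on the inner loop; a minimal sketch of an alternative that threads the outer loop instead (assuming Julia is started with multiple threads, e.g. julia -t auto; not necessarily the exact code used for the timings above):

function threaded_convol_full!(w, u, v)
    if length(u) < length(v)
        return threaded_convol_full!(w, v, u)
    end
    n = length(u)
    m = length(v)
    # each w[i] is independent, so the outer loop can be split across threads
    Threads.@threads for i = 1:n+m-1
        s = zero(eltype(u))
        @simd for j = max(i-m, 0):min(n, i)-1
            @inbounds s += u[j+1] * v[i-j]
        end
        w[i] = s
    end
    return w
end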
1 Like

I suspected as much, thanks for confirming :grinning: I tried to turn that feature of numpy off but didn’t succeed. Thanks again, I don’t have any further questions!

1 Like

Yep, found that one but couldn’t figure out which BLAS library it’s using on my system (up-to-date Debian stable with Python 3.9.2). I’ll give it more thought.

In [3]: numpy.__config__.show()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
accelerate_info:
  NOT AVAILABLE
blas_info:
    libraries = ['blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    include_dirs = ['/usr/local/include', '/usr/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    include_dirs = ['/usr/local/include', '/usr/include']
    language = c
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_info:
    libraries = ['lapack', 'lapack']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'lapack', 'blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/local/include', '/usr/include']
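For what it’s worth, a common way to limit BLAS/OpenMP threading in general (a sketch only; which variable actually takes effect depends on the BLAS implementation numpy was built against) is to set the environment variables before numpy is imported:

# sketch: these must be set before the BLAS/OpenMP runtime is loaded,
# i.e. before pyimport("numpy")
ENV["OMP_NUM_THREADS"] = "1"
ENV["OPENBLAS_NUM_THREADS"] = "1"
ENV["MKL_NUM_THREADS"] = "1"

using PyCall
np = pyimport("numpy")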