Performance of naive convolution against Python Numpy

romainvieme · February 1, 2022, 8:34pm

I wrote a naive convolution function for some personal purpose and tested it against Python numpy’s function (which does not use FFT).
On my computer the Julia script is about 7x slower than Python numpy’s version. Could anyone help me understand why it’s slow or how to accelerate the code?
Here is my tentative:

using PyCall
np = pyimport("numpy")

function naive_convol_full!(w, u, v)
	if length(u)<length(v)
		naive_convol_full!(w, v, u)
	else
		n=length(u)
		m=length(v)
		for i=0:n+m-2
			k=max(i+1-m, 0)
			l=min(n-1,i) 

			w[i+1]=zero(eltype(u))
			for j=k:l
				w[i+1]+=u[j+1]*v[i+1-j]
			end
		end
	end
end
A = rand(10000); B=rand(10000); D = zeros(Float64, (length(A)+ length(B)-1));

and the REPL output:

julia> @time naive_convol_full!(D, A, B)
  0.126681 seconds
julia> @time C=np.convolve(A, B, "full");
  0.018948 seconds (41 allocations: 158.094 KiB)
julia> C≈D
true

juliohm · February 1, 2022, 8:40pm

Try to add @inbounds or @simd calls to the hot loops. Currently your implementation is checking if the indices are valid and that slow things down for example.

Oscar_Smith · February 1, 2022, 8:45pm

Also, it looks like your current implementation was copied from a language with 0 indexing.

function naive_convol_full!(w, u, v)
	if length(u)<length(v)
		return naive_convol_full!(w, v, u)
	end
        n=length(u)
	m=length(v)
	@inbounds @fastmath for i=1:n+m-1
		w[i]=zero(eltype(u))
		for j=max(i-m, 0):(min(n,i) - 1)
			w[i]+=u[j+1]*v[i-j]
		end
	end
end

This takes @juliohm’s suggestion and also changes i+1 to i which makes things a little clearer in my opinion.

rveltz · February 1, 2022, 8:46pm

why would numpy version be this double for loop and not fft based (for example) ?

stevengj · February 1, 2022, 8:53pm

If you drill down in the Numpy source code, I think its convolution ends up at multiarraymodule.c:_pyarray_correlate. It looks like it just does a sequence of dot calls, which probably is some optimized SIMD-based function, and possibly uses threads.

To get similar SIMD utilization, you might want to try GitHub - JuliaSIMD/LoopVectorization.jl: Macro(s) for vectorizing loops.

romainvieme · February 1, 2022, 8:53pm

Thanks, didn’t know this trick ! Doesn’t make much of a difference though…

function naive_convol_full!(w, u, v)
	if length(u)<length(v)
		naive_convol_full!(v, u, w)
	else
		n=length(u)
		m=length(v)
		@simd for i=0:n+m-2
			k=max(i+1-m, 0)
			l=min(n-1,i) 

			@inbounds w[i+1]=zero(eltype(u))
			@simd for j=k:l
				@inbounds w[i+1]+=u[j+1]*v[i+1-j]
			end
		end
	end
end

REPL

julia> @time naive_convol_full!(D, A,B)
  0.129330 seconds

romainvieme · February 1, 2022, 8:55pm

Sorry, not copied, I wrote it myself Thanks for the code though !
Just ran it, I gained a few milliseconds but not enough to make the difference

julia> function naive_convol_full!(w, u, v)
               if length(u)<length(v)
                       return naive_convol_full!(w, v, u)
               end
               n=length(u)
               m=length(v)
               @inbounds @fastmath for i=1:n+m-1
                       w[i]=zero(eltype(u))
                       for j=max(i-m, 0):(min(n,i) - 1)
                               w[i]+=u[j+1]*v[i-j]
                       end
               end
       end
naive_convol_full! (generic function with 1 method)

julia> @time naive_convol_full!(D, A,B)
  0.142548 seconds (35.08 k allocations: 1.943 MiB, 11.77% compilation time)

julia> @time naive_convol_full!(D, A,B)
  0.126336 seconds

romainvieme · February 1, 2022, 8:56pm

I don’t actually know, but that’s what’s said in the docs !

rveltz · February 1, 2022, 9:02pm

I think @stevengj answer goes into that direction

romainvieme · February 1, 2022, 9:45pm

Thanks, I couldn’t figure out how to use @turbo from LoopVectorization but I got a 3x speedup by replacing the innermost loop by a dot product (following your suggestion). That’ll have to do for now ! The numpy call is still about 2.2x faster than the Julia one though.
Thank you all for your help — I’m still curious if anyone can figure out a better way!

using LinearAlgebra, PyCall
np = pyimport("numpy")

function naive_convol_full!(w, u, v)
	if length(u)<length(v)
		return naive_convol_full!(w, v, u)
	end
        n=length(u)
	m=length(v)
	@inbounds for i=1:n+m-1
		w[i]=zero(eltype(u))
		@simd for j=max(i-m, 0):(min(n,i) - 1)
			w[i]+=u[j+1]*v[i-j]
		end
	end
end

function matmul_convol_full!(w, u, v)
	if length(u)<length(v)
		matmul_convol_full!(w, v, u)
	else
		n=length(u)
		m=length(v)
		@inbounds for i=1:n+m-1
			k=max(i-m, 0)
			l=(min(n,i) - 1) 

			@views w[i]= dot(u[k+1:l+1],  v[i-k:-1:i-l])
		end
	end
end
A = rand(10000); B=rand(10000); D = zeros(Float64, (length(A)+ length(B)-1));

julia> @time naive_convol_full!(D, A, B)
  0.128569 seconds
julia> @time matmul_convol_full!(D, A, B)
  0.041894 seconds
julia> @time C=np.convolve(A, B, "full");
  0.018964 seconds (41 allocations: 158.094 KiB)

Eben60 · February 1, 2022, 10:20pm

romainvieme, without actually checking your code, may I point you to a discussion which followed this post:

romainvieme · February 1, 2022, 10:28pm

Looks relevant indeed I didn’t bump into that in my preliminary research ! I’ll be sure to have a look. Thanks !

Seif_Shebl · February 2, 2022, 12:58am

The naive loop is faster than numpy, just make sure it gets SIMD’ed:

function naive_convol_full!(w, u, v)
    if length(u) < length(v)
        return naive_convol_full!(w, v, u)
    end
    n = length(u)
    m = length(v)
    for i = 1:n+m-1
        s = zero(eltype(u))
        @simd for j = max(i-m,0):min(n,i)-1
            s += u[j+1] * v[i-j]
        end
        w[i] = s
    end
end

@btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=1
  15.383 ms (0 allocations: 0 bytes)

And if you use @tturbo from LoopVectorizations , you get it even faster:

    ...
        @tturbo for j = max(i-m,0):min(n,i)-1
            s += u[j+1] * v[i-j]
        end
    ...

@btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=1
  9.398 ms (0 allocations: 0 bytes)

romainvieme · February 2, 2022, 11:33am

Great ! Thanks ! That seems to work on my machine

function naive_convol_full!(w, u, v)
    if length(u) < length(v)
        return naive_convol_full!(w, v, u)
    end
    n = length(u)
    m = length(v)
    @inbounds for i = 1:n+m-1
        s = zero(eltype(u))
        @tturbo for j = max(i-m,0):min(n,i)-1
            s += u[j+1] * v[i-j]
        end
        w[i] = s
    end
end

julia> @btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=100;
  19.402 ms (0 allocations: 0 bytes)

julia> @btime C=np.convolve(A, B, "full") setup=(A=rand(10000); B=rand(10000)) evals=100;
  18.031 ms (41 allocations: 158.09 KiB)

So the difference is 1.5ms. Works great !

romainvieme · February 2, 2022, 11:48am

Doing further homework it does not seem to scale nicely

julia> @btime C=np.convolve(A, B, "full") setup=(A=rand(100000); B=rand(10000)) evals=100;
  182.196 ms (41 allocations: 861.22 KiB)
julia> @btime naive_convol_full!(D,A,B) setup=(A=rand(100000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=100;
  202.405 ms (0 allocations: 0 bytes)
julia> @btime naive_convol_full!(D,A,B) setup=(A=rand(100000); B=rand(100000); D=zeros(length(A)+length(B)-1)) evals=100;
  3.355 s (0 allocations: 0 bytes)
julia> @btime C=np.convolve(A, B, "full") setup=(A=rand(100000); B=rand(100000)) evals=100;
  724.136 ms (41 allocations: 1.53 MiB)

So I still don’t understand everything ! On another laptop however, the Julia code is clearly faster. I guess that really depends on hardware then…

Edit: The other laptop

julia> @btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=100;
  15.199 ms (0 allocations: 0 bytes)

julia> @btime C=np.convolve(A, B, "full") setup=(A=rand(10000); B=rand(10000)) evals=100;
  111.462 ms (41 allocations: 158.09 KiB)

romainvieme · February 2, 2022, 12:01pm

By any chance did you benchmark the Python version on your laptop?

Seif_Shebl · February 2, 2022, 12:25pm

Notice that Numpy uses multithreading, so you should be comparing with the multithreaded version in Julia too:

  9.295 ms (0 allocations: 0 bytes)       # Julia (10000)
  17.367 ms (41 allocations: 158.09 KiB)  # numpy (10000)

  873.935 ms (1 allocation: 32 bytes)     # Julia (100000)
  1.213 s (41 allocations: 1.53 MiB)      # numpy (100000)

romainvieme · February 2, 2022, 12:38pm

I suspected as well, thanks for confirming I’ve tried to shut down this feature of numpy but didn’t succeed. Thanks again, I don’t have any further questions !

stevengj · February 2, 2022, 12:42pm

romainvieme · February 2, 2022, 12:45pm

Yep, found that one but couldn’t figure out which BLAS library it’s using on my system (up-to-date Debian stable with Python 3.9.2). I’ll give it more thought.

In [3]: numpy.__config__.show()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
accelerate_info:
  NOT AVAILABLE
blas_info:
    libraries = ['blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    include_dirs = ['/usr/local/include', '/usr/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    include_dirs = ['/usr/local/include', '/usr/include']
    language = c
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_info:
    libraries = ['lapack', 'lapack']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'lapack', 'blas', 'blas']
    library_dirs = ['/usr/lib/x86_64-linux-gnu']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/local/include', '/usr/include']

Topic		Replies	Views
General questions from Python user Performance	59	4460	March 8, 2021
Numpy 10x faster than Julia ?! What am I doing wrong ?! [solved - julia faster now] Performance question	37	11138	October 15, 2019
Vector addition in Julia slower than numpy in Linux Performance	21	1375	May 1, 2020
Comparing Python, Julia, and C++ Performance broadcast , python	21	33570	November 1, 2018
Slow matrix multiplication in Julia compared to Python numpy New to Julia question	17	5623	May 19, 2018

Performance of naive convolution against Python Numpy

Related topics