In case anyone comes across this thread again and wonders the same thing I did: @tturbo
uses multithreading by default, but you need to start Julia with several threads to see the difference. So for a fair comparison:
julia> @btime C=np.convolve(A, B, "full") setup=(A=rand(10000); B=rand(10000)) evals=100;
18.287 ms (41 allocations: 158.09 KiB)
julia> @btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=100;
11.768 ms (0 allocations: 0 bytes)
and after starting Julia with 16 threads:
julia -t 16
julia> @btime C=np.convolve(A, B, "full") setup=(A=rand(10000); B=rand(10000)) evals=100;
18.515 ms (41 allocations: 158.09 KiB)
julia> @btime naive_convol_full!(D,A,B) setup=(A=rand(10000); B=rand(10000); D=zeros(length(A)+length(B)-1)) evals=100;
6.763 ms (0 allocations: 0 bytes)
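
For anyone who wants to check the thread count before benchmarking, here is a minimal sketch. Note that `naive_convol_full_threads!` is a hypothetical stand-in I wrote for illustration, not the `naive_convol_full!` from this thread: it uses `Threads.@threads` from Base instead of `@tturbo`, so it runs without LoopVectorization and will not match the timings above.

```julia
# Hypothetical stand-in for the thread's naive_convol_full!; it uses Base
# threading rather than LoopVectorization's @tturbo, so no packages are needed.
function naive_convol_full_threads!(D, A, B)
    nA, nB = length(A), length(B)
    @assert length(D) == nA + nB - 1
    # Each output index k is written by exactly one task, so there is no data race.
    Threads.@threads for k in eachindex(D)
        acc = zero(eltype(D))
        # Valid i satisfy 1 <= i <= nA and 1 <= k - i + 1 <= nB.
        for i in max(1, k - nB + 1):min(nA, k)
            acc += A[i] * B[k - i + 1]
        end
        D[k] = acc
    end
    return D
end

# Typically 1 unless Julia was started with `-t N` or JULIA_NUM_THREADS is set.
println("threads available: ", Threads.nthreads())

A = [1.0, 2.0, 3.0]; B = [0.0, 1.0, 0.5]
D = zeros(length(A) + length(B) - 1)
naive_convol_full_threads!(D, A, B)
println(D)
```

If `Threads.nthreads()` reports 1, a threaded loop has nothing to spread the work over, which would explain seeing no difference until Julia is started with `-t`.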