The way I’ve tried to get a better equivalent for Python is to use min(timeit.repeat(f(), number=1, repeat=1000))
. So, you get the actual runtime of each trial and then just pick off the minimum.
You can do the following to take the minimum run time in Python:
def py_time(f, number=1_000):
timer = timeit.Timer(f)
time = min(timer.repeat(number=number))
return time/number
In general, minimum is the better / more stable measure for run time because noise is strictly positive (there are no random effects which make your code run faster).
Edit: @mihalybaci was faster
It’s easy to give examples when simple Python code (with common numerical libraries, of course) is faster than a direct translation to julia. E.g. see an old topic of mine, Trig functions very slow.
In the course of my weekend procrastinations I contrieved an example to check my statement: some home-made implementation of convolution in both languages.
Make up some signal and kernel data in Python (unimportant part which is not benchmarked):
import numpy as np
def lorentz(x, gamma, mu=0):
return (gamma**2/(((x-mu)**2)+gamma**2))/(np.pi*gamma)
def signal(x):
return lorentz(x, 0.1, -0.3) + lorentz(x, 0.03, 0.1) + lorentz(x, 0.2, 0.25)
def gauss(x, sigma, mu=0.0):
return np.exp(-(((x-mu)/sigma)**2)/2.0)/(sigma*np.sqrt(2*np.pi))
def kernel(x):
return gauss(x, 0.01)
vgauss = np.vectorize(gauss)
vlorentz = np.vectorize(lorentz)
vkernel = np.vectorize(kernel)
vsignal = np.vectorize(signal)
k_xmax = 0.05
s_xmax = 0.5
prec = 4
len_s = 1 + 10**prec
q = 2*s_xmax/10**prec
len_k2 = int(k_xmax/q)
len_k = 1 + 2*len_k2
s_x = np.linspace(-s_xmax, s_xmax, len_s)
k_x = np.linspace(-k_xmax, k_xmax, len_k)
s_y = vsignal(s_x)
k_y = vkernel(k_x)
Now the same in Julia:
lorentz(x, γ, μ=0) = (γ^2/((x-μ)^2+γ^2))/(π*γ)
signal(x) = lorentz(x, 0.1, -0.3) + lorentz(x, 0.03, 0.1) + lorentz(x, 0.2, 0.25)
gauss(x, σ, μ=0.0) = exp(-(((x-μ)/σ)^2)/2)/(σ*sqrt(2π))
kernel(x) = gauss(x, 0.01)
s_xmax = 0.5
k_xmax = 0.05
prec = 4
len_s = 1 + 10^prec
q = 2*s_xmax/10^prec
len_k2 = Int(k_xmax/q)
len_k = 1 + 2len_k2
s_x = LinRange(-s_xmax, s_xmax, len_s)
k_x = LinRange(-k_xmax, k_xmax, len_k)
s_y = signal.(s_x)
k_y = kernel.(k_x)
Now the convolution in Python:
def convolve1(s_y, k_y, len_s, len_k, q):
r = np.zeros_like(s_y)
k_rev = np.flip(k_y)
len_k2 = int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in range(fst, lst):
r[i-1] = q * np.sum(s_y[i-len_k2:i-len_k2+len_k] * k_rev)
return r
and in Julia:
function convolve1(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in fst:lst
r[i] = q * sum(s_y[i-len_k2+1:i-len_k2+len_k] .* k_rev)
end
return r
end
In fact I of course wrote Julia first, but let’s pretend the code was ported from Python to Julia. As to the code: I can’t guarantee it’s mathematically or otherwise correct, but we don’t need no convolution - just some identical code in both languages, reasonably vectorized in Numpy on the Python side - which it is, I believe.
Let’s test:
def cnv1():
return convolve1(s_y, k_y, len_s, len_k, q)
import timeit
tm = np.amin(timeit.repeat(cnv1, repeat=50, number=1))
print("tm = ", tm)
# tm = 0.062673
julia> using BenchmarkTools
julia> @btime convolve1(s_y, k_y, len_s, len_k, q);
# 24.115 ms (18003 allocations: 140.71 MiB)
Julia is 2.5 times faster - not bad! But still we see a lot of allocations meaning a room for improvement. As mentioned in one of the posts above, let’s use views instead of slices :
function convolve2(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in fst:lst
r[i] = q * sum(view(s_y, i-len_k2+1:i-len_k2+len_k) .* k_rev)
end
return r
end
# julia> @btime convolve2(s_y, k_y, len_s, len_k, q);
# 13.560 ms (9003 allocations: 70.40 MiB)
Substantial improvement, still a lot of allocations. Next try:
function convolve3(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
t = similar(k_y) # pre-allocate a buffer
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in fst:lst
t .= view(s_y, i-len_k2+1:i-len_k2+len_k) .* k_rev
r[i] = q * sum(t)
end
return r
end
# julia> @btime convolve3(s_y, k_y, len_s, len_k, q);
# 1.860 ms (4 allocations: 94.27 KiB)
That looks good! Actually I would wish the compiler was smart enough to do the preallocation on it’s own here.
Probably it can be made still faster by some kind of multithreading, as the for
loop here would perfectly lend itself to parallel execution. Lets’s try it:
Edit: don’t try it - see posts below concerning race conditions
function convolve4(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
t = similar(k_y)
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
Threads.@threads for i in fst:lst
t .= view(s_y, i-len_k2+1:i-len_k2+len_k) .* k_rev
r[i] = q * sum(t)
end
return r
end
# julia> @btime convolve4(s_y, k_y, len_s, len_k, q);
# 4.737 ms (25 allocations: 97.73 KiB)
It got worse Maybe the dataset to small for effective use of parallel processing? Let’s make it 10 times bigger:
prec = 5
julia> @btime convolve3(s_y, k_y, len_s, len_k, q);
305.176 ms (6 allocations: 937.92 KiB)
julia> @btime convolve4(s_y, k_y, len_s, len_k, q);
216.561 ms (27 allocations: 941.39 KiB)
Here we indeed got some improvement by multithreading. Can it be improved further?
Now we got it much better, let’s try to get it much worse Let’s feed our first implementation not with arrays of floats, but with arrays of Any
. It is not as contrieved as some would probably claim: The data could come e.g. from a file or some other source without the type guarantee.
s_y_any = Array{Any,1}(undef, len_s)
s_y_any .= s_y
k_y_any = Array{Any,1}(undef, len_k)
k_y_any .= k_y
# julia> @btime convolve1(s_y_any, k_y_any, len_s, len_k, q);
# 476.886 ms (13499992 allocations: 347.25 MiB)
That’t 8 times slower than Python.
I think you can avoid the temporary vector by using a dot product instead (untested):
r[i] = (view(s_y, i-len_k2+1:i-len_k2+len_k)' * k_rev) * q
Also I don’t think you can use multithreading as long as you use a temporary t
, since there will be race conditions for writing into t
.
But try threading with the dot product instead.
Update: I indeed get a nice speedup with threading, if I remove the temporary t
:
function convolve5t(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
k_rev = reverse(k_y)
len_k2 = div((len_k - 1), 2) # use div, not Int(x/2), here
fst = 1 + len_k2
lst = len_s - len_k2 - 1
Threads.@threads for i in fst:lst
r[i] = q * (view(s_y, i-len_k2+1:i-len_k2+len_k)' * k_rev)
end
return r
end
jl> @btime convolve5t($s_y, $k_y, $len_s, $len_k, $q);
419.700 μs (84 allocations: 93.52 KiB)
Actually there you don’t need that allocation at all:
function convolve_new(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in fst:lst
k = firstindex(k_rev)
t = 0.
@inbounds @simd for j in i-len_k2+1:i-len_k2+len_k
t += s_y[j]*k_rev[k]
k += 1
end
r[i] = q*t
end
return r
end
And this turns out to be 2 times faster than coevolve3
:
julia> @btime convolve3($s_y, $k_y, $len_s, $len_k, $q);
1.665 ms (4 allocations: 94.27 KiB)
julia> @btime convolve_new($s_y, $k_y, $len_s, $len_k, $q);
880.245 μs (3 allocations: 86.27 KiB)
Yes, just doing
@views function convolve_dot(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
k_rev = reverse(k_y)
len_k2 = (len_k - 1) ÷ 2
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in fst:lst
r[i] = q * (s_y[i-len_k2+1:i-len_k2+len_k]' * k_rev)
end
return r
end
@btime convolve_dot($s_y, $k_y, $len_s, $len_k, $q);
it got about 30x faster than the original convolve1
function on my laptop.
“Avoid allocations in inner loops” is a pretty simple rule, but it is unfamiliar to most people coming from Python or Matlab or R where allocations are a tertiary concern and vectorization is everything.
(It also wouldn’t be terrible to simply write out the inner loop, though to get peak performance one would need @simd
or @avx
. This is especially useful if you are doing something more complicated than a convolution. You don’t always need or want to dig through the standard library for a function that does exactly what you need!)
Maybe that is what I did there. Here it is slightly faster than using the dot product:
julia> @btime convolve_dot($s_y, $k_y, $len_s, $len_k, $q);
938.193 μs (3 allocations: 86.27 KiB)
julia> @btime convolve_new($s_y, $k_y, $len_s, $len_k, $q);
880.281 μs (3 allocations: 86.27 KiB)
Edit: And that version, threaded (4 threads) works pretty well:
function convolve_new_threaded(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
Threads.@threads for i in fst:lst
k = firstindex(k_rev)
t = 0.
@inbounds @simd for j in i-len_k2+1:i-len_k2+len_k
t += s_y[j]*k_rev[k]
k += 1
end
r[i] = q*t
end
return r
end
julia> Threads.nthreads()
4
julia> @btime convolve_new_threaded($s_y, $k_y, $len_s, $len_k, $q);
249.114 μs (24 allocations: 89.30 KiB)
Also keep in mind that removing that temporary array is not just about performance, but also correctness. The different threads will trample each other accessing and modifying the temporary out of sequence. It can even be modified by one thread while another is trying sum it.
Yes, that is very true! My convolve4
actually produces a wrong result.
Somehow I kept it in mind with the race condition, but then forgot.
The following works correctly:
function convolve6(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
t0 = similar(k_y)
t = [t0]
for i in 2:Threads.nthreads() # nthreads is 4 on my computer
push!(t, copy(t0))
end
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
Threads.@threads for i in fst:lst
ti = t[Threads.threadid()]
ti .= view(s_y, i-len_k2+1:i-len_k2+len_k) .* k_rev
r[i] = q * sum(ti)
end
return r
end
# julia> @btime convolve6($s_y, $k_y, $len_s, $len_k, $q);
# 503.801 μs (30 allocations: 121.88 KiB)
(as compared with 1.86 ms for the non-threaded version convolve3
on my computer).
now all results on the same (my) computer:
-
Initial variant:
convolve1
# 24.115 ms (18003 allocations: 140.71 MiB) -
Views instead of slices:
convolve2
# 13.560 ms (9003 allocations: 70.40 MiB) -
Preallocated array:
convolve3
# 1.860 ms (4 allocations: 94.27 KiB) -
Preallocated array with
@threads
:convolve6
# 503.801 μs (30 allocations: 121.88 KiB) -
Using dot-product with
@threads
: convolve5t # 277.300 μs (24 allocations: 89.73 KiB) -
Write out the inner loop using
@simd
:convolve_new
# 982.099 μs (3 allocations: 86.27 KiB) -
The same plus threading:
convolve_new_threaded
# 256.500 μs (24 allocations: 89.73 KiB)
I am curious how a numba
version of the python code would perform there.
or numpy.convolve(a , v , mode)
Well, that is a curiosity (it is probably fast) but doesn’t have the same meaning of comparison relative to a code one can write directly.
My interest is more in these python JIT accelerators, and how they behave relative to Julia code. I will try that, but I may make naive mistakes in python and thus produce meaningless comparisons.
Well, actually I didn’t expect that python would be that much slower that Julia in this case, as most of the computation is actually vectorized in numpy. But it is possible that it is 10000 loops of pure python is what takes the biggest part of these 60 ms - in this case numba acceleration may help.
Adding import numba
and @numba.njit
before every function definition takes the time
of the python version from 51ms
to 16ms
here (I take no responsibility for that result, just added the flags - I have no experience with that).
Code
import numpy as np
import numba
@numba.njit
def lorentz(x, gamma, mu=0):
return (gamma**2/(((x-mu)**2)+gamma**2))/(np.pi*gamma)
@numba.njit
def signal(x):
return lorentz(x, 0.1, -0.3) + lorentz(x, 0.03, 0.1) + lorentz(x, 0.2, 0.25)
@numba.njit
def gauss(x, sigma, mu=0.0):
return np.exp(-(((x-mu)/sigma)**2)/2.0)/(sigma*np.sqrt(2*np.pi))
@numba.njit
def kernel(x):
return gauss(x, 0.01)
vgauss = np.vectorize(gauss)
vlorentz = np.vectorize(lorentz)
vkernel = np.vectorize(kernel)
vsignal = np.vectorize(signal)
k_xmax = 0.05
s_xmax = 0.5
prec = 4
len_s = 1 + 10**prec
q = 2*s_xmax/10**prec
len_k2 = int(k_xmax/q)
len_k = 1 + 2*len_k2
s_x = np.linspace(-s_xmax, s_xmax, len_s)
k_x = np.linspace(-k_xmax, k_xmax, len_k)
s_y = vsignal(s_x)
k_y = vkernel(k_x)
@numba.njit
def convolve1(s_y, k_y, len_s, len_k, q):
r = np.zeros_like(s_y)
k_rev = np.flip(k_y)
len_k2 = int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in range(fst, lst):
r[i-1] = q * np.sum(s_y[i-len_k2:i-len_k2+len_k] * k_rev)
return r
@numba.njit
def cnv1():
return convolve1(s_y, k_y, len_s, len_k, q)
import timeit
tm = np.amin(timeit.repeat(cnv1, repeat=50, number=1))
print("tm = ", tm)
If I take your code with inner loop convolve_new
and remove @inbounds @simd
, then I get
# julia> @btime convolve_11($s_y, $k_y, $len_s, $len_k, $q);
# 10.022 ms (3 allocations: 86.27 KiB)
which is similar to your numba-JIT python result, so my guess is that numpy-broadcasted multiplication of arrays probably doesn’t use SIMD-vectorization.
Actually I’ve written a similar code myself in the beginning, but with a small but important difference (see below), got a poor result and didn’t put it into discussion especially as I wanted to compare possibly similar Python and Julia codes.
function convolve_12(s_y, k_y, len_s, len_k, q)
# ...........#
for i in fst:lst
k = firstindex(k_rev)
# t = 0. # not using it, but:
for j in i-len_k2+1:i-len_k2+len_k
r[i] += s_y[j]*k_rev[k] # summation directly into r elements
k += 1
end
r[i] *= q
end
return r
end
# julia> @btime convolve_12($s_y, $k_y, $len_s, $len_k, $q);
# 22.445 ms (3 allocations: 86.27 KiB)
By putting ´@inbounds @simd´ back (actually ´@inbounds´ saving is minimal here) we win back about the same 10 ms as in your code, just the difference is not from 10 ms to 1 ms, but from 20 ms to 10ms
@inbounds @simd for j in i-len_k2+1:i-len_k2+len_k
# julia> @btime convolve_13($s_y, $k_y, $len_s, $len_k, $q);
# 9.928 ms (3 allocations: 86.27 KiB)
Two questions:
- Why is it that bad to perform summation directly into array elements?
- Is there anything on this point in the Performance Tips
Because then it has to mutate an element in the heap at every iteration of that loop. Instead, with the temporary variable, the value is accumulated in an immutable stack (or cache) variable, which is much faster. The compiler could eventually figure that out and do that automatically, but that is less predictable, it depends on the context of the code. I am guessing that it is not doing that because the index of r
is part of the counter of the loop where it is modified. But I am just guessing.
(Probably in a CPU register, which is even faster.)