Exactly. This is correct. It is easy to write fast Julia code, if one is aware of the basic stuff. But there is an initial set of things to learn and get used to (as expected for everything that does not do magic).
I think we do not disagree in anything.
Naive implementation usually also includes slicing without views. Since numpy slice are equivalent to Julia views, direct translation of numpy code can result in huge allocations and as a result very bad performance.
That’s of course correct, but it seems like this case is the most common demonstration of “Julia is slower than Python.” People tend to:
- benchmark in global scope
- without running once to compile, and
- without interpolating their globals with
$
- and often with a type instability
It’s understandable why they do this. They see a neat blog with a quick Jupyter notebook, they download Julia, and want to try a couple things out. It’s probably necessary to read several different parts of the manual to get a proper, non-naive benchmark.
Naive is not a synonyme for plain, nor for evil. Oxford Languages defines it as “showing a lack of experience, wisdom, or judgement”, or “natural and unaffected; innocent”. I would thus define it as having the best intentions, but little expertise.
Some Julia-specific ways of unintentionally shooting oneself in the foot were listed in the comments above. It is easy to get a slow-down by two orders of magnitude. An example of a slow-down by a factor of 70 due to unnecessary allocations was actually cited in my previous post.
respect to a more optimal Julia code, sure. But what about when compared to Python / Numpy? Your last statement was:
but that post was about C++. Now, I think everyone can agree there are more quicks in C++ than Python or Julia.
Now I could probably write a sufficiently fast solution of the cited problem in Python. I wouldn’t. You won, Julia is the best, and inherently better than Python.
For everyone else - see above. Thank you.
straw man fallacy is not fun. That was not my argument at all.
There have been heaps of threads like “Why is my code translated from Python so much slower in Julia?” I don’t understand the point of denying that naive Julia code can be slower than Python.
I’m yet to see a computationally heavy task (i.e. not about julia starting time is slow) that is written in either: both using for loop, or, both in vectorized/broadcasting style; that shows Julia to be much slower than Python/Numpy.
The point is just that if one doesn’t bother to learn Julian idioms and techniques, and instead just ‘writes Python’ in Julia with surface level syntax changes, it’s absolutely quite easy to end up with massive performance problems.
Perhaps the best example would just be tight loops involving global variables, or creating empty, I typed arrays and then pushing to them. Sure, it’s pretty easy to learn how to avoid these problems, but that’s not the point.
for the record, I just want to amend to:
that, in these practices Python will be slow too (due to similar reason, being a dynamic language itself) and very often slower than Julia:
julia> a = rand(10^5);
julia> b = 0;
julia> @btime for x in a
if x > 0.5
global b+=x
end
end
6.738 ms (349237 allocations: 6.85 MiB)
In [16]: a = np.random.rand(10**5)
In [17]: b=0
In [18]: %%timeit
...: for x in a:
...: global b
...: if x > 0.5:
...: b+=x
...:
13 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But I think everyone in the thread have something to takeaway and I guess I should stop being annoying.
It can sometimes carry a higher cost in julia however.
If you’d like an example, here’s an example from this thread:
In [6]: def euclidian_algorithm_division_count(a, b):
...: division_count = 1
...: if b > a:
...: a, b = b, a
...: while (c := a % b) != 0:
...: a, b = b, c
...: division_count += 1
...: return division_count
...:
...: from random import randint
In [7]: %%timeit
...: N = 10**100
...: M = 10**4
...: division_count_array = []
...: while M > 0:
...: a = randint(1, N)
...: b = randint(1, N)
...: division_count_array.append(euclidian_algorithm_division_count(a, b))
...: M -= 1
292 ms ± 7.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
julia> function euclidean_algorithm_division_count(a, b)
division_count = 1
if b > a
a, b = b, a
end
while (c = a % b) != 0
a, b = b, c
division_count += 1
end
return division_count
end
euclidean_algorithm_division_count (generic function with 1 method)
julia> function main()
N = big(10)^100
M = 10^4
division_count_array = []
while M > 0
a, b = rand(1:N, 2)
push!(division_count_array, euclidean_algorithm_division_count(a, b))
M -= 1
end
end
main (generic function with 1 method)
julia> @btime main()
378.040 ms (5618922 allocations: 110.55 MiB)
It’s of course not very hard to make the julia version beat the Python version, but this straightforward transcription (that even uses a function) of naive Python code can still be slower in julia.
Thanks for bearing with me! This is a pretty neat pedagogical example!
Ah, looks like a BigInt issue, less interesting than I initially thought.
be careful with your timings though. timeit
does report the mean, whereas btime
reports the minimum run time.
The way I’ve tried to get a better equivalent for Python is to use min(timeit.repeat(f(), number=1, repeat=1000))
. So, you get the actual runtime of each trial and then just pick off the minimum.
You can do the following to take the minimum run time in Python:
def py_time(f, number=1_000):
timer = timeit.Timer(f)
time = min(timer.repeat(number=number))
return time/number
In general, minimum is the better / more stable measure for run time because noise is strictly positive (there are no random effects which make your code run faster).
Edit: @mihalybaci was faster
It’s easy to give examples when simple Python code (with common numerical libraries, of course) is faster than a direct translation to julia. E.g. see an old topic of mine, Trig functions very slow.
In the course of my weekend procrastinations I contrieved an example to check my statement: some home-made implementation of convolution in both languages.
Make up some signal and kernel data in Python (unimportant part which is not benchmarked):
import numpy as np
def lorentz(x, gamma, mu=0):
return (gamma**2/(((x-mu)**2)+gamma**2))/(np.pi*gamma)
def signal(x):
return lorentz(x, 0.1, -0.3) + lorentz(x, 0.03, 0.1) + lorentz(x, 0.2, 0.25)
def gauss(x, sigma, mu=0.0):
return np.exp(-(((x-mu)/sigma)**2)/2.0)/(sigma*np.sqrt(2*np.pi))
def kernel(x):
return gauss(x, 0.01)
vgauss = np.vectorize(gauss)
vlorentz = np.vectorize(lorentz)
vkernel = np.vectorize(kernel)
vsignal = np.vectorize(signal)
k_xmax = 0.05
s_xmax = 0.5
prec = 4
len_s = 1 + 10**prec
q = 2*s_xmax/10**prec
len_k2 = int(k_xmax/q)
len_k = 1 + 2*len_k2
s_x = np.linspace(-s_xmax, s_xmax, len_s)
k_x = np.linspace(-k_xmax, k_xmax, len_k)
s_y = vsignal(s_x)
k_y = vkernel(k_x)
Now the same in Julia:
lorentz(x, γ, μ=0) = (γ^2/((x-μ)^2+γ^2))/(π*γ)
signal(x) = lorentz(x, 0.1, -0.3) + lorentz(x, 0.03, 0.1) + lorentz(x, 0.2, 0.25)
gauss(x, σ, μ=0.0) = exp(-(((x-μ)/σ)^2)/2)/(σ*sqrt(2π))
kernel(x) = gauss(x, 0.01)
s_xmax = 0.5
k_xmax = 0.05
prec = 4
len_s = 1 + 10^prec
q = 2*s_xmax/10^prec
len_k2 = Int(k_xmax/q)
len_k = 1 + 2len_k2
s_x = LinRange(-s_xmax, s_xmax, len_s)
k_x = LinRange(-k_xmax, k_xmax, len_k)
s_y = signal.(s_x)
k_y = kernel.(k_x)
Now the convolution in Python:
def convolve1(s_y, k_y, len_s, len_k, q):
r = np.zeros_like(s_y)
k_rev = np.flip(k_y)
len_k2 = int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in range(fst, lst):
r[i-1] = q * np.sum(s_y[i-len_k2:i-len_k2+len_k] * k_rev)
return r
and in Julia:
function convolve1(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in fst:lst
r[i] = q * sum(s_y[i-len_k2+1:i-len_k2+len_k] .* k_rev)
end
return r
end
In fact I of course wrote Julia first, but let’s pretend the code was ported from Python to Julia. As to the code: I can’t guarantee it’s mathematically or otherwise correct, but we don’t need no convolution - just some identical code in both languages, reasonably vectorized in Numpy on the Python side - which it is, I believe.
Let’s test:
def cnv1():
return convolve1(s_y, k_y, len_s, len_k, q)
import timeit
tm = np.amin(timeit.repeat(cnv1, repeat=50, number=1))
print("tm = ", tm)
# tm = 0.062673
julia> using BenchmarkTools
julia> @btime convolve1(s_y, k_y, len_s, len_k, q);
# 24.115 ms (18003 allocations: 140.71 MiB)
Julia is 2.5 times faster - not bad! But still we see a lot of allocations meaning a room for improvement. As mentioned in one of the posts above, let’s use views instead of slices :
function convolve2(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in fst:lst
r[i] = q * sum(view(s_y, i-len_k2+1:i-len_k2+len_k) .* k_rev)
end
return r
end
# julia> @btime convolve2(s_y, k_y, len_s, len_k, q);
# 13.560 ms (9003 allocations: 70.40 MiB)
Substantial improvement, still a lot of allocations. Next try:
function convolve3(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
t = similar(k_y) # pre-allocate a buffer
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
for i in fst:lst
t .= view(s_y, i-len_k2+1:i-len_k2+len_k) .* k_rev
r[i] = q * sum(t)
end
return r
end
# julia> @btime convolve3(s_y, k_y, len_s, len_k, q);
# 1.860 ms (4 allocations: 94.27 KiB)
That looks good! Actually I would wish the compiler was smart enough to do the preallocation on it’s own here.
Probably it can be made still faster by some kind of multithreading, as the for
loop here would perfectly lend itself to parallel execution. Lets’s try it:
Edit: don’t try it - see posts below concerning race conditions
function convolve4(s_y, k_y, len_s, len_k, q)
r = zeros(len_s)
t = similar(k_y)
k_rev = reverse(k_y)
len_k2 = Int((len_k - 1) / 2)
fst = 1 + len_k2
lst = len_s - len_k2 - 1
Threads.@threads for i in fst:lst
t .= view(s_y, i-len_k2+1:i-len_k2+len_k) .* k_rev
r[i] = q * sum(t)
end
return r
end
# julia> @btime convolve4(s_y, k_y, len_s, len_k, q);
# 4.737 ms (25 allocations: 97.73 KiB)
It got worse Maybe the dataset to small for effective use of parallel processing? Let’s make it 10 times bigger:
prec = 5
julia> @btime convolve3(s_y, k_y, len_s, len_k, q);
305.176 ms (6 allocations: 937.92 KiB)
julia> @btime convolve4(s_y, k_y, len_s, len_k, q);
216.561 ms (27 allocations: 941.39 KiB)
Here we indeed got some improvement by multithreading. Can it be improved further?
Now we got it much better, let’s try to get it much worse Let’s feed our first implementation not with arrays of floats, but with arrays of Any
. It is not as contrieved as some would probably claim: The data could come e.g. from a file or some other source without the type guarantee.
s_y_any = Array{Any,1}(undef, len_s)
s_y_any .= s_y
k_y_any = Array{Any,1}(undef, len_k)
k_y_any .= k_y
# julia> @btime convolve1(s_y_any, k_y_any, len_s, len_k, q);
# 476.886 ms (13499992 allocations: 347.25 MiB)
That’t 8 times slower than Python.
I think you can avoid the temporary vector by using a dot product instead (untested):
r[i] = (view(s_y, i-len_k2+1:i-len_k2+len_k)' * k_rev) * q