# Quite bad performance of Julia 0.6.4 vs Python+Numpy

Dear all,

I guess I could use a piece of good help from the experts here. The fact is that I was about to translate a piece of Monte Carlo code from my old 0.6.4 program library, into Julia 1.0 (going through 0.7 first), until a friend of mine tried essentially the same code under Anaconda Python on the same computer. I was so disappointed by the results of test runs that I’m wondering what could be wrong on my code.

I post here the Julia and the Python codes:

Julia 0.6.4:

``````function TEST_ais(W,Nsamples)

pl_1  = Int32(1)
zer_0 = Int32(0)
pl_1  = 1
zer_0 = 0

log2  = log(2.0)

Nv, Nh = size(W) .- 1

Wt = transpose(W)

vvk   = rand([zer_0,pl_1],Nsamples,Nv+1)
vvk[:,1]  = pl_1
vv_k  = rand([zer_0,pl_1],Nsamples,Nv+1)
vv_k[:,1] = pl_1
size_v = size(vv_k)

hhk   = rand([zer_0,pl_1],Nsamples,Nh+1)
hhk[:,1] = pl_1
hh_k  = rand([zer_0,pl_1],Nsamples,Nh+1)
hh_k[:,1] = pl_1
size_h = size(hh_k)

for i in 1:2

hh_k = zeros(Int32, size_h); hh_k[1.0./(1.0+exp.(-vv_k*W)) .> rand(size_h)] .= pl_1
vv_k = zeros(Int32, size_v); vv_k[1.0./(1.0+exp.(-hh_k*Wt)) .> rand(size_v)] .= pl_1

end

end;

W = rand(301,901)

tic()
TEST_ais(W,1024)
toc()

tic()
TEST_ais(W,1024)
toc()
``````

Python:

``````import numpy as np
import sys, time

class TEST(object):

def __init__(self):

return

def TEST_ais(self,W,Nsamples):

pl_1  = 1
zer_0 = 0

log2  = np.log(2.0)

Nv = W.shape[0] - 1
Nh = W.shape[1] - 1

Wt = W.T

vvk   = np.random.randint(zer_0,pl_1+1,[Nsamples,Nv+1])
vvk[:,1]  = pl_1
vv_k  = np.random.randint(zer_0,pl_1+1,[Nsamples,Nv+1])
vv_k[:,1] = pl_1
size_v = vv_k.shape

hhk   = np.random.randint(zer_0,pl_1+1,[Nsamples,Nh+1])
hhk[:,1] = pl_1
hh_k  = np.random.randint(zer_0,pl_1+1,[Nsamples,Nh+1])
hh_k[:,1] = pl_1
size_h = hh_k.shape

z1      = np.zeros(hhk.shape)
z2      = np.zeros(vvk.shape)

for i in range(5):

z1 = 1/(1+np.exp(-np.dot(vv_k,W)));
hh_k = np.zeros(size_h); hh_k[z1 > np.random.rand(size_h[0],size_h[1])] = pl_1

z2 = 1/(1+np.exp(-np.dot(hh_k,Wt)));
vv_k = np.zeros(size_v); vv_k[z2 > np.random.rand(size_v[0],size_v[1])] = pl_1

if __name__ == '__main__':
W = np.random.rand(301,901)
t = TEST()
itime = time.time()
t.TEST_ais(W,1024);
print("  took %.1fs" % (time.time()-itime))
sys.stdout.flush()

``````

Now the fact is that, executing the julia code gives

``````julia < TEST.jl
elapsed time: 2.875419733 seconds
elapsed time: 1.55660307 seconds
``````

while the Python version turns out to be faster

``````python TEST.py
took 2.0s
``````

…and this is with the default python that comes with my ubuntu machine. Running the same code using the Anaconda python blows my mind up, giving

``````> python TEST.py
took 0.3s
``````

I am quite surprised about the difference between the two python’s, but most horrified by the fact that the Anaconda python version is about 5.5 times faster than the Julia code

So I am sure I’m not optimizing my julia code, despite how simple it is. Can somebody here provide a hint about how to make the julia code as fast as the Anaconda one?

Ferran.

I haven’t run it, but the `Int32` stuff looks like an obvious problem. What is this initial line for?

``````    pl_1  = Int32(1)
zer_0 = Int32(0)
pl_1  = 1
zer_0 = 0
``````

This will introduce a type instability since `pl_1 = 1` changes the type of `pl_1` from `Int32` to `Int64`. See https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-changing-the-type-of-a-variable-1.

Try just removing the first two lines. There’s no reason for you to use `Int32` anywhere.

Sorry the Int32 lines were left from multiple tries, and got there abandoned in the snipped part of the code pasted here.
The performance times posted were obtained without these lines…

Sorry for the mistake. The query still holds…

Here’s my attempt, I removed the Int32 business, and devectorized the part in the loop:

``````
function TEST_ais(W,Nsamples)

pl_1  = 1.0
zer_0 = 0.0

log2  = log(2.0)
Nv, Nh = size(W) .- 1
Wt = transpose(W)

vvk   = rand([zer_0,pl_1],Nsamples,Nv+1)
vvk[:,1]  = pl_1
vv_k  = rand([zer_0,pl_1],Nsamples,Nv+1)
vv_k[:,1] = pl_1
size_v = size(vv_k)

hhk   = rand([zer_0,pl_1],Nsamples,Nh+1)
hhk[:,1] = pl_1
hh_k  = rand([zer_0,pl_1],Nsamples,Nh+1)
hh_k[:,1] = pl_1
size_h = size(hh_k)

t = x-> 1.0/(1.0+exp(-x))

@inbounds for i in 1:2

hh_k = zeros(size_h)
hh_sel = vv_k*W
for k in eachindex(hh_sel)
hh_k[k] = t(hh_sel[k]) > rand() ? pl_1 : 0.0
end

vv_k = zeros(size_v)
vv_sel = hh_k*Wt
for k in eachindex(vv_sel)
vv_k[k] = t(vv_sel[k]) > rand() ? pl_1 : 0.0
end
end
end

W = rand(301,901)

TEST_ais(W,1024)
@time TEST_ais(W,1024)
``````

The first time given by Julia includes compile time. You should use BenchmarkTools.jl. Granted, I’m still not sure why the Anaconda version is so much faster. At first glance it looks suspiciously like something is being elided, but I’m not sure how that can happen in Python.

1 Like

Again, please don’t use `@time`, this will measure compile time.

Yes thanks, I did all that… Julia run times (using time) are the ones reported after several attempts, as I’m aware of the compilation implied I. The first run…

I’m sure it is something about the code itself. I don’t want to start believing Anaconda python is so much faster than Julia… Or should I?

Best regards and thanks again,

Ferran.

On my computer it only takes around 150ms (I’m on 1.0.1):

``````julia> function TEST_ais(W,Nsamples)

pl_1  = 1.0
zer_0 = 0.0

log2  = log(2.0)
Nv, Nh = size(W) .- 1
Wt = transpose(W)

vvk   = rand([zer_0,pl_1],Nsamples,Nv+1)
vvk[:,1]  .= pl_1
vv_k  = rand([zer_0,pl_1],Nsamples,Nv+1)
vv_k[:,1] .= pl_1
size_v = size(vv_k)

hhk   = rand([zer_0,pl_1],Nsamples,Nh+1)
hhk[:,1] .= pl_1
hh_k  = rand([zer_0,pl_1],Nsamples,Nh+1)
hh_k[:,1] .= pl_1
size_h = size(hh_k)

t = x-> 1.0/(1.0+exp(-x))

@inbounds for i in 1:2

hh_k = zeros(size_h)
hh_sel = vv_k*W
for k in eachindex(hh_sel)
hh_k[k] = t(hh_sel[k]) > rand() ? pl_1 : 0.0
end

vv_k = zeros(size_v)
vv_sel = hh_k*Wt
for k in eachindex(vv_sel)
vv_k[k] = t(vv_sel[k]) > rand() ? pl_1 : 0.0
end
end
end
TEST_ais (generic function with 1 method)

julia> W = rand(301,901);

julia> @btime TEST_ais(W,1024)
149.262 ms (28 allocations: 56.35 MiB)
``````

As long as you time the 2nd run, I don’t think `@time` will be a problem. (though definitely `@benchmark` from BenchmarkTools.jl will give you much better and more useful data).

In the past I’ve run into an issue where performance wasn’t very good when indexing by arrays of booleans, which is a common pattern in languages like Python and Matlab where `for` loops are slow. I’m not sure if performance has improved there, but if not they could be tripping you up.

@Ferran_Mazzanti did you try @jonathanBieler’s or @laborg’s versions?

Writing super performant code is not in general easy in any language. Julia gives you all sorts of tools for doing it, but it’s not a magic bullet.

I also strongly suggest moving to 1.0.1, there have been a ton of changes and optimizations between 0.6.4 and 1.0.1. If you still get legitimately slower times than Python in 1.0.1, ultimately it will mean there is a bug that we need to fix. (Looks like @laborg has already gotten it going quite fast.)

(Sorry, I still haven’t had a moment to mess with this myself.)

I just fixed the broadcast syntax where necessary and haven’t tried to speed it up. Before optimizing it would be good to clean up the function (e.g.`vvk` is just written but never read) and have it return something at all.

We just improved logical indexing performance by as much as 10x on master. I’ve not reviewed this example closely but if that’s a bottleneck it might be interesting to try master.

4 Likes

Hola Ferran :),

Let another friend of yours write here a silly monte carlo simulation where Julia is 100x faster than “the same code” in Python. To be fair the version is 10x faster than “a sensible Python vectorized version” of the code (that anyone used to code in Python/Numpy would do).

Programming in the exact same way than in Python can be a problem. This was also a headache for me when I started using Julia because I used to create a lot of big matrices and do stuff column/row -wise without actually even considering If I really needed to create those matrices. It just felt a natural thing to do to create arrays.

Looking backwards now I see a lot of bottlenecks in my Python code that were simply too painful to address in Python that now I can easily improve in Julia. I hope that the code that I include in this post, `compute_pi.py` and `compute_pi_2.py` bring some light on this issue.

There are several “introduction to Julia” videos but I haven’t seen any youtube tutorial where someone goes through several examples explaining how to adapt your code from numpy like sintax to Julia.

If I have time I might do one tutorial that dives into this issues. I feel it would be very beneficial to a lot of scientists that want code that runs fast but have a difficult time translating code from other languages.

#### Be careful with misleading translations

Line by line translations from a programming language to another can be misleading. It’s specially easy to miss something when the same syntax in valid in 2 languages but the underlying action ends up beeing different.

For example if `X` is a Numpy array the code `X[1:10,:]` should be translated in julia as `view(X,1:10,:)` because `X[1:10,:]` in julia creates a copy and in numpy creates a view. Small details like this can hurt a lot performance. The cool thing is that julia exposes all this to the programmer without actually any need to cythonize your code.

#### Some tips that were helpful to me

It is quite common to have problems where we know that some matrices will change the content but not the shape (parameters in machine learning models for example). If a function is called many times and inside the function you have one `rand(N,M)` (or several), probably you can exploit this and, as long as N and M are fixed throughout the execution of the program, you could prealocate `X=rand(N,M)` and then do `rand!(X)` passing `X` as argument to your function.

When you end up having several arrays that you want to refill inside your function it can also be helpful to group them in a `struct`. Something like `matrix_placeholders` where you can save all the “state space” of your problem and reuse it.

#### An illustrative example

Here I can write an example where julia is 10x faster than numpy “vectorized like implementation”. In this case, Julia is faster than numpy even though julia is using a single thread (and numpy using 8 threads al at max!).

Notice that the python version is almost 100x slower than the “translated line by line from julia” version.

In this case `compute_pi_2.py` felt the natural thing to do in numpy. If I want to throw a lot of darts into a square and count how many they end up inside/outside a circle I can create a big matrix simulating all darts. Then check for every row if the dart ends up inside or outside the circle. Since a for loop is a big bottleneck in python it’s sensible to avoid it generating a big matrix. Nevertheless the big matrix was never actually needed and it takes a lot of time (and memory) to create matrices. Usually creating and destroying matrices can be more expensive than actually computing stuff with the matrices.

• `compute_pi.py` is a “direct” translation from `compute_pi.jl`.

• `compute_pi_2.py` is a “vectorized” translation of `compute_pi.jl` where the for loop was eliminated.

• `compute_pi.py` takes 38 sec (uses a single thread)

• `compute_pi_2.py` takes 4.8 sec (uses all threads)

• `compute_pi.jl` takes 0.4 sec (uses a single thread)

#### compute_pi.py

``````import time
import random

def compute_pi(N):
"""
Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
Returns estimate of pi
"""
n_landed_in_circle = 0  # counts number of points that have radial coordinate < 1, i.e. in circle
for i in range(N):
x = random.random() * 2 - 1  # uniformly distributed number on x-axis
y = random.random() * 2 - 1  # uniformly distributed number on y-axis
if r2 < 1.0:
n_landed_in_circle += 1.

return (n_landed_in_circle / N) * 4.0

N= 100_000_000
t0 = time.time()
print("Start Computing")
pi = compute_pi(N)
print("Approximate pi: {}".format(pi))
print( "Total time: {} seconds".format(abs(t0- time.time())))
``````

Start Computing
Approximate pi: 3.14216348
Total time: 38.98658728599548 seconds

#### compute_pi_2.py

``````import time
import random
import numpy as np

def compute_pi(N):
"""
Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
Returns estimate of pi
"""
n_landed_in_circle = 0  # counts number of points that have radial coordinate < 1, i.e. in circle

# no for loop because you know python does not like for loops
p = np.random.rand(N,2) * 2 - 1
r2 = p[:,0]* p[:,0] + p[:,1]*p[:,1]  # radius squared, in radial coordinates
idx = r2 < 1
return (np.sum(idx) / N) * 4.0

N= 100_000_000
t0 = time.time()
print("Start Computing")
pi = compute_pi(N)
print("Approximate pi: {}".format(pi))
print( "Total time: {} seconds".format(abs(t0- time.time())))
``````

Start Computing
Approximate pi: 3.14153556
Total time: 4.595675945281982 seconds

#### compute_pi.jl

``````function compute_pi(N::Int)
"""
Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
Returns estimate of pi
"""
n_landed_in_circle = 0  # counts number of points that have radial coordinate < 1, i.e. in circle
for i = 1:N
x = rand() * 2 - 1  # uniformly distributed number on x-axis
y = rand() * 2 - 1  # uniformly distributed number on y-axis

if r2 < 1.0
n_landed_in_circle += 1
end
end

return n_landed_in_circle / N * 4.0
end

compute_pi(2)
N= 100_000_000
t0 = time()
println("Start Computing")
aprox = compute_pi(N)
println("Approximate pi: ", aprox)
println("Total time: ", abs(time()-t0), " seconds")
``````

Start Computing
Approximate pi: π = 3.1415…
Total time: 0.4338340759277344 seconds

8 Likes

Anaconda´s python is shipped with MKL (https://github.com/conda/conda/issues/2032)
So vectorized code can be send straight to very efficient routines.

Well on latest master its a little bit slower:

``````julia> @btime TEST_ais(W,1024)
175.698 ms (28 allocations: 56.35 MiB)
``````

I’m wondering if this should be registered as regression issue on github.

Your function is a bit difficult to optimize, because it’s hard to understand what it’s supposed to be doing, and it has a lot of ‘dead code’.

For example, `vvk` and `hhk` are initialized to random matrices, and the first column mutated, but then they are abandoned and never referenced again. Can we just remove that? `hh_k` is also randomized, but then later replaced with a zero matrix. Can we remove the pointless stuff?

Also, your function returns no values and does not modify anything that is visible outside the function. We could just replace your entire function with this:

``````function TEST_ais(W, Nsamples)
return nothing
end
``````

That would be fast.

How can your function be optimized? What do you want it to do?

4 Likes

Here is a function that does something similar to what your code does. It’s somewhat fast:

``````function TEST_ais(W::AbstractMatrix{T}, Nsamples) where {T<:Real}

pl_1  = 1
zer_0 = 0

Nv, Nh = size(W) .- 1

Wt = transpose(W)

vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
vv_k[:,1] .= pl_1

hh_k = Matrix{T}(undef, Nsamples, Nh+1)
vv_k = Matrix{T}(undef, Nsamples, Nv+1)

for i in 1:2
A = -vv_k * W
hh_k .= 0
hh_k[1 ./ (1 .+ exp.(A)) .> rand.()] .= pl_1

B = -hh_k * Wt
vv_k .= 0
vv_k[1 ./ (1 .+ exp.(B)) .> rand.()] .= pl_1
end

end
``````

In general, though, vectorized array math is fast in Numpy. It’s written in C, after all. You shouldn’t expect Julia to always be faster than that. Same thing with Matlab. It is already highly optimized (and multi-threaded!) in those languages. You should expect Julia to more-or-less match those languages in array math, but otherwise beat them.

3 Likes

Thanks guys for all the comments and suggestions… I have to find a bit of time to swallow all that and see where is the bottleneck in my code, comparing to your solutions. Still I can answer a few of the questions exposed: 1) the code I posted has useless declarations and returns nothing because the original function was way more complicated, and I stripped down parts to reduce it to something simple were timming problems are evident in the comparison 2) there is no need for a function to return anything if you are only timming the calculation performance 3) I wrote vectorized code because, in the past, we were told to do so and use the . syntax as one of the strengths of Julia (hope this has not changed, but I’ve seen so many things chnaging here that I can’t tell what’s the status now) 4) I do not expect Julia to best numpy on MKL libraries, but to perform similarly at least. My Julia code was timming about 2secs, the Anaconda translation took 0.2secs… I don’t think this is a similar performance (at least not good enough for me).

But since I have to check and learn from your solutions, the only thing I can tell you now is: thanks for your help

I will check the various versions in the next days. Still, if I have to do much complicated work to reach python’s performance, it may not be worth the effort. In general I think programming in Julia is becoming more and more complicated with each new version… I don’t know.

Best regards,

Ferran

MKL is multithreaded …There’s no way serial julia will best that for array ops.

A fair comparison would be using multithreaded cpu array @sdanisch Is there an implementation under GPU array as I recall?

And of course, it should be possible to compile Julia with MKL.

FWIW, I took the function definition from the very first post and benchmarked it without any changes with

• Julia + OpenBlas
• Julia + MKL
• Julia + MKL (with 4 threads)

(Julia is always v0.6.4)

The performance was almost identical, always around ~1.8 s on my machine.