Quite bad performance of Julia 0.6.4 vs Python+Numpy

Dear all,

I guess I could use some good help from the experts here. The fact is that I was about to translate a piece of Monte Carlo code from my old 0.6.4 program library to Julia 1.0 (going through 0.7 first), when a friend of mine tried essentially the same code under Anaconda Python on the same computer. I was so disappointed by the results of the test runs that I’m wondering what could be wrong with my code.

I post here the Julia and the Python codes:

Julia 0.6.4:

function TEST_ais(W,Nsamples)

    pl_1  = Int32(1)
    zer_0 = Int32(0) 
    pl_1  = 1
    zer_0 = 0  
    
    log2  = log(2.0)

    Nv, Nh = size(W) .- 1

    Wt = transpose(W)
   
    vvk   = rand([zer_0,pl_1],Nsamples,Nv+1)
    vvk[:,1]  = pl_1 
    vv_k  = rand([zer_0,pl_1],Nsamples,Nv+1)
    vv_k[:,1] = pl_1
    size_v = size(vv_k)
    
    hhk   = rand([zer_0,pl_1],Nsamples,Nh+1)
    hhk[:,1] = pl_1
    hh_k  = rand([zer_0,pl_1],Nsamples,Nh+1)
    hh_k[:,1] = pl_1
    size_h = size(hh_k)
   
    for i in 1:2
    
        hh_k = zeros(Int32, size_h); hh_k[1.0./(1.0+exp.(-vv_k*W)) .> rand(size_h)] .= pl_1
        vv_k = zeros(Int32, size_v); vv_k[1.0./(1.0+exp.(-hh_k*Wt)) .> rand(size_v)] .= pl_1
        
    end
        
end;

W = rand(301,901)

tic()
TEST_ais(W,1024)
toc()

tic()
TEST_ais(W,1024)
toc()

Python:

import numpy as np
import sys, time

class TEST(object):

    def __init__(self):

        return

    def TEST_ais(self,W,Nsamples):

        pl_1  = 1
        zer_0 = 0
    
        log2  = np.log(2.0)

        Nv = W.shape[0] - 1
        Nh = W.shape[1] - 1
    
        Wt = W.T
    
        vvk   = np.random.randint(zer_0,pl_1+1,[Nsamples,Nv+1])
        vvk[:,1]  = pl_1 
        vv_k  = np.random.randint(zer_0,pl_1+1,[Nsamples,Nv+1])
        vv_k[:,1] = pl_1
        size_v = vv_k.shape
    
        hhk   = np.random.randint(zer_0,pl_1+1,[Nsamples,Nh+1])
        hhk[:,1] = pl_1
        hh_k  = np.random.randint(zer_0,pl_1+1,[Nsamples,Nh+1])
        hh_k[:,1] = pl_1
        size_h = hh_k.shape

        z1      = np.zeros(hhk.shape)
        z2      = np.zeros(vvk.shape)
    
        for i in range(5):
    
            z1 = 1/(1+np.exp(-np.dot(vv_k,W)));
            hh_k = np.zeros(size_h); hh_k[z1 > np.random.rand(size_h[0],size_h[1])] = pl_1

            z2 = 1/(1+np.exp(-np.dot(hh_k,Wt)));
            vv_k = np.zeros(size_v); vv_k[z2 > np.random.rand(size_v[0],size_v[1])] = pl_1
                
if __name__ == '__main__':
    W = np.random.rand(301,901)
    t = TEST()
    itime = time.time()
    t.TEST_ais(W,1024);
    print("  took %.1fs" % (time.time()-itime))
    sys.stdout.flush()

Now the fact is that executing the julia code gives

julia < TEST.jl 
elapsed time: 2.875419733 seconds
elapsed time: 1.55660307 seconds

while the Python version turns out to be faster

python TEST.py 
  took 2.0s

…and this is with the default Python that comes with my Ubuntu machine. Running the same code under Anaconda Python blows my mind, giving

> python TEST.py
  took 0.3s

I am quite surprised by the difference between the two Pythons, but most horrified by the fact that the Anaconda Python version is about 5.5 times faster than the Julia code :frowning:

So I am sure I’m not optimizing my julia code, despite how simple it is. Can somebody here provide a hint about how to make the julia code as fast as the Anaconda one?

Thanks in advance for your kind help,

Ferran.

I haven’t run it, but the Int32 stuff looks like an obvious problem. What are these initial lines for?

    pl_1  = Int32(1)
    zer_0 = Int32(0) 
    pl_1  = 1
    zer_0 = 0  

This will introduce a type instability since pl_1 = 1 changes the type of pl_1 from Int32 to Int64. See Performance Tips · The Julia Language.

Try just removing the first two lines. There’s no reason for you to use Int32 anywhere.
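As a minimal sketch of the kind of instability meant here (a hypothetical function, not taken from your code), a variable whose integer width changes mid-function forces the compiler to track a Union type:

    # s starts as Int32, but Int32 + Int64 promotes to Int64,
    # so s flips type on the first iteration
    function sum_to(n)
        s = Int32(0)
        for i in 1:n
            s += i
        end
        return s
    end

    # @code_warntype sum_to(10) highlights s::Union{Int32, Int64} in the body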

Sorry, the Int32 lines were leftovers from multiple tries that ended up abandoned in the snippet pasted here.
The performance times posted were obtained without those lines…

Sorry for the mistake. The query still holds…

…and thanks for your comment :frowning:

Here’s my attempt, I removed the Int32 business, and devectorized the part in the loop:


function TEST_ais(W,Nsamples)

    pl_1  = 1.0
    zer_0 = 0.0
    
    log2  = log(2.0)
    Nv, Nh = size(W) .- 1
    Wt = transpose(W)
   
    vvk   = rand([zer_0,pl_1],Nsamples,Nv+1)
    vvk[:,1]  = pl_1 
    vv_k  = rand([zer_0,pl_1],Nsamples,Nv+1)
    vv_k[:,1] = pl_1
    size_v = size(vv_k)
    
    hhk   = rand([zer_0,pl_1],Nsamples,Nh+1)
    hhk[:,1] = pl_1
    hh_k  = rand([zer_0,pl_1],Nsamples,Nh+1)
    hh_k[:,1] = pl_1
    size_h = size(hh_k)
   
    t = x-> 1.0/(1.0+exp(-x))

    @inbounds for i in 1:2
    
        hh_k = zeros(size_h)
        hh_sel = vv_k*W
        for k in eachindex(hh_sel)
            hh_k[k] = t(hh_sel[k]) > rand() ? pl_1 : 0.0
        end

        vv_k = zeros(size_v)
        vv_sel = hh_k*Wt        
        for k in eachindex(vv_sel)
            vv_k[k] = t(vv_sel[k]) > rand() ? pl_1 : 0.0
        end
    end
end

W = rand(301,901)

TEST_ais(W,1024)
@time TEST_ais(W,1024)

The first time given by Julia includes compile time. You should use BenchmarkTools.jl. Granted, I’m still not sure why the Anaconda version is so much faster. At first glance it looks suspiciously like something is being elided, but I’m not sure how that can happen in Python.


Again, please don’t use @time; it will measure compile time.

Yes thanks, I did all that… the Julia run times are the ones reported after several attempts, as I’m aware of the compilation implied in the first run…

I’m sure it is something about the code itself. I don’t want to start believing Anaconda python is so much faster than Julia… Or should I?

Best regards and thanks again,

Ferran.

On my computer it only takes around 150ms (I’m on 1.0.1):

julia> function TEST_ais(W,Nsamples)

           pl_1  = 1.0
           zer_0 = 0.0

           log2  = log(2.0)
           Nv, Nh = size(W) .- 1
           Wt = transpose(W)

           vvk   = rand([zer_0,pl_1],Nsamples,Nv+1)
           vvk[:,1]  .= pl_1
           vv_k  = rand([zer_0,pl_1],Nsamples,Nv+1)
           vv_k[:,1] .= pl_1
           size_v = size(vv_k)

           hhk   = rand([zer_0,pl_1],Nsamples,Nh+1)
           hhk[:,1] .= pl_1
           hh_k  = rand([zer_0,pl_1],Nsamples,Nh+1)
           hh_k[:,1] .= pl_1
           size_h = size(hh_k)

           t = x-> 1.0/(1.0+exp(-x))

           @inbounds for i in 1:2

               hh_k = zeros(size_h)
               hh_sel = vv_k*W
               for k in eachindex(hh_sel)
                   hh_k[k] = t(hh_sel[k]) > rand() ? pl_1 : 0.0
               end

               vv_k = zeros(size_v)
               vv_sel = hh_k*Wt
               for k in eachindex(vv_sel)
                   vv_k[k] = t(vv_sel[k]) > rand() ? pl_1 : 0.0
               end
           end
       end
TEST_ais (generic function with 1 method)

julia> W = rand(301,901);

julia> @btime TEST_ais(W,1024)
  149.262 ms (28 allocations: 56.35 MiB)

As long as you time the 2nd run, I don’t think @time will be a problem. (though definitely @benchmark from BenchmarkTools.jl will give you much better and more useful data).
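For reference, a typical BenchmarkTools invocation looks like this (note the $ interpolation of the global W, so the timing isn’t skewed by the untyped global):

using BenchmarkTools

W = rand(301, 901)
@btime TEST_ais($W, 1024)       # reports the minimum time and allocations
@benchmark TEST_ais($W, 1024)   # full statistics: time distribution, memory, samples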

In the past I’ve run into an issue where performance wasn’t very good when indexing by arrays of booleans, which is a common pattern in languages like Python and Matlab where for loops are slow. I’m not sure if performance has improved there, but if not, that could be tripping you up.

@Ferran_Mazzanti did you try @jonathanBieler’s or @laborg’s versions?

Writing super performant code is not in general easy in any language. Julia gives you all sorts of tools for doing it, but it’s not a magic bullet.

I also strongly suggest moving to 1.0.1, there have been a ton of changes and optimizations between 0.6.4 and 1.0.1. If you still get legitimately slower times than Python in 1.0.1, ultimately it will mean there is a bug that we need to fix. (Looks like @laborg has already gotten it going quite fast.)

(Sorry, I still haven’t had a moment to mess with this myself.)

I just fixed the broadcast syntax where necessary and haven’t tried to speed it up. Before optimizing, it would be good to clean up the function (e.g. vvk is just written but never read) and have it return something at all.

We just improved logical indexing performance by as much as 10x on master. I’ve not reviewed this example closely but if that’s a bottleneck it might be interesting to try master.
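For context, the pattern in question is masked assignment like this (a generic sketch, not the thread’s exact code):

A = rand(1024, 301)
mask = A .> 0.5      # elementwise comparison produces a BitMatrix
B = zeros(size(A))
B[mask] .= 1.0       # logical indexing: the operation whose performance improved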


Hola Ferran :),

Let another friend of yours write here a silly Monte Carlo simulation where Julia is 100x faster than “the same code” in Python. To be fair, the Julia version is 10x faster than “a sensible vectorized Python version” of the code (the one anyone used to coding in Python/Numpy would write).

Programming in exactly the same way as in Python can be a problem. This was also a headache for me when I started using Julia, because I used to create a lot of big matrices and do stuff column/row-wise without even considering whether I really needed to create those matrices. Creating arrays just felt like the natural thing to do.

Looking back now, I see a lot of bottlenecks in my Python code that were simply too painful to address in Python and that I can now easily improve in Julia. I hope the code included in this post, compute_pi.py and compute_pi_2.py, brings some light to this issue.

There are several “introduction to Julia” videos, but I haven’t seen any YouTube tutorial where someone goes through several examples explaining how to adapt your code from numpy-like syntax to Julia.

If I have time I might make a tutorial that dives into these issues. I feel it would be very beneficial for a lot of scientists who want code that runs fast but have a hard time translating code from other languages.

Be careful with misleading translations

Line-by-line translations from one programming language to another can be misleading. It’s especially easy to miss something when the same syntax is valid in two languages but the underlying action ends up being different.

For example, if X is a numpy array, the code X[1:10,:] should be translated into Julia as view(X,1:10,:), because X[1:10,:] creates a copy in Julia but a view in numpy. Small details like this can hurt performance a lot. The cool thing is that Julia exposes all of this to the programmer, without any need to cythonize your code.
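A small sketch of the difference:

X = rand(100, 100)

A = X[1:10, :]          # Julia: allocates a fresh 10×100 matrix (a copy)
B = view(X, 1:10, :)    # no copy: a lightweight wrapper over X's memory
C = @views X[1:10, :]   # same as B, with the familiar indexing syntax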

Some tips that were helpful to me

It is quite common to have problems where we know that some matrices will change their contents but not their shape (parameters in machine learning models, for example). If a function is called many times and contains a rand(N,M) (or several), you can probably exploit this: as long as N and M are fixed throughout the execution of the program, preallocate X = rand(N,M) once and then call rand!(X), passing X as an argument to your function.
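A minimal sketch of that pattern (mc_step! is a made-up stand-in for the real function):

using Random   # rand! lives in the Random stdlib on 0.7/1.0

function mc_step!(X)
    rand!(X)          # refill X in place, no new allocation
    return sum(X)     # stand-in for the actual per-step work
end

X = rand(1024, 301)   # allocated once, outside the hot loop
for step in 1:100
    mc_step!(X)       # reuses the same buffer every call
end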

When you end up with several arrays that you want to refill inside your function, it can also be helpful to group them in a struct, something like matrix_placeholders, where you can keep the whole “state space” of your problem and reuse it.
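A sketch of that idea (Workspace and sweep! are hypothetical names):

using Random

# container grouping the reusable buffers (the "state space" of the problem)
struct Workspace
    vv::Matrix{Float64}
    hh::Matrix{Float64}
end
Workspace(Nsamples, Nv, Nh) = Workspace(zeros(Nsamples, Nv + 1), zeros(Nsamples, Nh + 1))

function sweep!(ws::Workspace)
    rand!(ws.vv)   # refill both buffers in place on every call
    rand!(ws.hh)
end

ws = Workspace(1024, 300, 900)
sweep!(ws)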

An illustrative example

Here I can show an example where Julia is 10x faster than a “vectorized-like” numpy implementation. In this case Julia is faster than numpy even though Julia uses a single thread (and numpy uses 8 threads, all maxed out!).

Notice that the Python version “translated line by line from Julia” is almost 100x slower than the Julia version.

In this case compute_pi_2.py felt like the natural thing to do in numpy. If I want to throw a lot of darts at a square and count how many land inside the circle, I can create one big matrix simulating all the darts and then check for every row whether the dart lands inside or outside the circle. Since a for loop is a big bottleneck in Python, it seems sensible to avoid it by generating a big matrix. Nevertheless, the big matrix was never actually needed, and it takes a lot of time (and memory) to create matrices. Creating and destroying matrices can often be more expensive than actually computing with them.

Comments

  • compute_pi.py is a “direct” translation from compute_pi.jl.

  • compute_pi_2.py is a “vectorized” translation of compute_pi.jl where the for loop was eliminated.

  • compute_pi.py takes 38 sec (uses a single thread)

  • compute_pi_2.py takes 4.8 sec (uses all threads)

  • compute_pi.jl takes 0.4 sec (uses a single thread)

compute_pi.py

import time
import random

def compute_pi(N):
    """
    Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
    Returns estimate of pi
    """
    n_landed_in_circle = 0  # counts number of points that have radial coordinate < 1, i.e. in circle
    for i in range(N):
        x = random.random() * 2 - 1  # uniformly distributed number on x-axis
        y = random.random() * 2 - 1  # uniformly distributed number on y-axis
        r2 = x*x + y*y  # radius squared, in radial coordinates
        if r2 < 1.0:
            n_landed_in_circle += 1.

    return (n_landed_in_circle / N) * 4.0

N= 100_000_000
t0 = time.time()
print("Start Computing")
pi = compute_pi(N)
print("Approximate pi: {}".format(pi))
print( "Total time: {} seconds".format(abs(t0- time.time()))) 

Start Computing
Approximate pi: 3.14216348
Total time: 38.98658728599548 seconds

compute_pi_2.py

import time
import random
import numpy as np

def compute_pi(N):
    """
    Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
    Returns estimate of pi
    """
    n_landed_in_circle = 0  # counts number of points that have radial coordinate < 1, i.e. in circle

    # no for loop because you know python does not like for loops
    p = np.random.rand(N,2) * 2 - 1 
    r2 = p[:,0]* p[:,0] + p[:,1]*p[:,1]  # radius squared, in radial coordinates
    idx = r2 < 1
    return (np.sum(idx) / N) * 4.0

N= 100_000_000
t0 = time.time()
print("Start Computing")
pi = compute_pi(N)
print("Approximate pi: {}".format(pi))
print( "Total time: {} seconds".format(abs(t0- time.time()))) 

Start Computing
Approximate pi: 3.14153556
Total time: 4.595675945281982 seconds

compute_pi.jl

function compute_pi(N::Int)
    """
    Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
    Returns estimate of pi
    """
    n_landed_in_circle = 0  # counts number of points that have radial coordinate < 1, i.e. in circle
    for i = 1:N
        x = rand() * 2 - 1  # uniformly distributed number on x-axis
        y = rand() * 2 - 1  # uniformly distributed number on y-axis

        r2 = x*x + y*y  # radius squared, in radial coordinates
        if r2 < 1.0
            n_landed_in_circle += 1
        end
    end

    return n_landed_in_circle / N * 4.0    
end

compute_pi(2)
N= 100_000_000
t0 = time()
println("Start Computing")
aprox = compute_pi(N)
println("Approximate pi: ", aprox)
println("Total time: ", abs(time()-t0), " seconds")

Start Computing
Approximate pi: 3.1415…
Total time: 0.4338340759277344 seconds


About Anaconda being faster…
Anaconda’s Python ships with MKL (https://github.com/conda/conda/issues/2032).
So vectorized code can be sent straight to very efficient routines.
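You can check which BLAS your Julia is linked against; a stock build reports OpenBLAS:

julia> using LinearAlgebra   # on 0.6 this lives in Base.LinAlg instead

julia> BLAS.vendor()
:openblas                    # Anaconda's numpy would typically report MKL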

Well, on latest master it’s a little bit slower:

julia> @btime TEST_ais(W,1024)
  175.698 ms (28 allocations: 56.35 MiB)

I’m wondering if this should be registered as a regression issue on GitHub.

Your function is a bit difficult to optimize, because it’s hard to understand what it’s supposed to be doing, and it has a lot of ‘dead code’.

For example, vvk and hhk are initialized to random matrices, and the first column mutated, but then they are abandoned and never referenced again. Can we just remove that? hh_k is also randomized, but then later replaced with a zero matrix. Can we remove the pointless stuff?

Also, your function returns no values and does not modify anything that is visible outside the function. We could just replace your entire function with this:

function TEST_ais(W, Nsamples)
    return nothing
end

That would be fast.

How can your function be optimized? What do you want it to do?


Here is a function that does something similar to what your code does. It’s somewhat fast:

function TEST_ais(W::AbstractMatrix{T}, Nsamples) where {T<:Real}

    pl_1  = 1
    zer_0 = 0  
    
    Nv, Nh = size(W) .- 1

    Wt = transpose(W)
   
    vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
    vv_k[:,1] .= pl_1
    
    # note: don't also reallocate vv_k as undef here, that would discard the
    # random initialization above before the first vv_k * W product
    hh_k = Matrix{T}(undef, Nsamples, Nh+1)   # overwritten in the loop before it is read
    
    for i in 1:2
        A = -vv_k * W
        hh_k .= 0
        hh_k[1 ./ (1 .+ exp.(A)) .> rand.()] .= pl_1
        
        B = -hh_k * Wt
        vv_k .= 0
        vv_k[1 ./ (1 .+ exp.(B)) .> rand.()] .= pl_1
    end
        
end

In general, though, vectorized array math is fast in Numpy. It’s written in C, after all. You shouldn’t expect Julia to always be faster than that. Same thing with Matlab. It is already highly optimized (and multi-threaded!) in those languages. You should expect Julia to more-or-less match those languages in array math, but otherwise beat them.
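If you want the comparison on the matrix products to be apples-to-apples, note that the BLAS thread count in Julia is adjustable, e.g.:

using LinearAlgebra

BLAS.set_num_threads(4)   # threads used by OpenBLAS for products like vv_k*W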


Thanks guys for all the comments and suggestions… I have to find a bit of time to digest all that and see where the bottleneck in my code is, compared to your solutions. Still, I can answer a few of the questions raised:

1) the code I posted has useless declarations and returns nothing because the original function was way more complicated, and I stripped parts down to reduce it to something simple where the timing problem shows up clearly in the comparison;

2) there is no need for a function to return anything if you are only timing the calculation;

3) I wrote vectorized code because, in the past, we were told to do so and to use the . syntax as one of the strengths of Julia (I hope this has not changed, but I’ve seen so many things changing here that I can’t tell what the status is now);

4) I do not expect Julia to beat numpy backed by MKL, but I do expect it to perform similarly. My Julia code was timing at about 2 secs, while the Anaconda translation took 0.2 secs… I don’t call that similar performance (at least not close enough for me).

But since I have to check and learn from your solutions, the only thing I can tell you now is: thanks for your help :slight_smile:

I will check the various versions in the next days. Still, if I have to do a lot of complicated work just to reach Python’s performance, it may not be worth the effort. In general I think programming in Julia is becoming more and more complicated with each new version… I don’t know.

Best regards,

Ferran

MKL is multithreaded… there’s no way serial Julia will beat that for array ops.

A fair comparison would use a multithreaded CPU array. @sdanisch, there is a GPU array implementation too, if I recall correctly?

And of course, it should be possible to compile Julia with MKL.
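If I remember correctly, that means building from source with something like this in your Make.user (check the build README for your version, as this may have changed):

USE_INTEL_MKL = 1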

FWIW, I took the function definition from the very first post and benchmarked it without any changes with

  • Julia + OpenBlas
  • Julia + MKL
  • Julia + MKL (with 4 threads)

(Julia is always v0.6.4)

The performance was almost identical, always around ~1.8 s on my machine.