Quite bad performance of Julia 0.6.4 vs Python+Numpy

On my computer it only takes around 150ms (I’m on 1.0.1):

julia> function TEST_ais(W,Nsamples)

           pl_1  = 1.0
           zer_0 = 0.0

           log2  = log(2.0)
           Nv, Nh = size(W) .- 1
           Wt = transpose(W)

           vvk   = rand([zer_0,pl_1],Nsamples,Nv+1)
           vvk[:,1]  .= pl_1
           vv_k  = rand([zer_0,pl_1],Nsamples,Nv+1)
           vv_k[:,1] .= pl_1
           size_v = size(vv_k)

           hhk   = rand([zer_0,pl_1],Nsamples,Nh+1)
           hhk[:,1] .= pl_1
           hh_k  = rand([zer_0,pl_1],Nsamples,Nh+1)
           hh_k[:,1] .= pl_1
           size_h = size(hh_k)

           t = x-> 1.0/(1.0+exp(-x))

           @inbounds for i in 1:2

               hh_k = zeros(size_h)
               hh_sel = vv_k*W
               for k in eachindex(hh_sel)
                   hh_k[k] = t(hh_sel[k]) > rand() ? pl_1 : 0.0
               end

               vv_k = zeros(size_v)
               vv_sel = hh_k*Wt
               for k in eachindex(vv_sel)
                   vv_k[k] = t(vv_sel[k]) > rand() ? pl_1 : 0.0
               end
           end
       end
TEST_ais (generic function with 1 method)

julia> W = rand(301,901);

julia> using BenchmarkTools

julia> @btime TEST_ais(W,1024)
  149.262 ms (28 allocations: 56.35 MiB)

As long as you time the 2nd run, I don’t think @time will be a problem (though @benchmark from BenchmarkTools.jl will definitely give you much better and more useful data).
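For reference, a minimal sketch of how that looks (assuming the TEST_ais definition and W from above):

using BenchmarkTools

W = rand(301, 901)

# Interpolating W with $ keeps the global variable from being
# benchmarked as an untyped global.
@benchmark TEST_ais($W, 1024)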

In the past I’ve run into an issue where performance wasn’t very good when indexing by arrays of booleans, which is a common pattern in languages like Python and Matlab, where for loops are slow. I’m not sure whether performance has improved there, but if not, that could be tripping you up.
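To make the pattern concrete, here is a minimal sketch (with made-up arrays) of mask-style assignment versus an explicit loop that never materializes the Bool mask:

x = rand(1000, 1000)
out = zeros(size(x))

# Numpy/Matlab style: build a Bool mask, then assign through it.
out[x .> 0.5] .= 1.0

# Loop style: no intermediate mask array is allocated.
for i in eachindex(x)
    out[i] = x[i] > 0.5 ? 1.0 : 0.0
end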

@Ferran_Mazzanti did you try @jonathanBieler’s or @laborg’s versions?

Writing super performant code is not in general easy in any language. Julia gives you all sorts of tools for doing it, but it’s not a magic bullet.

I also strongly suggest moving to 1.0.1, there have been a ton of changes and optimizations between 0.6.4 and 1.0.1. If you still get legitimately slower times than Python in 1.0.1, ultimately it will mean there is a bug that we need to fix. (Looks like @laborg has already gotten it going quite fast.)

(Sorry, I still haven’t had a moment to mess with this myself.)

I just fixed the broadcast syntax where necessary and haven’t tried to speed it up. Before optimizing, it would be good to clean up the function (e.g. vvk is just written but never read) and have it return something at all.

We just improved logical indexing performance by as much as 10x on master. I’ve not reviewed this example closely but if that’s a bottleneck it might be interesting to try master.


Hi Ferran :),

Let another friend of yours post here a silly Monte Carlo simulation where Julia is 100x faster than “the same code” in Python. To be fair, it is 10x faster than “a sensible vectorized Python version” of the code (the kind anyone used to coding in Python/Numpy would write).

Programming in exactly the same way as in Python can be a problem. This was also a headache for me when I started using Julia, because I used to create a lot of big matrices and do stuff column/row-wise without ever considering whether I really needed to create those matrices. Creating arrays just felt like the natural thing to do.

Looking back now, I see a lot of bottlenecks in my Python code that were simply too painful to address in Python but that I can now easily improve in Julia. I hope the code I include in this post, compute_pi.py and compute_pi_2.py, sheds some light on this issue.

There are several “introduction to Julia” videos, but I haven’t seen any YouTube tutorial where someone goes through several examples explaining how to adapt your code from Numpy-like syntax to Julia.

If I have time I might make a tutorial that dives into these issues. I feel it would be very beneficial to a lot of scientists who want code that runs fast but have a difficult time translating code from other languages.

Be careful with misleading translations

Line-by-line translations from one programming language to another can be misleading. It’s especially easy to miss something when the same syntax is valid in two languages but the underlying action ends up being different.

For example, if X is a Numpy array, the code X[1:10,:] should be translated into Julia as view(X,1:10,:), because X[1:10,:] in Julia creates a copy while in Numpy it creates a view. Small details like this can hurt performance a lot. The cool thing is that Julia exposes all of this to the programmer, without any need to Cythonize your code.
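A quick sketch of the difference (the function names here are just for illustration):

X = rand(10_000, 100)

sum_copy(X) = sum(X[1:5_000, :])        # the slice allocates a new 5000x100 matrix
sum_view(X) = sum(view(X, 1:5_000, :))  # the view reuses X's memory, like Numpy slicing

sum_copy(X) ≈ sum_view(X)  # same result, very different allocation behavior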

Some tips that were helpful to me

It is quite common to have problems where we know that some matrices will change their contents but not their shape (parameters in machine learning models, for example). If a function is called many times and inside it you have one rand(N,M) (or several), you can probably exploit this: as long as N and M are fixed throughout the execution of the program, you can preallocate X = rand(N,M) and then call rand!(X), passing X as an argument to your function.

When you end up with several arrays that you want to refill inside your function, it can also be helpful to group them in a struct, something like matrix_placeholders, where you keep all the “state space” of your problem and reuse it; a sketch follows.
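A minimal sketch of that idea (the names, like MatrixPlaceholders and step!, are hypothetical):

using Random  # for rand! (on Julia 0.6 rand! lives in Base)

# Group the preallocated fixed-shape buffers in one struct.
struct MatrixPlaceholders
    X::Matrix{Float64}
    Y::Matrix{Float64}
end

MatrixPlaceholders(N::Int, M::Int) = MatrixPlaceholders(rand(N, M), rand(N, M))

function step!(buf::MatrixPlaceholders)
    rand!(buf.X)  # refill in place: no new allocation
    rand!(buf.Y)
    return sum(buf.X) + sum(buf.Y)
end

buf = MatrixPlaceholders(1000, 1000)
step!(buf)  # call as often as you like; the matrices are reused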

An illustrative example

Here I can show an example where Julia is 10x faster than a “vectorized-like” Numpy implementation. In this case Julia is faster than Numpy even though Julia uses a single thread (and Numpy uses all 8 threads at max!).

Notice that the “translated line by line from Julia” Python version is almost 100x slower than the Julia code.

In this case compute_pi_2.py felt like the natural thing to do in Numpy. If I want to throw a lot of darts at a square and count how many end up inside/outside a circle, I can create a big matrix simulating all the darts, then check for every row whether the dart lands inside or outside the circle. Since a for loop is a big bottleneck in Python, it’s sensible to avoid it by generating a big matrix. Nevertheless, the big matrix was never actually needed, and it takes a lot of time (and memory) to create matrices. Creating and destroying matrices can often be more expensive than actually computing with them.

Comments

  • compute_pi.py is a “direct” translation from compute_pi.jl.

  • compute_pi_2.py is a “vectorized” translation of compute_pi.jl where the for loop was eliminated.

  • compute_pi.py takes 38 sec (uses a single thread)

  • compute_pi_2.py takes 4.8 sec (uses all threads)

  • compute_pi.jl takes 0.4 sec (uses a single thread)

compute_pi.py

import time
import random

def compute_pi(N):
    """
    Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
    Returns estimate of pi
    """
    n_landed_in_circle = 0  # counts number of points that have radial coordinate < 1, i.e. in circle
    for i in range(N):
        x = random.random() * 2 - 1  # uniformly distributed number on x-axis
        y = random.random() * 2 - 1  # uniformly distributed number on y-axis
        r2 = x*x + y*y  # radius squared, in radial coordinates
        if r2 < 1.0:
            n_landed_in_circle += 1.

    return (n_landed_in_circle / N) * 4.0

N = 100_000_000
t0 = time.time()
print("Start Computing")
pi = compute_pi(N)
print("Approximate pi: {}".format(pi))
print("Total time: {} seconds".format(time.time() - t0))

Start Computing
Approximate pi: 3.14216348
Total time: 38.98658728599548 seconds

compute_pi_2.py

import time
import numpy as np

def compute_pi(N):
    """
    Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
    Returns estimate of pi
    """
    # no for loop, because you know Python does not like for loops
    p = np.random.rand(N,2) * 2 - 1 
    r2 = p[:,0]* p[:,0] + p[:,1]*p[:,1]  # radius squared, in radial coordinates
    idx = r2 < 1
    return (np.sum(idx) / N) * 4.0

N = 100_000_000
t0 = time.time()
print("Start Computing")
pi = compute_pi(N)
print("Approximate pi: {}".format(pi))
print("Total time: {} seconds".format(time.time() - t0))

Start Computing
Approximate pi: 3.14153556
Total time: 4.595675945281982 seconds

compute_pi.jl

"""
Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2.
Returns an estimate of pi.
"""
function compute_pi(N::Int)
    n_landed_in_circle = 0  # counts number of points that have radial coordinate < 1, i.e. in circle
    for i = 1:N
        x = rand() * 2 - 1  # uniformly distributed number on x-axis
        y = rand() * 2 - 1  # uniformly distributed number on y-axis

        r2 = x*x + y*y  # radius squared, in radial coordinates
        if r2 < 1.0
            n_landed_in_circle += 1
        end
    end

    return n_landed_in_circle / N * 4.0    
end

compute_pi(2)  # warm-up call, so compilation is not included in the timing
N = 100_000_000
t0 = time()
println("Start Computing")
aprox = compute_pi(N)
println("Approximate pi: ", aprox)
println("Total time: ", abs(time()-t0), " seconds")

Start Computing
Approximate pi: 3.1415…
Total time: 0.4338340759277344 seconds


About Anaconda being faster…
Anaconda’s Python ships with MKL (https://github.com/conda/conda/issues/2032),
so vectorized code can be sent straight to very efficient routines.

Well, on the latest master it’s a little bit slower:

julia> @btime TEST_ais(W,1024)
  175.698 ms (28 allocations: 56.35 MiB)

I’m wondering if this should be registered as a regression issue on GitHub.

Your function is a bit difficult to optimize, because it’s hard to understand what it’s supposed to be doing, and it has a lot of ‘dead code’.

For example, vvk and hhk are initialized to random matrices, and the first column mutated, but then they are abandoned and never referenced again. Can we just remove that? hh_k is also randomized, but then later replaced with a zero matrix. Can we remove the pointless stuff?

Also, your function returns no values and does not modify anything that is visible outside the function. We could just replace your entire function with this:

function TEST_ais(W, Nsamples)
    return nothing
end

That would be fast.

How can your function be optimized? What do you want it to do?


Here is a function that does something similar to what your code does. It’s somewhat fast:

function TEST_ais(W::AbstractMatrix{T}, Nsamples) where {T<:Real}

    pl_1  = one(T)
    zer_0 = zero(T)

    Nv, Nh = size(W) .- 1

    Wt = transpose(W)

    # Random initial visible units, with the same element type as W
    # so that the matrix products below dispatch to BLAS.
    vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
    vv_k[:,1] .= pl_1

    hh_k = Matrix{T}(undef, Nsamples, Nh+1)

    for i in 1:2
        A = -vv_k * W
        hh_k .= 0
        hh_k[1 ./ (1 .+ exp.(A)) .> rand.()] .= pl_1

        B = -hh_k * Wt
        vv_k .= 0
        vv_k[1 ./ (1 .+ exp.(B)) .> rand.()] .= pl_1
    end

    return vv_k, hh_k
end

In general, though, vectorized array math is fast in Numpy. It’s written in C, after all. You shouldn’t expect Julia to always be faster than that. Same thing with Matlab. It is already highly optimized (and multi-threaded!) in those languages. You should expect Julia to more-or-less match those languages in array math, but otherwise beat them.


Thanks, guys, for all the comments and suggestions… I have to find a bit of time to digest all that and see where the bottleneck in my code is, compared to your solutions. Still, I can answer a few of the questions raised:

1) The code I posted has useless declarations and returns nothing because the original function was way more complicated; I stripped parts out to reduce it to something simple where the timing problems are evident in the comparison.

2) There is no need for a function to return anything if you are only timing the calculation performance.

3) I wrote vectorized code because, in the past, we were told to do so and to use the . syntax as one of the strengths of Julia (I hope this has not changed, but I’ve seen so many things changing here that I can’t tell what the status is now).

4) I do not expect Julia to beat Numpy on MKL libraries, but to perform at least similarly. My Julia code was timing at about 2 s; the Anaconda translation took 0.2 s… I don’t think that is similar performance (at least not good enough for me).

But since I have to check and learn from your solutions, the only thing I can tell you now is: thanks for your help :slight_smile:

I will check the various versions in the next few days. Still, if I have to do much more complicated work to reach Python’s performance, it may not be worth the effort. In general I think programming in Julia is becoming more and more complicated with each new version… I don’t know.

Best regards,

Ferran

MKL is multithreaded… there’s no way serial Julia will beat that for array ops.

A fair comparison would use multithreaded CPU arrays. @sdanisch, is there an implementation for that under GPU arrays, as I recall?

And of course, it should be possible to compile Julia with MKL.

FWIW, I took the function definition from the very first post and benchmarked it without any changes with

  • Julia + OpenBlas
  • Julia + MKL
  • Julia + MKL (with 4 threads)

(Julia is always v0.6.4)

The performance was almost identical, always around ~1.8 s on my machine.

This has not been my experience, quite honestly. With each version I feel everything has grown more consistent, streamlined, performant, and automatic. Version 1.0 is so much better than 0.6 in so many ways! The only problem for me was adapting to the new location of everything, as much functionality has now been organized into different packages in the standard library.


I think that 90% of the problem comes from a detail that is easily avoided: hardcoding types in arrays.

Version posted                   ->  Small change
hh_k = zeros(Int32, size_h);     ->  hh_k = zeros(size_h);
vv_k = zeros(Int32, size_v);     ->  vv_k = zeros(size_v);

function TEST_ais(W,Nsamples)

    pl_1  = Int32(1)   # remove (rewritten after)
    zer_0 = Int32(0)   #remove  (rewritten after)
    pl_1  = 1
    zer_0 = 0

    log2  = log(2.0)   #remove (not used)

    Nv, Nh = size(W) .- 1

    Wt = transpose(W)

    vvk   = rand([zer_0,pl_1],Nsamples,Nv+1)
    vvk[:,1]  = pl_1
    vv_k  = rand([zer_0,pl_1],Nsamples,Nv+1)
    vv_k[:,1] = pl_1
    size_v = size(vv_k)

    hhk   = rand([zer_0,pl_1],Nsamples,Nh+1)
    hhk[:,1] = pl_1
    hh_k  = rand([zer_0,pl_1],Nsamples,Nh+1)
    hh_k[:,1] = pl_1
    size_h = size(hh_k)

    for i in 1:2

        hh_k = zeros(size_h); hh_k[1.0./(1.0+exp.(-vv_k*W)) .> rand(size_h)] .= pl_1
        vv_k = zeros(size_v); vv_k[1.0./(1.0+exp.(-hh_k*Wt)) .> rand(size_v)] .= pl_1

    end

end;

W = rand(301,901)

@time TEST_ais(W,1024)

@time TEST_ais(W,1024)

what I get in the version where Int32 is not hardcoded (julia 0.6)

  2.381646 seconds (971.45 k allocations: 197.900 MiB, 10.02% gc time)
  0.587486 seconds (138 allocations: 152.634 MiB, 13.13% gc time)

what I get in the original version (julia 0.6)

  3.491991 seconds (1.02 M allocations: 182.773 MiB, 7.68% gc time)
  1.560260 seconds (156 allocations: 135.029 MiB, 4.70% gc time)

Another little detail: generating arrays

If inside a loop we are refilling an array with zeros (such as hh_k = zeros(size_h) in the for loop), we can preallocate the array hh_k = zeros(size_h) outside the loop and, inside the loop, just do hh_k .= 0. This way we do not allocate a new array at every iteration of the loop, yet the computations are the same.

We can do the same with the random array rand(size_h). First we create an array outside the loop, X = rand(size_h), and then call rand!(X) to populate X again with newly generated random numbers at every iteration of the loop (without allocating a new array).

Here you have a very similar function

function TEST_ais(W,Nsamples)

    pl_1  = Int32(1)   # remove (rewritten after)
    zer_0 = Int32(0)   #remove  (rewritten after)
    pl_1  = 1
    zer_0 = 0

    log2  = log(2.0)   #remove (not used)

    Nv, Nh = size(W) .- 1

    Wt = transpose(W)

    vvk   = rand([zer_0,pl_1],Nsamples,Nv+1)
    vvk[:,1]  = pl_1
    vv_k  = rand([zer_0,pl_1],Nsamples,Nv+1)
    vv_k[:,1] = pl_1
    size_v = size(vv_k)

    hhk   = rand([zer_0,pl_1],Nsamples,Nh+1)
    hhk[:,1] = pl_1
    hh_k  = rand([zer_0,pl_1],Nsamples,Nh+1)
    hh_k[:,1] = pl_1
    size_h = size(hh_k)

    random_h = rand(size_h)
    random_v = rand(size_v)
    hh_k = zeros(size_h);
    vv_k = zeros(size_v);

    for i in 1:2
        hh_k[1.0./(1.0.+exp.(-vv_k*W)) .> random_h]  .= pl_1
        vv_k[1.0./(1.0.+exp.(-hh_k*Wt)) .> random_v] .= pl_1

        # populate again the arrays without creating new arrays
        rand!(random_h); rand!(random_v); hh_k .= 0.; vv_k .= 0.
    end

end;

W = rand(301,901)

@time TEST_ais(W,1024)

@time TEST_ais(W,1024)

and all of a sudden

davids-mbp-2:comparisson david.buchaca$ julia test_ais_ferran_broadcast.jl 
  1.455618 seconds (650.40 k allocations: 120.376 MiB, 12.54% gc time)
  0.193680 seconds (108 allocations: 89.251 MiB, 39.54% gc time)

Easy things to learn from the different versions

  • “Never” write X = zeros(Int32, something): if you end up operating on X together with something that is not of type Int32, there will be conversions that can hurt performance a lot (and you might not notice the problem right away when looking at the code).

  • If you have X = zeros(size(something)) inside a for loop, try the following (see the sketch after this list):

    • write X = zeros(size(something)) outside the loop
    • inside the loop, just write X .= 0.

  • If you have X = rand(size(something)) inside a for loop, try the following:

    • write X = rand(size(something)) outside the loop
    • inside the loop, just write rand!(X).
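As a self-contained sketch of the payoff (the function names and sizes here are made up):

using Random, BenchmarkTools

function alloc_version(n, iters)
    s = 0.0
    for i in 1:iters
        X = rand(n, n)  # allocates a fresh n-by-n matrix every iteration
        s += sum(X)
    end
    return s
end

function prealloc_version(n, iters)
    s = 0.0
    X = rand(n, n)      # allocated once, outside the loop
    for i in 1:iters
        rand!(X)        # refilled in place each iteration
        s += sum(X)
    end
    return s
end

@btime alloc_version(1000, 20)     # allocates ~20 matrices
@btime prealloc_version(1000, 20)  # allocates a single matrix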

Still slower than Python + NumPy (MKL). The loop in the Python version completes 5 iterations, while the Julia version does 2. I believe the gap is still mostly about MKL.

Even extending the for loop up to 5 iterations, it runs at pretty much the same speed: 0.28 s for the Julia code above vs 0.26 s for the Python code.

Some problems with this approach:

  • At some point a compiler is smart enough to strip out dead code. I’m not sure if Julia’s own passes are there yet, but I’m certain the underlying LLVM compiler is capable of doing that. Thus if you’re not using the values (e.g. replacing the declarations, not returning anything from the function, and not calling anything with side effects such as modifying slow globals or printing), then, as @DNF mentioned, you could just as well have an empty function.
  • You have the random number generator in the timed loop! In my experience most random number generators are quite slow, and I wouldn’t be surprised if that ends up being a significant fraction of your timing. While I understand that for a Monte Carlo simulation the random number generation is probably part of what you want to measure, you also have to remember that not all random number generators are equal. Just look, for example, at https://www.boost.org/doc/libs/1_66_0/doc/html/boost_random/reference.html#boost_random.reference.generators : there is a 50x speed difference between their generators. Do you know which one is being used by each language? Do you care? I like living in the blissful ignorance of pretending that the type of random number generator doesn’t matter for my applications, but if you want to run comparisons and make decisions between languages where this detail could make a significant difference in the measured score, you should at least think about whether it matters in your case.

I think that if you call BLAS, then there is no way Julia can avoid running the whole code (it cannot know whether code compiled in another language is affecting the program or not).

In this case the whole performance problem comes from a tiny detail that is easy to miss.

  • vv_k*W will not call BLAS in this case (because the element types of vv_k and W differ):

zer_0 = 0
pl_1  = 1
vv_k  = rand([zer_0, pl_1], Nsamples, Nv+1)

  • vv_k*W will call BLAS in this case (because the element types of vv_k and W are the same):

zer_0 = 0.
pl_1  = 1.
vv_k  = rand([zer_0, pl_1], Nsamples, Nv+1)

And the time it takes to compute that can be very different

# vv_k*W timing
N = 3000, same type          0.431971 seconds
N = 3000, different type    36.492838 seconds

Here I did some tests

The random number generation you mention and the rewriting of the code to avoid allocations are important performance tricks you can use, but they are irrelevant compared to the BLAS/no-BLAS call once matrices contain 1000x1000 elements or more.
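A small sketch to see this effect yourself (the sizes are arbitrary):

using BenchmarkTools

N = 2000
W = rand(N, N)                  # Matrix{Float64}

A_int   = rand([0, 1], N, N)    # Matrix{Int}: mixed-type product, generic fallback
A_float = rand([0., 1.], N, N)  # Matrix{Float64}: same type as W, dispatches to BLAS

@btime $A_int * $W              # slow generic matrix multiply
@btime $A_float * $W            # fast BLAS gemm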


Thanks for the link. It says mt11213b is fastest (and “good uniform distribution in up to 350 dimensions”) while one of the other Mersenne Twister variants, mt19937, is about as fast (and “good uniform distribution in up to 623 dimensions”).

However, it seems you can do much faster than those, which is why I added this issue on it (there’s also an interesting discussion there about choosing AES, which is included in hardware on modern CPUs, as Julia’s default):

Python 2 and 3 use “Mersenne Twister as the core generator” (so does Julia; the same variant? Note that the third variant, mt19937_64, is only 38% as fast as the fastest one; I doubt Julia uses it, nor do I know why anyone would) and reference:

M. Matsumoto and T. Nishimura, “Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator”, ACM Transactions on Modeling and Computer Simulation Vol. 8, No. 1, January pp.3–30 1998.

You could try this one that’s already been implemented for Julia (just not in its standard library):

http://sunoru.github.io/RandomNumbers.jl/latest/man/xorshifts/

Xoroshiro128Plus is the current best suggestion for replacing other low-quality generators.
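If I read the docs right, a minimal usage sketch looks like this (the seed value is arbitrary):

using RandomNumbers.Xorshifts

rng = Xoroshiro128Plus(0x1234567890abcdef)  # explicitly seeded generator

rand(rng)          # one Float64 draw, same API as Base's rand
rand(rng, 10_000)  # a vector of draws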

If you use the Intel Math Kernel Library (MKL), VSL.jl is also a good choice for random number generation.

1 Like