This has not been my experience, quite honestly. With each version I feel everything has grown more consistent, streamlined, performant and automatic. Version 1.0 is so much better than 0.6 in so many ways! The only problem for me was adapting to the new location of everything, since much functionality has now been reorganised into different packages in the standard library.
I think that 90% of the problem comes from a detail that is easily avoided: hardcoding types in arrays.
Version posted vs. small change:

```julia
hh_k = zeros(Int32, size_h);   # ->  hh_k = zeros(size_h);
vv_k = zeros(Int32, size_v);   # ->  vv_k = zeros(size_v);
```
```julia
function TEST_ais(W, Nsamples)
    pl_1 = 1
    zer_0 = 0
    Nv, Nh = size(W) .- 1
    Wt = transpose(W)

    vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
    vv_k[:,1] = pl_1
    size_v = size(vv_k)

    hh_k = rand([zer_0, pl_1], Nsamples, Nh+1)
    hh_k[:,1] = pl_1
    size_h = size(hh_k)

    for i in 1:2
        hh_k = zeros(size_h)
        hh_k[1.0 ./ (1.0 .+ exp.(-vv_k*W)) .> rand(size_h)] .= pl_1
        vv_k = zeros(size_v)
        vv_k[1.0 ./ (1.0 .+ exp.(-hh_k*Wt)) .> rand(size_v)] .= pl_1
    end
end

W = rand(301, 901)
@time TEST_ais(W, 1024)
@time TEST_ais(W, 1024)
```
What I get in the version where Int32 is not hardcoded (Julia 0.6):
```
2.381646 seconds (971.45 k allocations: 197.900 MiB, 10.02% gc time)
0.587486 seconds (138 allocations: 152.634 MiB, 13.13% gc time)
```
What I get in the original version (Julia 0.6):
```
3.491991 seconds (1.02 M allocations: 182.773 MiB, 7.68% gc time)
1.560260 seconds (156 allocations: 135.029 MiB, 4.70% gc time)
```
Another little detail: generating arrays
If inside a loop we are refilling an array with zeros (as in the for loop above, `hh_k = zeros(size_h)`), we can preallocate the array with `hh_k = zeros(size_h)` outside the loop, and inside the loop just do `hh_k .= 0`. Doing so, we do not allocate a new array at every iteration of the loop, yet the computations are the same.

We can do the same with the random array `rand(size_h)`. First we create an array outside the loop, `X = rand(size_v)`, and then call `rand!(X)` inside the loop to populate `X` with newly generated random numbers at every iteration (but without allocating a new array).
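A minimal sketch of this preallocation pattern in isolation (the function names and sizes are mine, purely for illustration):

```julia
using Random  # on Julia 1.0+, rand! lives in the Random stdlib

# Allocating inside the loop: a fresh zeros array and a fresh random
# array are created at every single iteration.
function fill_naive(n, iters)
    total = 0.0
    for _ in 1:iters
        X = zeros(n)
        R = rand(n)
        X[R .> 0.5] .= 1.0
        total += sum(X)
    end
    return total
end

# Preallocating outside the loop: the same two buffers are reused,
# refilled in place with `X .= 0` and `rand!(R)`.
function fill_prealloc(n, iters)
    X = zeros(n)
    R = rand(n)
    total = 0.0
    for _ in 1:iters
        rand!(R)          # repopulate R with fresh random numbers, no allocation
        X[R .> 0.5] .= 1.0
        total += sum(X)
        X .= 0.0          # refill X with zeros, no allocation
    end
    return total
end
```

Both functions compute the same thing; `@time` on the second should show far fewer allocations.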
Here is a very similar function:
```julia
function TEST_ais(W, Nsamples)
    pl_1 = 1
    zer_0 = 0
    Nv, Nh = size(W) .- 1
    Wt = transpose(W)

    vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
    vv_k[:,1] = pl_1
    size_v = size(vv_k)

    hh_k = rand([zer_0, pl_1], Nsamples, Nh+1)
    hh_k[:,1] = pl_1
    size_h = size(hh_k)

    random_h = rand(size_h)
    random_v = rand(size_v)
    hh_k = zeros(size_h)
    vv_k = zeros(size_v)

    for i in 1:2
        hh_k[1.0 ./ (1.0 .+ exp.(-vv_k*W)) .> random_h] .= pl_1
        vv_k[1.0 ./ (1.0 .+ exp.(-hh_k*Wt)) .> random_v] .= pl_1
        # populate the arrays again without creating new arrays
        rand!(random_h); rand!(random_v)
        hh_k .= 0.; vv_k .= 0.
    end
end

W = rand(301, 901)
@time TEST_ais(W, 1024)
@time TEST_ais(W, 1024)
```
and all of a sudden
```
davids-mbp-2:comparisson david.buchaca$ julia test_ais_ferran_broadcast.jl
1.455618 seconds (650.40 k allocations: 120.376 MiB, 12.54% gc time)
0.193680 seconds (108 allocations: 89.251 MiB, 39.54% gc time)
```
Easy things to learn from the different versions:

- If you write `X = zeros(Int32, something)` and end up operating on `X` with something else that does not have type Int32, there will be conversions that can hurt performance a lot (and you might not notice the problem right away by looking at the code).
- If you have `X = zeros(size(something))` inside a for loop, try the following: put `X = zeros(size(something))` outside the loop, and inside the loop just write `X .= 0`.
- If you have `X = rand(size(something))` inside a for loop, try the following: put `X = rand(size(something))` outside the loop, and inside the loop just write `rand!(X)`.
Still slower than Python + NumPy (MKL). The loop in the Python version completes 5 iterations, while the Julia version does 2. I believe the gap is still mostly about MKL.
Even extending the for loop up to 5 iterations, it runs at pretty much the same speed: 0.28 s for the Julia code above vs 0.26 s for the Python code.
Some problems with this approach:
- At some point a compiler is smart enough to strip out dead code. I'm not sure whether Julia's own compilation passes do this yet, but I'm certain the underlying LLVM compiler is capable of it. Thus if you're not using the values (e.g. replacing the declarations, and/or not returning anything from the function, and/or not calling anything with side effects such as modifying slow globals or printing), as @DNF mentioned, you could just as well have an empty function.
- You have the random number generator in the timed loop! In my experience most random number generators are very slow, and I wouldn't be surprised if that ends up being a significant fraction of your timing. While I understand that for a Monte Carlo the random number generation is probably part of what you want to measure, you also have to remember that not all random number generators are equal. Just look, for example, at https://www.boost.org/doc/libs/1_66_0/doc/html/boost_random/reference.html#boost_random.reference.generators There is a 50x speed difference between their random generators. Do you know which one is being used by the different languages? Do you care? I like living in the blissful ignorance of pretending that the type of random number generator doesn't matter for my applications, but if you want to do comparisons and make decisions between languages where this detail could be making a significant difference in the measured score, you should at least think about whether it matters in your case.
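One way to check whether the generator itself is a significant fraction of the timing is to time just the `rand!` call in isolation (a quick sketch, not a rigorous benchmark):

```julia
using Random

rng = MersenneTwister(1234)          # Julia's default generator family
buf = Vector{Float64}(undef, 10^6)   # reusable buffer, filled in place

@time rand!(rng, buf)   # first call includes compilation overhead
@time rand!(rng, buf)   # second call times the generator itself
```

Comparing this number against the total loop time tells you how much of the measurement is really the RNG.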
I think that if you call BLAS then there is no way Julia can avoid running the whole code (it cannot know whether code compiled in another language is affecting the program or not).

In this case, the whole performance problem comes from a tiny detail that is hard to spot.
`vv_k*W` will not call BLAS in this case (because the element types of `vv_k` and `W` are different):
```julia
zer_0 = 0
pl_1 = 1
vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
```
`vv_k*W` will call BLAS in this case (because the element types of `vv_k` and `W` are the same):
```julia
zer_0 = 0.
pl_1 = 1.
vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
```
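A quick way to see the two paths side by side (the size `N` here is mine, picked small so the snippet runs fast; the gap grows dramatically with larger matrices):

```julia
N = 300
A_int   = rand([0, 1], N, N)   # Matrix{Int}: A_int * W takes the generic fallback
A_float = float(A_int)         # same values as Matrix{Float64}: dispatches to BLAS
W = rand(N, N)

C1 = A_int * W     # mixed element types: generic matmul
C2 = A_float * W   # matching Float64 types: BLAS gemm
@assert C1 ≈ C2    # identical results; only the speed differs
```

Converting once before the hot loop (or generating floats in the first place) keeps every multiply on the BLAS path.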
And the time it takes to compute that can be very different
```
# vv_k*W timing
N = 3000, same type:       0.431971 seconds
N = 3000, different type: 36.492838 seconds
```
Here I did some tests
The random number generation you mention and the rewriting of the code to avoid allocations are important performance tricks, but they are irrelevant compared to the BLAS/non-BLAS call when matrices contain 1000x1000 elements or more.
Thanks for the link. It says `mt11213b` is fastest (and has "good uniform distribution in up to 350 dimensions"), while one of the other Mersenne Twister variants, `mt19937`, is about as fast (and has "good uniform distribution in up to 623 dimensions").
However, it seems you can do much faster than those, which is why I opened an issue about it (there's also an interesting discussion there on choosing AES, which is included in hardware on modern CPUs, as Julia's default):
Python 2 and 3 use "Mersenne Twister as the core generator" (so does Julia; the same variant? Note, the third variant, `mt19937_64`, is only 38% as fast as the fastest; I doubt Julia uses it, and I don't know why anyone would) and references:
M. Matsumoto and T. Nishimura, “Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator”, ACM Transactions on Modeling and Computer Simulation Vol. 8, No. 1, January pp.3–30 1998.
You could try this one that’s already been implemented for Julia (just not in its standard library):
`Xoroshiro128Plus` is the current best suggestion for replacing other low-quality generators.
If you use the Intel Math Kernel Library (MKL), `VSL.jl` is also a good choice for random number generation.