This has not been my experience, quite honestly. With each version I feel everything has grown more consistent, streamlined, performant and automatic. Version 1.0 is so much better than 0.6 in so many ways! The only problem for me was adapting to the new location of everything, since much functionality has now been reorganised into different packages in the standard library.
I think that 90% of the problem comes from a detail that is easily avoided: hardcoding types in arrays.
Version posted vs. small change:

```julia
hh_k = zeros(Int32, size_h);   # ->  hh_k = zeros(size_h);
vv_k = zeros(Int32, size_v);   # ->  vv_k = zeros(size_v);
```
```julia
function TEST_ais(W, Nsamples)
    pl_1 = 1
    zer_0 = 0
    Nv, Nh = size(W) .- 1
    Wt = transpose(W)

    vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
    vv_k[:,1] = pl_1
    size_v = size(vv_k)

    hh_k = rand([zer_0, pl_1], Nsamples, Nh+1)
    hh_k[:,1] = pl_1
    size_h = size(hh_k)

    for i in 1:2
        hh_k = zeros(size_h)
        hh_k[1.0 ./ (1.0 .+ exp.(-vv_k*W)) .> rand(size_h)] .= pl_1
        vv_k = zeros(size_v)
        vv_k[1.0 ./ (1.0 .+ exp.(-hh_k*Wt)) .> rand(size_v)] .= pl_1
    end
end

W = rand(301, 901)
@time TEST_ais(W, 1024)
@time TEST_ais(W, 1024)
```
What I get in the version where Int32 is not hardcoded (Julia 0.6):
```
2.381646 seconds (971.45 k allocations: 197.900 MiB, 10.02% gc time)
0.587486 seconds (138 allocations: 152.634 MiB, 13.13% gc time)
```
What I get in the original version (Julia 0.6):
```
3.491991 seconds (1.02 M allocations: 182.773 MiB, 7.68% gc time)
1.560260 seconds (156 allocations: 135.029 MiB, 4.70% gc time)
```
Another little detail: generating arrays
If inside a loop we are refilling an array with zeros (as in the for loop above, `hh_k = zeros(size_h)`), we can preallocate the array with `hh_k = zeros(size_h)` outside the loop, and inside the loop just do `hh_k .= 0`. Doing so, we do not allocate a new array at every iteration of the loop, yet the computations are the same.

We can do the same with the random array `rand(size_h)`. First we create an array outside the loop, `X = rand(size_v)`, and then call `rand!(X)` inside the loop to populate `X` with newly generated random numbers at every iteration (but without allocating a new array).
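A minimal sketch of this preallocation pattern in isolation (the function names and sizes are mine, purely for illustration):

```julia
using Random  # on Julia 1.0+, rand! lives in the Random stdlib

# Allocating inside the loop: a fresh zeros array and a fresh random
# array are created at every single iteration.
function fill_naive(n, iters)
    total = 0.0
    for _ in 1:iters
        X = zeros(n)
        R = rand(n)
        X[R .> 0.5] .= 1.0
        total += sum(X)
    end
    return total
end

# Preallocating outside the loop: the same two buffers are reused,
# refilled in place with `X .= 0` and `rand!(R)`.
function fill_prealloc(n, iters)
    X = zeros(n)
    R = rand(n)
    total = 0.0
    for _ in 1:iters
        rand!(R)          # repopulate R with fresh random numbers, no allocation
        X[R .> 0.5] .= 1.0
        total += sum(X)
        X .= 0.0          # refill X with zeros, no allocation
    end
    return total
end
```

Both functions compute the same thing; `@time` on the second should show far fewer allocations.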
Here is a very similar function:
```julia
function TEST_ais(W, Nsamples)
    pl_1 = 1
    zer_0 = 0
    Nv, Nh = size(W) .- 1
    Wt = transpose(W)

    vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
    vv_k[:,1] = pl_1
    size_v = size(vv_k)

    hh_k = rand([zer_0, pl_1], Nsamples, Nh+1)
    hh_k[:,1] = pl_1
    size_h = size(hh_k)

    random_h = rand(size_h)
    random_v = rand(size_v)
    hh_k = zeros(size_h)
    vv_k = zeros(size_v)

    for i in 1:2
        hh_k[1.0 ./ (1.0 .+ exp.(-vv_k*W)) .> random_h] .= pl_1
        vv_k[1.0 ./ (1.0 .+ exp.(-hh_k*Wt)) .> random_v] .= pl_1
        # populate the arrays again without creating new arrays
        rand!(random_h); rand!(random_v)
        hh_k .= 0.; vv_k .= 0.
    end
end

W = rand(301, 901)
@time TEST_ais(W, 1024)
@time TEST_ais(W, 1024)
```
and all of a sudden
```
davids-mbp-2:comparisson david.buchaca$ julia test_ais_ferran_broadcast.jl
1.455618 seconds (650.40 k allocations: 120.376 MiB, 12.54% gc time)
0.193680 seconds (108 allocations: 89.251 MiB, 39.54% gc time)
```
Easy things to learn from the different versions:

- If you write `X = zeros(Int32, something)` and end up operating on `X` with something else that does not have type Int32, there will be conversions that can hurt performance a lot (and you might not notice the problem right away by looking at the code).
- If you have `X = zeros(size(something))` inside a for loop, try the following: put `X = zeros(size(something))` outside the loop, and inside the loop just write `X .= 0`.
- If you have `X = rand(size(something))` inside a for loop, try the following: put `X = rand(size(something))` outside the loop, and inside the loop just write `rand!(X)`.
Still slower than Python + NumPy (MKL). The loop in the Python version completes 5 iterations, while the Julia version does 2. I believe the gap is still mostly about MKL.
Even extending the for loop up to 5 iterations, it runs at pretty much the same speed: 0.28 s for the Julia code above vs 0.26 s for the Python code.
Some problems with this approach:
- At some point a compiler is smart enough to strip out dead code. I'm not sure whether Julia's own compilation passes do this yet, but I'm certain the underlying LLVM compiler is capable of it. Thus if you're not using the values (e.g. replacing the declarations, and/or not returning anything from the function, and/or not calling anything with side effects such as modifying slow globals or printing), as @DNF mentioned, you could just as well have an empty function.
- You have the random number generator in the timed loop! In my experience most random number generators are very slow, and I wouldn't be surprised if that ends up being a significant fraction of your timing. While I understand that for a Monte Carlo the random number generation is probably part of what you want to measure, you also have to remember that not all random number generators are equal. Just look, for example, at https://www.boost.org/doc/libs/1_66_0/doc/html/boost_random/reference.html#boost_random.reference.generators There is a 50x speed difference between their random generators. Do you know which one is being used by the different languages? Do you care? I like living in the blissful ignorance of pretending that the type of random number generator doesn't matter for my applications, but if you want to do comparisons and make decisions between languages where this detail could be making a significant difference in the measured score, you should at least think about whether it matters in your case.
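One way to check whether the generator itself is a significant fraction of the timing is to time just the `rand!` call in isolation (a quick sketch, not a rigorous benchmark):

```julia
using Random

rng = MersenneTwister(1234)          # Julia's default generator family
buf = Vector{Float64}(undef, 10^6)   # reusable buffer, filled in place

@time rand!(rng, buf)   # first call includes compilation overhead
@time rand!(rng, buf)   # second call times the generator itself
```

Comparing this number against the total loop time tells you how much of the measurement is really the RNG.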
I think that if you call BLAS then there is no way Julia can avoid running the whole code (it cannot know whether code compiled in another language is affecting the program or not).

In this case, the whole performance problem comes from a tiny detail that is hard to spot.
`vv_k*W` will not call BLAS in this case (because the element types of `vv_k` and `W` are different):
```julia
zer_0 = 0
pl_1 = 1
vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
```
`vv_k*W` will call BLAS in this case (because the element types of `vv_k` and `W` are the same):
```julia
zer_0 = 0.
pl_1 = 1.
vv_k = rand([zer_0, pl_1], Nsamples, Nv+1)
```
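A quick way to see the two paths side by side (the size `N` here is mine, picked small so the snippet runs fast; the gap grows dramatically with larger matrices):

```julia
N = 300
A_int   = rand([0, 1], N, N)   # Matrix{Int}: A_int * W takes the generic fallback
A_float = float(A_int)         # same values as Matrix{Float64}: dispatches to BLAS
W = rand(N, N)

C1 = A_int * W     # mixed element types: generic matmul
C2 = A_float * W   # matching Float64 types: BLAS gemm
@assert C1 ≈ C2    # identical results; only the speed differs
```

Converting once before the hot loop (or generating floats in the first place) keeps every multiply on the BLAS path.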
And the time it takes to compute that can be very different
```
# vv_k*W timing
N = 3000, same type:       0.431971 seconds
N = 3000, different type: 36.492838 seconds
```
Here I did some tests
The random number generation you mention and the rewriting of the code to avoid allocations are important performance tricks, but they are irrelevant compared to the BLAS/non-BLAS call when matrices contain 1000x1000 elements or more.
Thanks for the link. It says `mt11213b` is fastest (and has "good uniform distribution in up to 350 dimensions"), while one of the other Mersenne Twister variants, `mt19937`, is about as fast (and has "good uniform distribution in up to 623 dimensions").
However, it seems you can do much faster than those, which is why I opened an issue about it (there's also an interesting discussion there on choosing AES, which is included in hardware on modern CPUs, as Julia's default):
Python 2 and 3 use "Mersenne Twister as the core generator" (so does Julia; the same variant? Note, the third variant, `mt19937_64`, is only 38% as fast as the fastest; I doubt Julia uses it, and I don't know why anyone would) and references:
M. Matsumoto and T. Nishimura, “Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator”, ACM Transactions on Modeling and Computer Simulation Vol. 8, No. 1, January pp.3–30 1998.
You could try this one that’s already been implemented for Julia (just not in its standard library):
`Xoroshiro128Plus` is the current best suggestion for replacing other low-quality generators.
If you use the Intel Math Kernel Library (MKL), `VSL.jl` is also a good choice for random number generation.