Best way to optimize code

This is very interesting; I hadn’t seen the @inbounds functionality before. Really useful, thanks!

This is very true (in particular for Float64), so I don’t see how allocating an array with [rand(T) for _ = 1:n] can lead to faster code than calling rand(T) on demand.

Something you might try is to pass an rng (a MersenneTwister instance) explicitly in calls to rand, since as of Julia 1.3 using the default implicit RNG has a measurable overhead (it has to be thread-safe).
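A minimal sketch of what that looks like (not from the original post):

using Random

rng = MersenneTwister()   # explicit RNG instance

rand(rng)        # one Float64; avoids the thread-safe default-RNG overhead
rand(rng, 10)    # a vector of 10 Float64s
rand(rng, 1:6)   # one integer drawn uniformly from 1:6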


FYI: A great way to profile, with flame graphs, is GitHub - timholy/ProfileView.jl: Visualization of Julia profiling data
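A minimal usage sketch (the workload below is a placeholder, not code from this thread):

using ProfileView

workload() = sum(rand() for _ in 1:10^7)   # placeholder workload to profile

@profview workload()   # collects a profile and opens an interactive flame graph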

What version of Julia did you use?

I got 0.66 sec on Julia 1.0.5 but 0.195 sec for word_sim3. Julia 1.3 had some known rand speed regression, and that issue is still open.

I didn’t match your speed; your PC just seems faster (I could only just match it with word_sim3, with a best of 0.233 sec on Julia 1.5, not as good as on Julia 1.0.5).

FYI: I couldn’t generate integer random numbers with it (not meant to, or just not possible yet?), but for floating-point it should be faster:

On my Julia 1.5-DEV (though not the latest nightly) I got similar speed on the original code to Julia 1.4.0 (with less penalty for the first run), but if anything Julia 1.4.0 was slightly faster (I didn’t test 1.4.1):

0.416916 seconds (798.10 k allocations: 164.778 MiB, 1.80% gc time)

for first run: 0.651927 seconds (865.73 k allocations: 168.292 MiB, 25.31% gc time)

For word_sim3:
0.233139 seconds (498.10 k allocations: 222.762 MiB, 2.88% gc time)

Your first problem is that you’re benchmarking in global scope with non-const globals. Make sure to wrap your code in a function and benchmark that using the BenchmarkTools package.
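For illustration, a minimal sketch of that pattern (the function and values below are placeholders, not the original code):

using BenchmarkTools

function run_sim(n, mu)      # all the work lives inside a function, no non-const globals
    hits = 0
    for _ in 1:n
        hits += mu > rand()  # the Bool adds as 0 or 1
    end
    return hits
end

n, mu = 10_000, 0.01         # example inputs (globals here, hence the $ below)
@btime run_sim($n, $mu)      # interpolate globals with $ when benchmarking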

As for the tests and innovates vectors, you don’t need to allocate them anywhere, inside or outside the function. Just write if mu > rand() each time. No need for preallocation.

As a matter of style, I think you should use vectors instead of matrices, that is,

words = zeros(Int32, omega)

instead of

unless there’s some particular reason for that choice.
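For illustration (the matrix form below is my guess at the alternative being discussed, since the quoted line did not come through):

omega = 1_000

words = zeros(Int32, omega)          # Vector{Int32} of length omega: preferred
# words = zeros(Int32, omega, 1)     # an omega×1 Matrix{Int32}: avoid unless you really need 2-D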


The performance improvements on rand have not yet been merged into master.


I did take a look at the not-yet-merged PR for the algorithm (and didn’t actually locate any division used…):

The currently used algorithm understandably has differing speed depending on the exact range used [EDIT: thanks to @rfourquet for the $rng tip, which only makes my approach relatively stronger]:

rng = MersenneTwister();

julia> @btime rand($rng, 0:2)
  11.120 ns (0 allocations: 0 bytes)

julia> @btime rand($rng, 1:3)
  11.108 ns (0 allocations: 0 bytes)

# before without $rng:
15.286 ns (0 allocations: 0 bytes)

julia> @btime rand($rng, 0:3)
  6.728 ns (0 allocations: 0 bytes)

julia> @btime rand(0:3)
  10.464 ns (0 allocations: 0 bytes)

I looked a bit into what’s currently possible without the algorithm, at least if you can handle bias, and got more than twice the speed for the same range:

julia> @btime (trunc(Int64, rand($rng, UInt16)*(1/3)))>>13
  5.243 ns (0 allocations: 0 bytes)

julia> @btime 1+(trunc(Int64, rand($rng, UInt16)*(1/3)))>>13
  5.607 ns (0 allocations: 0 bytes)

# without $rng:
  8.968 ns (0 allocations: 0 bytes)


julia> rA = collect([1+(trunc(Int64, rand(UInt16)*(1/3)))>>13 for i in 1:10000000]);

# However, no bias for e.g. 1:2 or 1:4 range, and still a bit faster:
julia> rB = collect([1+(trunc(Int64, rand(UInt16)*(1/2)))>>(14) for i in 1:10000000]);
julia> rC = collect([1+(rand(UInt16)>>(8+6)) for i in 1:10000000]);


julia> count(i->i==3, rA)
2501880

julia> count(i->i==2, rA)
3749129

julia> count(i->i==1, rA)
3748991
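For context, this is my reading of those counts (not stated explicitly above): a UInt16 has 65536 possible values, which the trunc-and-shift mapping cannot split evenly into three buckets, so outcomes 1 and 2 each get 24576/65536 ≈ 0.375 of the mass and outcome 3 only 16384/65536 = 0.25, matching the counts. A quick check:

# Tally how the trunc-and-shift mapping distributes all 65536 UInt16 inputs.
buckets = zeros(Int, 3)
for x in 0x0000:0xffff
    buckets[1 + (trunc(Int64, x*(1/3)) >> 13)] += 1
end
buckets ./ 65536   # ≈ [0.375, 0.375, 0.25]: outcome 3 is underrepresented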

I missed this:

seed = [0:1:(omega-1);]; 

Don’t do this, do

seed = 0:(omega-1)

Collecting ranges into vectors is something you should almost never do. This is actually my pet peeve :grin:
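A small sketch of why the range alone is enough (the example value is mine):

omega = 10
seed = 0:(omega-1)   # a lazy UnitRange, no allocation

seed[3]              # indexes like a vector → 2
length(seed)         # 10
sum(seed)            # 45; most array-style operations work directly on ranges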


Frankly, I don’t think you need to create even the words array. You just need to repeatedly do a random test, check if it satisfies the probability of getting omega, and then recalculate the probability of getting omega the next time. Then keep track of the count. I don’t think you need any vector at all.


If you are interested in previous discussions, you can have a look at random sampling from an (abstract)array is slow · Issue #20582 · JuliaLang/julia · GitHub and rand(1:10) vs Int(round(10*rand()) (and maybe RFC: implement a biased rand(a:b) (fix #20582) by rfourquet · Pull Request #28987 · JuliaLang/julia · GitHub which implements a biased rand(a:b)).

Also, I don’t know which version of Julia you are using, but on 1.3 and later I will repeat my recommendation to use an explicit RNG when you benchmark, if what you want to measure is the performance of the algorithm that samples from a range, i.e. @btime rand($rng, 1:3) rather than @btime rand(1:3), where rng = MersenneTwister().


Thanks, my rand method (in another comment) is a bit faster, as it avoids floating point, and is sometimes (but not always) unbiased. I did, however, take that other version, claimed to always be unbiased, and made the fastest variant yet (on my machine):

rng = MersenneTwister();

julia> @btime begin
[..]
               pops[jj] = word_sim4($rng, tau, omega, mu);
[..]
  189.741 ms (500003 allocations: 222.79 MiB)


# I changed one line from word_sim3 (two counting the first line; I wasn't
# sure how to do it otherwise), though I doubt this pre-allocated
# rand-array scheme will remain the fastest in the future

julia> function word_sim4(rng, tau::Int, omega::Int, mu::Float64)
           # inserts a word in position (tau+1), at each point creates a new word with prob mu
           # otherwise randomly chooses a previously used. Runs the program until time omega

           words = Array{Int32}(undef, omega) # to store the words
           tests = rand(omega) # will compare mu to these
           words[1] = 1; # initialize the words
           next_word = 2 # will be the next word used
           words[tau+1] = omega + 1; # max possible word so insert that at time tau
           rand0 = [1 + floor(Int, rand(rng)*i) for i=1:omega-1] # precalc. random numbers
           @inbounds for i = 2:tau # simulate the process
               if mu > tests[i] # innovate 
                   words[i] = next_word
                   next_word = next_word + 1
               else # copy
                   #words[i] = words[rand(1:(i-1))]
                   words[i] = words[rand0[i-1]]
               end
           end
           # force the word we're interested in
           @inbounds for i = (tau+2):omega
               if mu > tests[i] # innovate 
                   words[i] = next_word
                   next_word = next_word + 1
               else # copy
                   #words[i] = words[rand(1:(i-1))]
                   words[i] = words[rand0[i-1]]
               end
           end
           result = count(==(omega + 1), words) # count how many times our word occurred
           return result
       end
word_sim4 (generic function with 1 method)
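The benchmark harness above is elided with [..]; as a rough, self-contained sketch of what such a harness could look like (the parameter values and run count are my assumptions, not from this thread):

using BenchmarkTools, Random

rng = MersenneTwister()
tau, omega, mu = 100, 1_000, 0.01    # hypothetical example parameters
pops = Vector{Int}(undef, 1_000)     # number of simulation runs, also hypothetical

@btime begin
    for jj in eachindex($pops)
        $pops[jj] = word_sim4($rng, $tau, $omega, $mu)
    end
end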

I haven’t checked this really well, but I think it should be close to correct:

function word_sim_alternative(τ, ω, μ)
    wordcount = 1
    for i in (τ+2):ω
        wordcount += (i * rand()) < (1-μ) * wordcount
    end
    return wordcount
end
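A quick usage sketch (the parameter values are mine, just for illustration):

using Statistics

τ, ω, μ = 100, 1_000, 0.01
samples = [word_sim_alternative(τ, ω, μ) for _ in 1:10_000]
mean(samples)   # average count of the word of interest across runs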

This is fantastic! I didn’t realise random number generation could affect speed so much.

This is a really neat function. At some point I will need the entire array of words, but in general this should be good for most cases. Thanks so much.

Still, my latest change to the program, with the help of that RNG package, more than doubles [EDIT: almost triples, if I recall the old number correctly] the speed (what is it on your faster machine?) on top of the best so far, and I think it could be made better by avoiding the conversion (from int to float, in that package, and back to int):

julia> using BenchmarkTools, ProfileView, VectorizedRNG  # note,
# version v0.1.6, master version is broken, so use the released one
[ Info: Precompiling VectorizedRNG [33b4df10-0173-11e9-2a0c-851a7edac40e]

rng = local_rng() 

julia> @btime begin
[..]
               pops[jj] = word_sim5(tau, omega, mu);
[..]
  63.522 ms (300002 allocations: 218.21 MiB)

@profview begin ... now shows that the bottleneck has moved

For:

julia> function word_sim5(tau::Int, omega::Int, mu::Float64)
                  # inserts a word in position (tau+1), at each point creates a new word with prob mu
                  # otherwise randomly chooses a previously used. Runs the program until time omega
                  words = Array{Int32}(undef, omega) # to store the words
                  tests = rand(local_rng(), omega) # will compare mu to these
                  words[1] = 1; # initialize the words
                  next_word = 2 # will be the next word used
                  words[tau+1] = omega + 1; # max possible word so insert that at time tau
                  rand0 = Vector{Float64}(undef, omega) # one extra element, by design
                  rand!(local_rng(), rand0)  # precalc. float random numbers
                  @inbounds for i = 2:tau # simulate the process
                      if mu > tests[i] # innovate 
                          words[i] = next_word
                          next_word = next_word + 1
                      else # copy
                          words[i] = words[1 + floor(Int, rand0[i]*(i-1))] # scale the unit float to an index in 1:(i-1)
                      end
                  end
                  # force the word we're interested in
                  @inbounds for i = (tau+2):omega
                      if mu > tests[i] # innovate 
                          words[i] = next_word
                          next_word = next_word + 1
                      else # copy
                          words[i] = words[1 + floor(Int, rand0[i]*(i-1))] # scale the unit float to an index in 1:(i-1)
                      end
                  end
                  result = count(==(omega + 1), words) # count how many times our word occurred
                  return result
              end
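For reference, a minimal sketch of the VectorizedRNG calls the function relies on (this assumes VectorizedRNG v0.1.6, as noted above):

using Random, VectorizedRNG   # rand! comes from Random, local_rng from VectorizedRNG

buf = Vector{Float64}(undef, 8)
rand!(local_rng(), buf)       # fills buf with uniform Float64 draws in [0, 1)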

If that happens, it might be enough to just store the location indices of that word, and you still wouldn’t need those locations for the assignment of the next word.

Great!


Am I not reading the profile right?

By now the single slowest line is “word_sim5: line 4”, at about 20%+ (though the three rand lines combined, 5, 9 and 16, account for about 75% of the time).

Do the numbers for that line make sense? It already seems fast, and with “undef” it should be O(1), since the memory isn’t initialized:

julia> @time words = Array{Int32}(undef, omega);
  0.000003 seconds (1 allocation: 496 bytes)

julia> omega = 100000
100000

julia> @time words = Array{Int32}(undef, omega);
  0.000008 seconds (2 allocations: 390.703 KiB)

julia> omega = 1000000000
1000000000

julia> @time words = Array{Int32}(undef, omega);
  0.012678 seconds (2 allocations: 3.725 GiB, 99.61% gc time)

julia> @time words = Array{Int32}(undef, omega);
  0.004373 seconds (2 allocations: 3.725 GiB, 99.06% gc time)

julia> @time words = Array{Int32}(undef, omega);
  0.047590 seconds (2 allocations: 3.725 GiB, 99.92% gc time)

For those extra tests I did (unrelated to the program, really), I’m exercising the GC [with @btime I do see the same allocations, but the “gc time” is gone, as it takes care of avoiding GC], but why? It seems I’m testing the C library’s malloc, and it’s not O(1); even then, why GC?

[Funny story: I have 128 GB of memory, so the computer is not swapping. I thought I had 12 GB and was told, “that can’t be right, we didn’t give you only 12 GB”. I had never seen the + in top before, so I disregarded that a digit was cut off:

KiB Mem : 13182060+total, 10671788+free, 14977804 used, 10124916 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 11454323+avail Mem ]

Wow, I’m hitting .09 seconds with this version (down from .26), really impressive. I’ll need to look into the VectorizedRNG package (does it have any of the previously mentioned bias issues, for example?).

There’s an open issue for integer support. I’ll try and get around to adding it soon.

I don’t think so, but note I wasn’t using it directly; I was, however, using a workaround for the speed of ranges that is claimed to be unbiased. That package is based on the work here (which is what I recommended for a new RNG in Julia, as MersenneTwister() is known to be slower): http://prng.di.unimi.it/

Yes, I added that issue after looking into optimizing the program in this thread. I could make a PR based on the work I did here, to make simplifying the program here possible. But doing the int->float (as your package does) and back to int wouldn’t be the best way, though it could be optimized later. Then, however, you would get a different stream of random numbers. I guess there is no guarantee about that, so it’s OK.

Another package that DOES work for integers, and is fast, but last I checked it surprisingly wasn’t faster than Julia Base:

Sorry, sunoru’s package is actually very fast, faster than Base. I made a timing mistake. I’m close to my best version here with that package, and only the older version uses vectorization. I’m doing:

julia> using RandomNumbers.Xorshifts

julia> r = Xoroshiro128Plus();
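A minimal sketch of using it (this assumes, as I understand it, that RandomNumbers.jl generators plug into the standard rand API):

using BenchmarkTools
using RandomNumbers.Xorshifts

r = Xoroshiro128Plus()

@btime rand($r)        # a plain Float64 draw
@btime rand($r, 1:3)   # range sampling with the xoroshiro128+ generator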

And I’m not yet using what is now merged into Julia 1.6-DEV:
