Best way to optimize code

I’m making my first effort to move from Matlab to Julia and have found my code to improve by ~3x but still think there is more to come, I’m not using any global variables in the function and have preallocated all the arrays used (I think?). If there was any thoughts on how it could be sped up even further it would be greatly appreciated, I’ll fully convert even at the current improvement I think!

function word_sim(tau::Int, omega::Int, mu::Float64)
# inserts a word in position (tau+1), at each point creates a new word with prob mu
# otherwise randomly chooses a previously used. Runs the program until time omega

words = zeros(Int32, 1, omega) # to store the words
tests = rand(1,omega) # will compare mu to these
words[1] = 1; # initialize the words
next_word = 2 # will be the next word used
words[tau+1] = omega + 1; # max possible word so insert that at time tau
innovates = mu .> tests; # when we'll make a new word
for i = 2:tau # simulate the process
    if innovates[i] == 1 # innovate 
        words[i] = next_word
        next_word = next_word + 1
    else # copy
        words[i] = words[rand(1:(i-1))]
    end
end
# force the word we're interested in
for i = (tau+2):omega
    if innovates[i] == 1 # innovate 
        words[i] = next_word
        next_word = next_word + 1
    else # copy
        words[i] = words[rand(1:(i-1))]
    end
end
result = sum(words .== (omega + 1)); # count how many times our word occurred
return result
end

and when I run it with these values it takes ~.26 seconds on my PC

using Statistics
@time begin
nsim = 10^3;
omega = 100;
seed = [0:1:(omega-1);]; 
mu = 0.01; 

results = zeros(Float64, 1, length(seed));
pops = zeros(Int64, 1, nsim);
for tau in seed
    for jj = 1:nsim
        pops[jj] = word_sim(tau, omega, mu);
    end
    results[tau+1] = mean(pops);
end
end

Or perhaps I’d be better writing the code in C++? Julia was my first reaction as I’ve heard rave reviews about its syntax, which to be honest is fantastic!

Any comments greatly appreciated.

1 Like

You could replace sum(words .== (omega + 1)) with count(==(omega + 1), words), which shaves another 40 ms off for me. For benchmarking, you might want to replace @time with @btime from BenchmarkTools.jl, which will exclude compilation time and build an average. Otherwise your code looks pretty good, I don’t see much more obvious optimizations left on the table, but am glad to be proven wrong.

3 Likes

I would say just if innovates[i] would be slightly better style than if innovates[i] == 1 since it’s an array of bools, but the performance implications are probably minimal.

1 Like

You could cut the allocations even more(although I’m not sure you will gain much speed) by keeping seed as range ( no square brackets) and preallocate with Array{Int64,1)(undef,dim). Keep in mind the allocated array will contain gibberish;).

2 Likes

You get another 60 ms, if you don’t allocate innovates:

function word_sim2(tau::Int, omega::Int, mu::Float64)
    # inserts a word in position (tau+1), at each point creates a new word with prob mu
    # otherwise randomly chooses a previously used. Runs the program until time omega
    
    words = Array{Int32}(undef, omega) # to store the words
    tests = rand(omega) # will compare mu to these
    words[1] = 1; # initialize the words
    next_word = 2 # will be the next word used
    words[tau+1] = omega + 1; # max possible word so insert that at time tau
    for i = 2:tau # simulate the process
        if mu > tests[i] # innovate 
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    # force the word we're interested in
    for i = (tau+2):omega
        if mu > tests[i] # innovate 
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            words[i] = words[rand(1:(i-1))]
        end
    end
    result = count(==(omega + 1), words) # count how many times our word occurred
    return result
end
3 Likes

There is no need to allocate new words and tests arrays for each call, i.e. construct them before the test loops and send them in as arguments to word_sim. (tests can be filled with Random.rand!)

More radically the first of the two loops in word_sim has no effect on the output and can be removed.

You can try but chances are that you would get similar speed in Julia if you write the code in the same way.

4 Likes

Many thanks for all the helpful comments, they are greatly appreciated!

Hi there! Seems like your function spends about 50% with the RNG. Precalculating the random numbers and using @inbounds removes another 25ms on my system (although we calculate the random number even when not needed).
Please verify if my indexing is correct (probably not). You can then get more performance by skipping the unneeded RNG operations (mu > tests[i] is known before the loop)


function word_sim3(tau::Int, omega::Int, mu::Float64)
    # inserts a word in position (tau+1), at each point creates a new word with prob mu
    # otherwise randomly chooses a previously used. Runs the program until time omega
    
    words = Array{Int32}(undef, omega) # to store the words
    tests = rand(omega) # will compare mu to these
    words[1] = 1; # initialize the words
    next_word = 2 # will be the next word used
    words[tau+1] = omega + 1; # max possible word so insert that at time tau
    rand0 = [rand(UInt32)%(i-1)+1 for i=2:omega] # precalc. random numbers
    @inbounds for i = 2:tau # simulate the process
        if mu > tests[i] # innovate 
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            #words[i] = words[rand(1:(i-1))]
            words[i] = words[rand0[i-1]]
        end
    end
    # force the word we're interested in
    @inbounds for i = (tau+2):omega
        if mu > tests[i] # innovate 
            words[i] = next_word
            next_word = next_word + 1
        else # copy
            #words[i] = words[rand(1:(i-1))]
            words[i] = words[rand0[i-1]]
        end
    end
    result = count(==(omega + 1), words) # count how many times our word occurred
    return result
end

My understanding is the default MersenneTwister RNG is more efficient if you use the rand(T, n) methods which uses the dSFMT algorithm (rather than something like [rand(T) for _ = 1:n]). So here it could be even faster to do rand0 = rand(UInt32, omega) and then do the % stuff after (either right away or where it gets used).

2 Likes

It’s worth noting that 1.5 will likely have a faster randomness in general and random ranges in particular. See https://github.com/JuliaLang/julia/pull/34852 and https://github.com/JuliaLang/julia/pull/29240

2 Likes

This is very interesting, I hadn’t seen the inbounds functionality before. Really useful thanks!

This is very true (in particular for Float64), so I don’t see how allocating an array with [rand(T) for _ = 1:n] can lead to faster code than calling rand(T) on demand.

Something you might try is to pass explicitly an rng (MersenneTwister instance) in calls to rand, as since Julia 1.3, using the default implicit RNG has a measurable overhead (due to having to be thread-safe).

1 Like

FYI: I great way to profile, with flame graphs, is https://github.com/timholy/ProfileView.jl

What version of Julia did you use?

I got 0.66 sec on Julia 1.0.5 but 0.195 sec for word_sim3. Julia 1.3 had some known rand speed regression, and that issue is still open.

I didn’t match your speed, your PC seems just faster (I could just match it with word_sim3, then best 0.233 sec on Julia 1.5, not as good as Julia 1.0.5).

FYI: I couldn’t generate integer random numbers with (not meant too, or just not possible yet?), but for floating-point should be faster:

For my Julia 1.5-DEV (while not latest nightly) I got similar speed on the original code to Julia 1.4.0 (while less penalty for the first run), but if anything slightly faster for Julia 1.4.0 (didn’t test 1.4.1):

0.416916 seconds (798.10 k allocations: 164.778 MiB, 1.80% gc time)

for first run: 0.651927 seconds (865.73 k allocations: 168.292 MiB, 25.31% gc time)

For word_sim3:
0.233139 seconds (498.10 k allocations: 222.762 MiB, 2.88% gc time)

Your first problem is that you’re benchmarking in global scope with non-const globals. Make sure to wrap your code in a function and benchmark that using the BenchmarkTools package.

As for the tests and innovates vectors, you don’t need to allocate them anywhere, inside or outside the function. Just write if mu > rand() each time. No need for preallocation.

As a matter of style, I think you should use vectors instead of matrices, that is,

words = zeros(Int32, omega)

instead of

unless there’s some particular reason for that choice.

5 Likes

The performance improvements on rand have not yet been merged into master.

1 Like

While I took a look at the, not yet merged, PR for the algorithm (and didn’t actually locate any division used…):

The currently used algorithm understandably has differing speed depending on the exact range used [EDIT: thanks to @rfourquet for $rng tip that makes my way only relatively stronger]:

rng = MersenneTwister();

julia> @btime rand($rng, 0:2)
  11.120 ns (0 allocations: 0 bytes)

julia> @btime rand($rng, 1:3)
  11.108 ns (0 allocations: 0 bytes)

# before without $rng:
15.286 ns (0 allocations: 0 bytes)

julia> @btime rand($rng, 0:3)
  6.728 ns (0 allocations: 0 bytes)

julia> @btime rand(0:3)
  10.464 ns (0 allocations: 0 bytes)

I looked a bit into what’s currently possible without the algorithm, at least if you can handle bias, and got more than twice the speed for the same range:

julia> @btime (trunc(Int64, rand($rng, UInt16)*(1/3)))>>13
  5.243 ns (0 allocations: 0 bytes)

julia> @btime 1+(trunc(Int64, rand($rng, UInt16)*(1/3)))>>13
  5.607 ns (0 allocations: 0 bytes)

# without $rng:
  8.968 ns (0 allocations: 0 bytes)


julia> rA = collect([1+(trunc(Int64, rand(UInt16)*(1/3)))>>13 for i in 1:10000000]);

# However, no bias for e.g. 1:2 or 1:4 range, and still a bit faster:
julia> rB = collect([1+(trunc(Int64, rand(UInt16)*(1/2)))>>(14) for i in 1:10000000]);
julia> rC = collect([1+(rand(UInt16)>>(8+6)) for i in 1:10000000]);


julia> count(i->i==3, rA)
2501880

julia> count(i->i==2, rA)
3749129

julia> count(i->i==1, rA)
3748991

I missed this:

seed = [0:1:(omega-1);]; 

Don’t do this, do

seed = 0:(omega-1)

Collecting ranges into vectors is something you should almost never do. This is actually my pet peeve :grin:

4 Likes

Frankly, I don’t think you need to create even the words array. You just need to repeatedly do a random test, check if it satisfies the probability of getting omega, and then recalculate the probability of getting omega the next time. Then keep track of the count. I don’t think you need any vector at all.

1 Like

If you are interested in previous discussions, you can have a look at https://github.com/JuliaLang/julia/issues/20582 and rand(1:10) vs Int(round(10*rand()) (and maybe https://github.com/JuliaLang/julia/pull/28987 which implements a biased rand(a:b)).

Also, I don’t know which version of julia you are using, but on 1.3 and more, I will repeat myself by recommending to use an explicit RNG when you benchmark, if what you want to measure is the performance of the algorithm to sample from a range, i.e. @btime rand($rng, 1:3) and not @btime rand(1:3) where rng = MersenneTwister();.

1 Like

Thanks, my rand method (in other comment) is a bit faster, as it’s without floating point, and sometimes (but sometimes not) unbiased. I however took that other claimed always unbiased version and made the fastest variant yet (on my machine);

rng = MersenneTwister();

julia> @btime begin
[..]
               pops[jj] = word_sim4($rng, tau, omega, mu);
[..]
  189.741 ms (500003 allocations: 222.79 MiB)


# I changed one line (two with the first line, I wasn't sure how
# to do otherwise) from word_sim3 while I doubt this pre-allocated
# rand array scheme will be fastest in the future

julia> function word_sim4(rng, tau::Int, omega::Int, mu::Float64)
           # inserts a word in position (tau+1), at each point creates a new word with prob mu
           # otherwise randomly chooses a previously used. Runs the program until time omega

           words = Array{Int32}(undef, omega) # to store the words
           tests = rand(omega) # will compare mu to these
           words[1] = 1; # initialize the words
           next_word = 2 # will be the next word used
           words[tau+1] = omega + 1; # max possible word so insert that at time tau
           rand0 = [1 + floor(Int, rand(rng)*i) for i=1:omega-1] # precalc. random numbers
           @inbounds for i = 2:tau # simulate the process
               if mu > tests[i] # innovate 
                   words[i] = next_word
                   next_word = next_word + 1
               else # copy
                   #words[i] = words[rand(1:(i-1))]
                   words[i] = words[rand0[i-1]]
               end
           end
           # force the word we're interested in
           @inbounds for i = (tau+2):omega
               if mu > tests[i] # innovate 
                   words[i] = next_word
                   next_word = next_word + 1
               else # copy
                   #words[i] = words[rand(1:(i-1))]
                   words[i] = words[rand0[i-1]]
               end
           end
           result = count(==(omega + 1), words) # count how many times our word occurred
           return result
       end
word_sim4 (generic function with 1 method)
1 Like