Effective simulation of putting n-1 balls in n boxes uniformly at random

Xing_Shi_Cai · May 2, 2019, 12:30pm

I am trying to learn Julia so I wrote a bit code to test its performance.
In the following code snippet I created a function balls_in_box to simulate putting n-1 balls in n boxes uniformly at random.
One my machine, on average the function takes 600ms for 10^7 boxes.
Is there more space for improvement of the speed?

using Random
using BenchmarkTools

Random.seed!(1234)

function balls_in_box(n::Int)
    box_seq::Vector{Int32} = zeros(Int32, n)
    for i::Int in 1:(n-1)
        box::Int = rand(1:n)
        box_seq[box]+=1
    end
    box_seq
end

t = @benchmark balls_in_box(10^7)
show(stdout, "text/plain", t)
println()

stevengj · May 2, 2019, 12:50pm

Note that your type declarations don’t make anything faster. The function

function balls_in_box(n)
    box_seq = zeros(Int32, n)
    for i = 1:n-1
        box = rand(1:n)
        box_seq[box] += 1
    end
    box_seq
end

will run with exactly the same speed. When you call balls_in_box(10^7), Julia automatically compiles a specialized version for n::Int, and it is able to infer all of the other types already.

In your problem it seems that virtually all of the time is spent in random-number generation, so the only way to speed it up significantly is to make rand faster. This is not impossible. For example, if you allocate a small buffer and generate a batch of random numbers at once, I think it may be faster than individual calls because there is a SIMD implementation. And there are other random number generators that have different performance and statistical characteristics.

oheil · May 2, 2019, 12:57pm

On my machine:

julia> t = @benchmark balls_in_box(10^7)
BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  2
  --------------
  minimum time:     465.856 ms (1.48% GC)
  median time:      482.873 ms (1.43% GC)
  mean time:        495.967 ms (3.02% GC)
  maximum time:     541.727 ms (9.23% GC)
  --------------
  samples:          11
  evals/sample:     1

and

julia> rng=MersenneTwister(1234)
julia> t = @benchmark(rand(rng,1:10^7,10^7-1))
BenchmarkTools.Trial:
  memory estimate:  76.29 MiB
  allocs estimate:  4
  --------------
  minimum time:     163.490 ms (0.10% GC)
  median time:      179.165 ms (7.47% GC)
  mean time:        181.795 ms (8.60% GC)
  maximum time:     217.014 ms (25.46% GC)
  --------------
  samples:          28
  evals/sample:     1

which should be the same n-1 balls into n boxes with n=10^7.

Xing_Shi_Cai · May 2, 2019, 1:06pm

No, this is not the same. This gives which box each ball should go. I want to count the number of balls in each box.

I also do not want to generate this array rand(rng,1:10^7,10^7-1) in a batch, since I don’t want to keep it in the end. That’s why I wrote the code in the way I did.

oheil · May 2, 2019, 1:15pm

Yes, you are right, I was too hasty. But maybe analyzing rand could give some hints on how to get more out of it.

stevengj · May 2, 2019, 1:22pm

You can generate the random numbers in smaller batches to avoid the memory and cache overhead of such a large temporary array.

For example,

using Random
function balls_in_box_batched(n, nbatch=1024)
    box_seq = zeros(Int32, n)
    batch = Vector{Int}(undef, nbatch)
    i = 0
    while i + nbatch < n
        rand!(batch, 1:n)
        for j = 1:nbatch
            @inbounds box_seq[batch[j]] += 1
        end
        i += nbatch
    end
    for j = i+1:n-1
        box = rand(1:n)
        box_seq[box] += 1
    end
    box_seq
end

is about 70% faster on my machine than the un-batched version. As I said, calling rand! to generate multiple random numbers at once (in this case 1024), in a pre-allocated array, is faster than generating random numbers one at a time because it exploits a SIMD version of the Mersenne-twister random-number algorithm.

rfourquet · May 2, 2019, 4:25pm

@stevengj: the perf improvement with your batch solution is quite interesting, but I don’t think it comes from SIMD (I believe the generation of the raw underlying integers benefit from SIMD in any case, and they are in both cases transformed into an integer in 1:n in the same way, in a regular (probably non-SIMD) loop).

In my benchmarks, the batched version is puzzlingly faster for n==10^7, but not necessarily for smaller numbers. I don’t understand the reason, but hope it will sorted out!

@Xing_Shi_Cai: two modest other possible improvements to your original version are using @inbounds and Random.Sampler (which factors out repetitive computation, but this is not as spectacular as it used to be):

function balls_in_box(n)
    box_seq = zeros(Int32, n)
    sp = Random.Sampler(MersenneTwister, 1:n)
    for i = 1:n-1
        box = rand(sp)
        @inbounds box_seq[box] += 1
    end
    box_seq
end

Xing_Shi_Cai · May 2, 2019, 5:33pm

This actually decreases the time it takes to about 450 ms on my machine. That’s quite a lot of gain!

Though it’s still a bit slower than numba version of this code (360 ms). I guess maybe numpy has a faster random number generator.

from numba import jit
import numpy as np
import random

@jit(nopython=True)
def balls_in_box(n):
    box_seg = np.zeros(n, dtype=np.uint32)
    for i in range(n-1):
        box = random.randint(0, n-1)
        box_seg[box]+=1
    return box_seg

longemen3000 · May 3, 2019, 7:52am

i wanted to tackle the problem from the random number generator, so i made some methods to imitate the randint function, and based on the fact that 10^7 < 2^63 -1 = typemax(Int32).
The basis for the speed improvement is that rand(Int64) is faster that rand(1:n), and rand(Int32) is faster that rand(Int64), so using those properties i made some functions to give random ints that beat the rand(1:n) in speed

#general function and special cases for Int64 and Int32
@inline function randint(max_int::Int)
    (abs(rand(typeof(max_int))%(max_int))+1)
end

@inline function randint(max_int::Int32)
    Int32(abs(rand(Int32)%(max_int))+1)
end

@inline function randint(max_int::Int64)
    (abs(rand(Int64)%(max_int))+1)
end

#a simple implementation of randint with two numbers, it can be optimized
@inline randint(min_int::Int,max_int::Int) = min_int + randint(max_int-min_int+1) - 1

# using the generic approach by stevengj, adding some inbounds and simd
function balls_in_box(n)
    box_seq = zeros(Int32, n)
    for i = 1:n-1
        box = rand(1:n)
        box_seq[box] += 1
    end
    box_seq
end

function balls_in_box2(n)
   if n>typemax(Int32)
   n2 = Int32(n)
else
n2=n
 box_seq = zeros(Int32, n)
    for i = 1:n2-1
        box = randint(n2)
        box_seq[box] += 1
    end
    box_seq
end

t = @benchmark balls_in_box(10^7)
t2 = @benchmark balls_in_box2(10^7)

results for rand(1:n)

BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  2
  --------------
  minimum time:     554.599 ms (0.53% GC)
  median time:      563.745 ms (1.48% GC)
  mean time:        583.468 ms (5.06% GC)
  maximum time:     722.065 ms (23.48% GC)
  --------------
  samples:          9
  evals/sample:     1

results for randint(n)

BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  2
  --------------
  minimum time:     266.915 ms (0.95% GC)
  median time:      271.125 ms (2.95% GC)
  mean time:        281.768 ms (6.66% GC)
  maximum time:     431.724 ms (39.26% GC)
  --------------
  samples:          18
  evals/sample:     1

with those modifications, the function has a 2x speed improvement.

longemen3000 · May 3, 2019, 8:07am

using the batched solution:

BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  3
  --------------
  minimum time:     348.853 ms (0.58% GC)
  median time:      356.399 ms (2.27% GC)
  mean time:        370.177 ms (5.85% GC)
  maximum time:     517.501 ms (32.55% GC)
  --------------
  samples:          14
  evals/sample:     1

using the batched solution with the randint function:

function balls_in_box_batched2(n, nbatch=1024)
    if n <= typemax(Int32)
        n2 = Int32(n)
        batch = Vector{Int32}(undef, nbatch)
     else
        n2 = n
        batch = Vector{Int}(undef, nbatch)
     end
    box_seq = zeros(Int32, n)
    
    i = 0
    while i + nbatch < n
        batch .= randint.(n2)
        for j = 1:nbatch
            @inbounds box_seq[batch[j]] += 1
        end
        i += nbatch
    end
    for j = i+1:n-1
        box = rand(1:n)
        box_seq[box] += 1
    end
    box_seq
end

results:

BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  3
  --------------
  minimum time:     246.998 ms (2.02% GC)
  median time:      251.451 ms (2.02% GC)
  mean time:        257.470 ms (4.47% GC)
  maximum time:     315.042 ms (21.76% GC)
  --------------
  samples:          20
  evals/sample:     1

the use of the batched solution with improved number generation is a little bit faster, but not a lot, maybe defining a mutating form (randint) can improve things further

foobar_lv2 · May 3, 2019, 10:11am

This is biased if max_int is not a power of 2.

A pretty fast way is described in https://arxiv.org/pdf/1805.10941.pdf. @rfourquet made a PR to that effect in implement "nearly division less" algorithm for rand(a:b) by rfourquet · Pull Request #29240 · JuliaLang/julia · GitHub, but that is somehow still in limbo.

Sample implementation:

julia> function nearlydivisionless(s::Union{Int64, UInt64})
       x=rand(UInt64)
       m = widemul(x, s)
       ell = m % UInt64
       if ell<s
       t= -s % s
       while ell < t
       x=rand(UInt64)
       m = widemul(x, s)
       ell = m % UInt64
       end
       end
       return (m>>64)%typeof(s)
       end
julia> @btime rand($(1:10^7));
  23.193 ns (0 allocations: 0 bytes)

julia> @btime nearlydivisionless($(10^7));
  9.198 ns (0 allocations: 0 bytes)

For fun side projects you can just monkey-patch his PR:

julia> @eval Random begin
               #### Nearly Division Less

        # cf. https://arxiv.org/abs/1805.10941

        struct SamplerRangeNDL{U<:Unsigned,T} <: Sampler{T}
           a::T  # first element of the range
           s::U  # range length or zero for full range
       end

        function SamplerRangeNDL(r::AbstractUnitRange{T}) where {T}
           isempty(r) && throw(ArgumentError("range must be non-empty"))
           a = first(r)
           U = uint_sup(T)
           s = (last(r) - first(r)) % unsigned(T) % U + one(U) # overflow ok
           # mod(-s, s) could be put in the Sampler object for repeated calls, but
           # this would be an advantage only for very big s and number of calls
           SamplerRangeNDL(a, s)
       end

        function rand(rng::AbstractRNG, sp::SamplerRangeNDL{U,T}) where {U,T}
           s = sp.s
           x = widen(rand(rng, U))
           m = x * s
           l = m % U
           if l < s
               t = mod(-s, s)
               while l < t
                   x = widen(rand(rng, U))
                   m = x * s
                   l = m % U
               end
           end
           (s == 0 ? x : m >> (8*sizeof(U))) % T + sp.a
       end

        # API until this is made the default
       fast(r::AbstractUnitRange{<:Base.BitInteger64}) = SamplerRangeNDL(r)
       fast(r::AbstractArray) = Random.SamplerSimple(r, fast(firstindex(r):lastindex(r))) 
       end

giving

julia> s=Random.fast(1:10^7);

julia> @btime rand($s)
  9.193 ns (0 allocations: 0 bytes)
1779832
julia> s=Random.fast(Int32(1):Int32(10^7));

julia> @btime rand($s)
  9.656 ns (0 allocations: 0 bytes)
2823504

longemen3000 · May 3, 2019, 5:46pm

Nice to know!, this only works when n<<typemax(Intx) and the bias increase when n is larger. This highlights the need of the randint function?
maybe this approach is better:

@inline function randint(max_int::Int)
    onex = one(max_int)
    return onex + convert(typeof(max_int),(max_int-onex)*round(rand()))
end

@inline function randint(max_int::Int32)
    onex = one(Int32)
    onex+Int32(round((max_int-onex)*rand()))
end

@inline function randint(max_int::Int64)
    1+Int64(round((max_int-1)*rand()))
end

this is also a more idiomatic way to understand what’s happening (rand() gives values from 0 to 1, (x-1)*rand() gives values from 0 to x-1, if you add one, then you have the range 1:x:

here are some results of changing those two functions (without batching):

BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  2
  --------------
  minimum time:     258.018 ms (2.05% GC)
  median time:      268.982 ms (2.00% GC)
  mean time:        277.695 ms (4.75% GC)
  maximum time:     332.600 ms (22.14% GC)
  --------------
  samples:          19
  evals/sample:     1

greg_plowman · May 4, 2019, 1:07am

See RFC: implement a biased rand(a:b) (fix #20582) by rfourquet · Pull Request #28987 · JuliaLang/julia · GitHub

Topic		Replies	Views
Best way to optimize code Performance question	29	2287	May 21, 2020
rand(1:10) vs Int(round(10*rand()) Performance	17	2305	September 18, 2018
Draw two distinct random integers from `1:n` Performance statistics , performance , random	24	3212	December 3, 2021
Generators vs loops vs broadcasting: Calculate PI via Monte Carlo Sampling Performance	7	1499	June 20, 2021
Performance of Julia as a general-purpose programming language Performance	30	7643	November 19, 2017

Effective simulation of putting n-1 balls in n boxes uniformly at random

Related topics