I am trying to learn Julia so I wrote a bit code to test its performance.
In the following code snippet I created a function balls_in_box to simulate putting n-1 balls in n boxes uniformly at random.
One my machine, on average the function takes 600ms for 10^7 boxes.
Is there more space for improvement of the speed?

using Random
using BenchmarkTools
Random.seed!(1234)
function balls_in_box(n::Int)
box_seq::Vector{Int32} = zeros(Int32, n)
for i::Int in 1:(n-1)
box::Int = rand(1:n)
box_seq[box]+=1
end
box_seq
end
t = @benchmark balls_in_box(10^7)
show(stdout, "text/plain", t)
println()

Note that your type declarations donâ€™t make anything faster. The function

function balls_in_box(n)
box_seq = zeros(Int32, n)
for i = 1:n-1
box = rand(1:n)
box_seq[box] += 1
end
box_seq
end

will run with exactly the same speed. When you call balls_in_box(10^7), Julia automatically compiles a specialized version for n::Int, and it is able to infer all of the other types already.

In your problem it seems that virtually all of the time is spent in random-number generation, so the only way to speed it up significantly is to make rand faster. This is not impossible. For example, if you allocate a small buffer and generate a batch of random numbers at once, I think it may be faster than individual calls because there is a SIMD implementation. And there are other random number generators that have different performance and statistical characteristics.

No, this is not the same. This gives which box each ball should go. I want to count the number of balls in each box.

I also do not want to generate this array rand(rng,1:10^7,10^7-1) in a batch, since I donâ€™t want to keep it in the end. Thatâ€™s why I wrote the code in the way I did.

You can generate the random numbers in smaller batches to avoid the memory and cache overhead of such a large temporary array.

For example,

using Random
function balls_in_box_batched(n, nbatch=1024)
box_seq = zeros(Int32, n)
batch = Vector{Int}(undef, nbatch)
i = 0
while i + nbatch < n
rand!(batch, 1:n)
for j = 1:nbatch
@inbounds box_seq[batch[j]] += 1
end
i += nbatch
end
for j = i+1:n-1
box = rand(1:n)
box_seq[box] += 1
end
box_seq
end

is about 70% faster on my machine than the un-batched version. As I said, calling rand! to generate multiple random numbers at once (in this case 1024), in a pre-allocated array, is faster than generating random numbers one at a time because it exploits a SIMD version of the Mersenne-twister random-number algorithm.

@stevengj: the perf improvement with your batch solution is quite interesting, but I donâ€™t think it comes from SIMD (I believe the generation of the raw underlying integers benefit from SIMD in any case, and they are in both cases transformed into an integer in 1:n in the same way, in a regular (probably non-SIMD) loop).

In my benchmarks, the batched version is puzzlingly faster for n==10^7, but not necessarily for smaller numbers. I donâ€™t understand the reason, but hope it will sorted out!

@Xing_Shi_Cai: two modest other possible improvements to your original version are using @inbounds and Random.Sampler (which factors out repetitive computation, but this is not as spectacular as it used to be):

function balls_in_box(n)
box_seq = zeros(Int32, n)
sp = Random.Sampler(MersenneTwister, 1:n)
for i = 1:n-1
box = rand(sp)
@inbounds box_seq[box] += 1
end
box_seq
end

This actually decreases the time it takes to about 450 ms on my machine. Thatâ€™s quite a lot of gain!

Though itâ€™s still a bit slower than numba version of this code (360 ms). I guess maybe numpy has a faster random number generator.

from numba import jit
import numpy as np
import random
@jit(nopython=True)
def balls_in_box(n):
box_seg = np.zeros(n, dtype=np.uint32)
for i in range(n-1):
box = random.randint(0, n-1)
box_seg[box]+=1
return box_seg

i wanted to tackle the problem from the random number generator, so i made some methods to imitate the randint function, and based on the fact that 10^7 < 2^63 -1 = typemax(Int32).
The basis for the speed improvement is that rand(Int64) is faster that rand(1:n), and rand(Int32) is faster that rand(Int64), so using those properties i made some functions to give random ints that beat the rand(1:n) in speed

#general function and special cases for Int64 and Int32
@inline function randint(max_int::Int)
(abs(rand(typeof(max_int))%(max_int))+1)
end
@inline function randint(max_int::Int32)
Int32(abs(rand(Int32)%(max_int))+1)
end
@inline function randint(max_int::Int64)
(abs(rand(Int64)%(max_int))+1)
end
#a simple implementation of randint with two numbers, it can be optimized
@inline randint(min_int::Int,max_int::Int) = min_int + randint(max_int-min_int+1) - 1
# using the generic approach by stevengj, adding some inbounds and simd
function balls_in_box(n)
box_seq = zeros(Int32, n)
for i = 1:n-1
box = rand(1:n)
box_seq[box] += 1
end
box_seq
end
function balls_in_box2(n)
if n>typemax(Int32)
n2 = Int32(n)
else
n2=n
box_seq = zeros(Int32, n)
for i = 1:n2-1
box = randint(n2)
box_seq[box] += 1
end
box_seq
end
t = @benchmark balls_in_box(10^7)
t2 = @benchmark balls_in_box2(10^7)

results for rand(1:n)

BenchmarkTools.Trial:
memory estimate: 38.15 MiB
allocs estimate: 2
--------------
minimum time: 554.599 ms (0.53% GC)
median time: 563.745 ms (1.48% GC)
mean time: 583.468 ms (5.06% GC)
maximum time: 722.065 ms (23.48% GC)
--------------
samples: 9
evals/sample: 1

results for randint(n)

BenchmarkTools.Trial:
memory estimate: 38.15 MiB
allocs estimate: 2
--------------
minimum time: 266.915 ms (0.95% GC)
median time: 271.125 ms (2.95% GC)
mean time: 281.768 ms (6.66% GC)
maximum time: 431.724 ms (39.26% GC)
--------------
samples: 18
evals/sample: 1

with those modifications, the function has a 2x speed improvement.

BenchmarkTools.Trial:
memory estimate: 38.15 MiB
allocs estimate: 3
--------------
minimum time: 348.853 ms (0.58% GC)
median time: 356.399 ms (2.27% GC)
mean time: 370.177 ms (5.85% GC)
maximum time: 517.501 ms (32.55% GC)
--------------
samples: 14
evals/sample: 1

using the batched solution with the randint function:

function balls_in_box_batched2(n, nbatch=1024)
if n <= typemax(Int32)
n2 = Int32(n)
batch = Vector{Int32}(undef, nbatch)
else
n2 = n
batch = Vector{Int}(undef, nbatch)
end
box_seq = zeros(Int32, n)
i = 0
while i + nbatch < n
batch .= randint.(n2)
for j = 1:nbatch
@inbounds box_seq[batch[j]] += 1
end
i += nbatch
end
for j = i+1:n-1
box = rand(1:n)
box_seq[box] += 1
end
box_seq
end

results:

BenchmarkTools.Trial:
memory estimate: 38.15 MiB
allocs estimate: 3
--------------
minimum time: 246.998 ms (2.02% GC)
median time: 251.451 ms (2.02% GC)
mean time: 257.470 ms (4.47% GC)
maximum time: 315.042 ms (21.76% GC)
--------------
samples: 20
evals/sample: 1

the use of the batched solution with improved number generation is a little bit faster, but not a lot, maybe defining a mutating form (randint) can improve things further

julia> function nearlydivisionless(s::Union{Int64, UInt64})
x=rand(UInt64)
m = widemul(x, s)
ell = m % UInt64
if ell<s
t= -s % s
while ell < t
x=rand(UInt64)
m = widemul(x, s)
ell = m % UInt64
end
end
return (m>>64)%typeof(s)
end
julia> @btime rand($(1:10^7));
23.193 ns (0 allocations: 0 bytes)
julia> @btime nearlydivisionless($(10^7));
9.198 ns (0 allocations: 0 bytes)

For fun side projects you can just monkey-patch his PR:

julia> @eval Random begin
#### Nearly Division Less
# cf. https://arxiv.org/abs/1805.10941
struct SamplerRangeNDL{U<:Unsigned,T} <: Sampler{T}
a::T # first element of the range
s::U # range length or zero for full range
end
function SamplerRangeNDL(r::AbstractUnitRange{T}) where {T}
isempty(r) && throw(ArgumentError("range must be non-empty"))
a = first(r)
U = uint_sup(T)
s = (last(r) - first(r)) % unsigned(T) % U + one(U) # overflow ok
# mod(-s, s) could be put in the Sampler object for repeated calls, but
# this would be an advantage only for very big s and number of calls
SamplerRangeNDL(a, s)
end
function rand(rng::AbstractRNG, sp::SamplerRangeNDL{U,T}) where {U,T}
s = sp.s
x = widen(rand(rng, U))
m = x * s
l = m % U
if l < s
t = mod(-s, s)
while l < t
x = widen(rand(rng, U))
m = x * s
l = m % U
end
end
(s == 0 ? x : m >> (8*sizeof(U))) % T + sp.a
end
# API until this is made the default
fast(r::AbstractUnitRange{<:Base.BitInteger64}) = SamplerRangeNDL(r)
fast(r::AbstractArray) = Random.SamplerSimple(r, fast(firstindex(r):lastindex(r)))
end

Nice to know!, this only works when n<<typemax(Intx) and the bias increase when n is larger. This highlights the need of the randint function?
maybe this approach is better:

@inline function randint(max_int::Int)
onex = one(max_int)
return onex + convert(typeof(max_int),(max_int-onex)*round(rand()))
end
@inline function randint(max_int::Int32)
onex = one(Int32)
onex+Int32(round((max_int-onex)*rand()))
end
@inline function randint(max_int::Int64)
1+Int64(round((max_int-1)*rand()))
end

this is also a more idiomatic way to understand whatâ€™s happening (rand() gives values from 0 to 1, (x-1)*rand() gives values from 0 to x-1, if you add one, then you have the range 1:x:

here are some results of changing those two functions (without batching):

BenchmarkTools.Trial:
memory estimate: 38.15 MiB
allocs estimate: 2
--------------
minimum time: 258.018 ms (2.05% GC)
median time: 268.982 ms (2.00% GC)
mean time: 277.695 ms (4.75% GC)
maximum time: 332.600 ms (22.14% GC)
--------------
samples: 19
evals/sample: 1