Trying to write a script for a video to explain some Julia programming concepts. Was trying to use the birthday “paradox” as example. It states that in a room of 23 ppl, there is a 50% chance that at least one pair of person will have the same birthday.
I am trying to write code to simulate this to confirm the results.
This is the fastest code I can come up with. It’s about 0.25s for one million simulations
I have used these tricks @inbounds
, pre-allocation to avoid repeated allocation. Customised, one-by-one randomly bday generation, to avoid having to generate all n
birthdays at once.
I wonder if there are other tricks? Interested in interesting tricks I have missed.
Interestingly I tried AutoPreAllocation.jl but it gave error, so that didn’t work.
# Birtyday paradox
function atleast2_v3!(d, npersons)
d .= false
@inbounds d[rand(1:365)] = true
@inbounds for _ in 2:npersons
bday = rand(1:365)
d[bday] && return true
d[bday] = true
end
false
end
function simulate_bday_v3(npersons, ntimes=1_000_000)
cnt = 0
d = falses(365)
for _ in 1:ntimes
cnt += atleast2_v3!(d, npersons)
end
cnt/ntimes
end
using BenchmarkTools
@time simulate_bday_v3(23, 1_000_000)
Just came up with another way which is not to keep an array of birthday at all
# Birtyday paradox
@inline function atleast2_v6(npersons)
@inbounds for i in 1:npersons-1
rand() < i/365 && return true
end
false
end
function simulate_bday_v6(npersons, ntimes=1_000_000)
cnt = 0
for _ in 1:ntimes
cnt += atleast2_v6(npersons)
end
cnt/ntimes
end
using BenchmarkTools
@time simulate_bday_v6(23, 1_000_000)
@benchmark simulate_bday_v6(23, 1_000_000)
It’s now down to 78ms per 1 million simulations.
Multithreading will help for sure, but I think optimising single-threaded-ness is a good start. I can cut it down to 32 using 6 threads on my i7
# Birtyday paradox
@inline function atleast2_v8(npersons)
@inbounds for i in 1:npersons-1
365rand() < i && return true
end
false
end
function _simulate_bday_v8(npersons, ntimes=1_000_000)
cnt = 0
for _ in 1:ntimes
cnt += atleast2_v8(npersons)
end
cnt
end
using Base.Threads
function simulate_bday_v8(npersons, ntimes=1_000_000)
cnt = zeros(Int, Threads.nthreads())
@threads for t in 1:Threads.nthreads()
cnt[t] = _simulate_bday_v8(npersons, ntimes ÷ Threads.nthreads())
end
sum(cnt)/(ntimes ÷ Threads.nthreads())/Threads.nthreads()
end
using BenchmarkTools
@time simulate_bday_v8(23, 1_000_000)
@benchmark simulate_bday_v8(23, 1_000_000)
# BenchmarkTools.Trial:
# memory estimate: 5.23 KiB
# allocs estimate: 32
# --------------
# minimum time: 17.380 ms (0.00% GC)
# median time: 18.441 ms (0.00% GC)
# mean time: 18.734 ms (0.00% GC)
# maximum time: 26.517 ms (0.00% GC)
# --------------
# samples: 267
# evals/sample: 1
1 Like
The GPU version can reach a scary 2ms on my RTX2080 (no Ti)
BenchmarkTools.Trial:
memory estimate: 1.28 KiB
allocs estimate: 42
--------------
minimum time: 1.813 ms (0.00% GC)
median time: 1.930 ms (0.00% GC)
mean time: 76.165 ms (0.04% GC)
maximum time: 304.536 ms (0.00% GC)
--------------
samples: 66
evals/sample: 1
Code
# Birtyday paradox
using Pkg
Pkg.activate("c:/scratch/cuda-test")
using CUDA
CUDA.allowscalar(false)
N = 1_000_000
# The winner of bencharmks/bencharmks-countmap
function atleast2_gpu_v1!(b, e)
i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
t = threadIdx().x
j = Float32(0.0)
for idx in i:stride:size(e, 2)
local_j = Float32(0.0)
for k in 1:23-1
if e[k, idx] < k/365
local_j = max(local_j, Float32(1.0))
end
end
j += local_j
end
@atomic b[t] = b[t] + j
return
end
function simulate_bday_gpu_v1(; threads = 256, blocks = N ÷ 256)
b = CUDA.zeros(Float32, threads)
r = CUDA.rand(Float32, 23-1, N)
CUDA.@sync @cuda threads = threads blocks = blocks atleast2_gpu_v1!(b, r)
sum(b)/1_000_000
end
using BenchmarkTools
@benchmark CUDA.@sync blocking = false simulate_bday_gpu_v1()
6 Likes