Fastest way to simulate the birthday paradox

Trying to write a script for a video to explain some Julia programming concepts. Was trying to use the birthday “paradox” as example. It states that in a room of 23 ppl, there is a 50% chance that at least one pair of person will have the same birthday.

I am trying to write code to simulate this to confirm the results.

This is the fastest code I can come up with. It’s about 0.25s for one million simulations

I have used these tricks @inbounds, pre-allocation to avoid repeated allocation. Customised, one-by-one randomly bday generation, to avoid having to generate all n birthdays at once.

I wonder if there are other tricks? Interested in interesting tricks I have missed.

Interestingly I tried AutoPreAllocation.jl but it gave error, so that didn’t work.

# Birtyday paradox
function atleast2_v3!(d, npersons)
    d .= false
    @inbounds d[rand(1:365)] = true
    @inbounds for _ in 2:npersons
        bday = rand(1:365)
        d[bday] && return true
        d[bday] = true
    end
    false
end

function simulate_bday_v3(npersons, ntimes=1_000_000)
    cnt = 0
    d = falses(365)
    for _ in 1:ntimes
        cnt += atleast2_v3!(d, npersons)
    end
    cnt/ntimes
end

using BenchmarkTools
@time simulate_bday_v3(23, 1_000_000)

Just came up with another way which is not to keep an array of birthday at all

# Birtyday paradox
@inline function atleast2_v6(npersons)
    @inbounds for i in 1:npersons-1
        rand() < i/365 && return true
    end
    false
end

function simulate_bday_v6(npersons, ntimes=1_000_000)
    cnt = 0
    for _ in 1:ntimes
        cnt += atleast2_v6(npersons)
    end
    cnt/ntimes
end

using BenchmarkTools
@time simulate_bday_v6(23, 1_000_000)

@benchmark simulate_bday_v6(23, 1_000_000)

It’s now down to 78ms per 1 million simulations.

Multithreading will help for sure, but I think optimising single-threaded-ness is a good start. I can cut it down to 32 using 6 threads on my i7

# Birtyday paradox
@inline function atleast2_v8(npersons)
    @inbounds for i in 1:npersons-1
        365rand() < i && return true
    end
    false
end

function _simulate_bday_v8(npersons, ntimes=1_000_000)
    cnt = 0
    for _ in 1:ntimes
        cnt += atleast2_v8(npersons)
    end
    cnt
end

using Base.Threads
function simulate_bday_v8(npersons, ntimes=1_000_000)
    cnt = zeros(Int, Threads.nthreads())
    @threads for t in 1:Threads.nthreads()
        cnt[t] = _simulate_bday_v8(npersons, ntimes ÷ Threads.nthreads())
    end
    sum(cnt)/(ntimes ÷ Threads.nthreads())/Threads.nthreads()
end

using BenchmarkTools
@time simulate_bday_v8(23, 1_000_000)

@benchmark simulate_bday_v8(23, 1_000_000)

# BenchmarkTools.Trial:
#   memory estimate:  5.23 KiB
#   allocs estimate:  32
#   --------------
#   minimum time:     17.380 ms (0.00% GC)
#   median time:      18.441 ms (0.00% GC)
#   mean time:        18.734 ms (0.00% GC)
#   maximum time:     26.517 ms (0.00% GC)
#   --------------
#   samples:          267
#   evals/sample:     1
1 Like

The GPU version can reach a scary 2ms on my RTX2080 (no Ti)

BenchmarkTools.Trial:
  memory estimate:  1.28 KiB
  allocs estimate:  42
  --------------
  minimum time:     1.813 ms (0.00% GC)
  median time:      1.930 ms (0.00% GC)
  mean time:        76.165 ms (0.04% GC)
  maximum time:     304.536 ms (0.00% GC)
  --------------
  samples:          66
  evals/sample:     1

Code

# Birtyday paradox

using Pkg

Pkg.activate("c:/scratch/cuda-test")

using CUDA

CUDA.allowscalar(false)

N = 1_000_000

# The winner of bencharmks/bencharmks-countmap

function atleast2_gpu_v1!(b, e)

    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    stride = gridDim().x * blockDim().x

    t = threadIdx().x

    j = Float32(0.0)

    for idx in i:stride:size(e, 2)

        local_j = Float32(0.0)

        for k in 1:23-1

            if e[k, idx] < k/365

                local_j = max(local_j, Float32(1.0))

            end

        end

        j += local_j

    end

    @atomic b[t] = b[t] + j

    return

end

function simulate_bday_gpu_v1(; threads = 256, blocks = N ÷ 256)

    b = CUDA.zeros(Float32, threads)

    r = CUDA.rand(Float32, 23-1, N)

    CUDA.@sync @cuda threads = threads blocks = blocks atleast2_gpu_v1!(b, r)

    sum(b)/1_000_000

end

using BenchmarkTools

@benchmark CUDA.@sync blocking = false simulate_bday_gpu_v1()

4 Likes