I think the idea is that you use statistics. If every user gets a random number, you would not expect 65k users to all end up with different numbers. I’m sure there’s some fancy math that @Karajan is referring to that gives more precision, but here’s my naive attempt:
function bootstrap_users(ids, users)
    fraction_ids = Float64[]
    for i in 1:500
        # draw one random id per user, then record the fraction of ids that got used
        u = rand(1:ids, users)
        push!(fraction_ids, length(unique(u)) / ids)
    end
    return fraction_ids
end
maximum(bootstrap_users(65_000, 65_000)) # 0.635200
maximum(bootstrap_users(65_000, 200_000)) # 0.956615
maximum(bootstrap_users(65_000, 500_000)) # 0.999754
So even with 500k users and only 65k numbers, the most distinct numbers used in any of the 500 iterations was 64,984 (0.999754 × 65,000), still short of all 65,000.
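For comparison, here is a rough sketch of what I assume the "fancy math" looks like for the average case. If every draw is independent and uniform (which is what `rand(1:ids, users)` does), a given id is missed by all users with probability (1 - 1/ids)^users, so the expected fraction of ids used is one minus that. The function names `expected_fraction` and `approx_fraction` are made up for this illustration.

```julia
# Expected fraction of ids used when `users` independent uniform draws
# are made from 1:ids. One draw misses a given id with probability
# 1 - 1/ids, so all draws miss it with probability (1 - 1/ids)^users.
expected_fraction(ids, users) = 1 - (1 - 1 / ids)^users

# Approximation for large ids: (1 - 1/ids)^users ≈ exp(-users / ids)
approx_fraction(ids, users) = 1 - exp(-users / ids)

for users in (65_000, 200_000, 500_000)
    println(users, " users: exact ", expected_fraction(65_000, users),
            ", approx ", approx_fraction(65_000, users))
end
```

This should give about 0.632, 0.954, and 0.9995. The simulated values above are maxima over 500 runs, so it makes sense that they sit slightly above these averages.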
EDIT: because I couldn’t resist:
function bootstrap_users(ids, users)
    fraction_ids = Float64[]  # empty vector; a bare `Float64` is the type and push! would fail
    for i in 1:500
        u = rand(1:ids, users)
        push!(fraction_ids, length(unique(u)) / ids)
    end
    return fraction_ids
end
u65k = bootstrap_users(65_000, 65_000)
u200k = bootstrap_users(65_000, 200_000)
u500k = bootstrap_users(65_000, 500_000)
u600k = bootstrap_users(65_000, 600_000)
u800k = bootstrap_users(65_000, 800_000)
u1m = bootstrap_users(65_000, 1_000_000)
using StatsPlots
plot(histogram(u65k, primary=false, title="65k"),
     histogram(u200k, primary=false, title="200k"),
     histogram(u500k, primary=false, title="500k"),
     histogram(u600k, primary=false, title="600k"))