I’m doing some simulations and need to efficiently sample a set of items, and then subset/filter a large vector (>1 million entries) whose elements are contained in the sampled items. Here’s some minimal code.
Random.seed!(1234)
# generate some random data and corresponding element weights
DF = sample(1:15000, 1500000, replace = true)
wts = rand(Uniform(), 15000)
@time for s in 1:500
# sample a subset of elements based on the weights
samp = sample(1:15000, Weights(wts), 100, replace = false)
# sample a single index from DF corresponding to each element
sim_inds = [rand(findall(x-> x == i, DF)) for i in samp]
# now use sim_inds for additional calculations....
end
26.333749 seconds (788.80 k allocations: 8.999 GiB, 0.39% gc time, 0.39% compilation time)
I’m just surprised at how long this takes. I’ve tried various other approaches using DataFrames, groupby, and so on, but the bottleneck always seems to be the findall step, or its equivalent, where I filter/match the sampled elements against the large vector. Any thoughts on how to speed this up?
EDIT: I’ve solved my own question in the act of posting it. StratifiedRandomSub in MLBase provides a massive speed boost:
@time begin
# sample one index per element
strat = hcat(collect(StratifiedRandomSub(DF, 15000, 500))...)
# subsample the indices based on weights
a = [sample(strat[:,i], Weights(wts), 100) for i in 1:500]
end
1.426467 seconds (7.62 M allocations: 903.555 MiB, 5.83% gc time, 2.54% compilation time)
But any further efficiency boost would be much appreciated!
You should never do benchmarking on non-const variables in global scope. Wrap your code in a function, and pass variables as input and output arguments.
And one more thing: try to make your example self-contained. Right now Random.seed!(), Uniform(), and Weights() are undefined and presumably come from some packages. Can you include the necessary code to load the correct libraries?
using Random, Distributions, StatsBase, BenchmarkTools
# generate some random data and corresponding element weights
Random.seed!(1234)
DF = sample(1:15000, 1500000, replace = true)
wts = rand(Uniform(), 15000)
function original(DF,wts,N)
for s in 1:N
# sample a subset of elements based on the weights
samp = sample(1:15000, Weights(wts), 100, replace = false)
# sample a single index from DF corresponding to each element
sim_inds = [rand(findall(x-> x == i, DF)) for i in samp]
# now use sim_inds for additional calculations....
end
end
w = Weights(wts)
indices = [findall(x -> x==i,DF) for i in 1:15000]
function my_try(indices,w,N)
sim_inds = zeros(Int64,100)
for s in 1:N
samp = sample(1:15000, w, 100, replace = false)
for i in 1:length(samp)
sim_inds[i] = rand(indices[samp[i]])
end
# now use sim_inds for additional calculations....
end
end
@btime original(DF,wts,10) # 828ms
@btime my_try(indices,w,10) # 843μs
This will get even better if you increase the number of repeats, since the index lists are built only once and each iteration then just does the weighted sample plus lookups.
Thanks for the suggestion. But the timing ignores the time spent calculating the indices, which is still the bottleneck:
function get_inds(z::Vector{Int64}, r::UnitRange{Int64})
# one findall per value in r: each call scans all of z, so this is O(length(z) * length(r))
indices = [findall(x -> x==i, z) for i in r]
return indices
end
@btime get_inds(DF, 1:15000)
# 7.585 s (120040 allocations: 2.70 GiB)
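If building that index table is the remaining bottleneck, it can be done in a single pass over the vector instead of one findall per value, dropping the cost from O(length(z) * length(r)) to roughly O(length(z)). Here is a minimal sketch, assuming every element of z lies in r (get_inds_fast is just an illustrative name, not an existing function):
function get_inds_fast(z::Vector{Int64}, r::UnitRange{Int64})
# one (initially empty) list of positions per value in r
indices = [Int64[] for _ in r]
# single pass over z: push each position onto the list for its value
for (j, v) in pairs(z)
push!(indices[v - first(r) + 1], j)
end
return indices
end
# drop-in replacement for get_inds above:
# indices = get_inds_fast(DF, 1:15000)
The returned vector of index lists has the same layout as the one from get_inds, so it can be passed straight to my_try.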