I’m doing some simulations and need to efficiently sample a set of items, and then subset/filter a large vector (>1 million entries) whose elements are contained in the sampled items. Here’s some minimal code.
Random.seed!(1234)
# generate some random data and corresponding element weights
DF = sample(1:15000, 1500000, replace = true)
wts = rand(Uniform(), 15000)
@time for s in 1:500
# sample a subset of elements based on the weights
samp = sample(1:15000, Weights(wts), 100, replace = false)
# sample a single index from DF corresponding to each element
sim_inds = [rand(findall(x-> x == i, DF)) for i in samp]
# now use sim_inds for additional calculations....
end
26.333749 seconds (788.80 k allocations: 8.999 GiB, 0.39% gc time, 0.39% compilation time)
I’m just surprised at how long this takes. I’ve tried various other approaches using DataFrames, groupby, and so on, but the bottleneck always seems to be the findall step, or its equivalent, where I filter/match the sampled elements against the large vector. Any thoughts on how to speed this up?
EDIT: I’ve solved my own question in the act of posting it. StratifiedRandomSub in MLBase provides a massive speed boost:
@time begin
# sample one index per element
strat = hcat(collect(StratifiedRandomSub(DF, 15000, 500))...)
# subsample the indices based on weights
a = [sample(strat[:,i], Weights(wts), 100) for i in 1:500]
end
1.426467 seconds (7.62 M allocations: 903.555 MiB, 5.83% gc time, 2.54% compilation time)
But any further efficiency boost would be much appreciated!
You should never do benchmarking on non-const variables in global scope. Wrap your code in a function, and pass variables as input and output arguments.
And one more thing: try to make your example self-contained. Right now Random.seed!(), Uniform(), and Weights() are undefined and presumably come from some packages. Can you include the necessary code to load the correct libraries?
using Random, Distributions, StatsBase, BenchmarkTools
# generate some random data and corresponding element weights
Random.seed!(1234)
DF = sample(1:15000, 1500000, replace = true)
wts = rand(Uniform(), 15000)
function original(DF,wts,N)
for s in 1:N
# sample a subset of elements based on the weights
samp = sample(1:15000, Weights(wts), 100, replace = false)
# sample a single index from DF corresponding to each element
sim_inds = [rand(findall(x-> x == i, DF)) for i in samp]
# now use sim_inds for additional calculations....
end
end
w = Weights(wts)
indices = [findall(x -> x==i,DF) for i in 1:15000]
function my_try(indices,w,N)
sim_inds = zeros(Int64,100)
for s in 1:N
samp = sample(1:15000, w, 100, replace = false)
for i in 1:length(samp)
sim_inds[i] = rand(indices[samp[i]])
end
# now use sim_inds for additional calculations....
end
end
@btime original(DF,wts,10) # 828ms
@btime my_try(indices,w,10) # 843μs
This will get even better if you increase the number of repeats, since the index lists are built only once and each iteration then just does the weighted sample plus lookups.
Thanks for the suggestion. But the timing ignores the time spent calculating the indices, which is still the bottleneck:
function get_inds(z::Vector{Int64}, r::UnitRange{Int64})
# one findall per value in r: each call scans all of z, so this is O(length(z) * length(r))
indices = [findall(x -> x==i, z) for i in r]
return indices
end
@btime get_inds(DF, 1:15000)
# 7.585 s (120040 allocations: 2.70 GiB)
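If building that index table is the remaining bottleneck, it can be done in a single pass over the vector instead of one findall per value, dropping the cost from O(length(z) * length(r)) to roughly O(length(z)). Here is a minimal sketch, assuming every element of z lies in r (get_inds_fast is just an illustrative name, not an existing function):
function get_inds_fast(z::Vector{Int64}, r::UnitRange{Int64})
# one (initially empty) list of positions per value in r
indices = [Int64[] for _ in r]
# single pass over z: push each position onto the list for its value
for (j, v) in pairs(z)
push!(indices[v - first(r) + 1], j)
end
return indices
end
# drop-in replacement for get_inds above:
# indices = get_inds_fast(DF, 1:15000)
The returned vector of index lists has the same layout as the one from get_inds, so it can be passed straight to my_try.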