log(N) algo for sampling from vector


#1

Hi,

I am using the following piece of code

using StatsBase
pf = StatsBase.Weights(rand(10))
e = sample(pf)

Upon using @which sample(pf), one gets
sample(wv::StatsBase.AbstractWeights) in StatsBase at .../StatsBase/src/sampling.jl:425

for which the algorithm is super straightforward. For example, if t in this algorithm is very close to sum(wv), I would say it is better to go over wv starting from the end.

Hence my question, is it possible to do something like a binary search (like searchsorted) to do the same as StatsBase.sample?

I tried and I could not.

Thank you for your help,

Best regards


#2

Look further in that file for very efficient methods, eg the alias method.


#3

I forgot to mention that I need to draw only one sample from the vector.


#4

If all you have is a vector of relative weights, then you need to do at least some sort of O(n) preprocessing (since you have no idea where the weight is).

If there is some way that you could get the cumulative weights cheaply (i.e. cheaper than cpf = cumsum(pf), which is also O(n)), then you could do:

u = rand()*cpf[end]
searchsortedfirst(cpf, u)

which is O(log(n)).


#5

I tend to agree for most of your answer… I would improve the algo by starting from the beginning and from the last element at the same time though.


#6

I’m not sure what you mean: could you explain further?