Hi, everyone.

I am trying to perform disorder averaging over different realizations of a lattice with some of the nodes randomly thrown away.

Specifically, consider the following model: we have a lattice (graph) with N nodes, and each node is thrown away with probability p. A realization of the lattice is represented as a vector of N zeros and ones, where each component is an independent Bernoulli random variable (zero with probability p, one with probability 1-p). I then need to average some quantity over these random vectors.

I’ve tried to generate random vectors with the simple brute-force method `vec = [rand() < p for i in 1:N]`, but it seems that the set of vectors generated this way converges too slowly to the desired distribution.

Can someone suggest a more efficient way to generate random vectors of independent Bernoulli variables? The computation for each vector takes about 100 seconds, so I don’t care if the generation takes a couple of seconds. The number of runs is the problem, so I need to speed up statistical convergence.

Have you tried `Distributions.jl`? You can use `Binomial(n, p)` with `n = 1` like this:

```
using Distributions
p = 0.5
N = 100
d = Binomial(1, p)   # a Binomial(1, p) draw is a Bernoulli(p) draw
v = rand(d, N)       # Vector{Int} of N independent 0/1 samples
```

You might also want to look at the `randsubseq` function in the `Random` standard library, which allows you to efficiently find a random (Bernoulli) subsequence of a given array. It is *far* more efficient than calling `rand() < p` for every element of the array if `p` is much smaller than 1. (Or, conversely, you can apply it to the inverse draw if `1 - p` is small.)
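For example, a minimal sketch: `randsubseq(A, p)` keeps each element of `A` independently with probability `p`, and returns the survivors in their original order.

```julia
using Random

A = collect(1:10^6)       # the full index set, e.g. node indices of the lattice
p = 0.01                  # keep probability
kept = randsubseq(A, p)   # surviving indices, in increasing order
```

For the lattice problem, `A` could be `1:N` and `kept` the list of nodes that were not thrown away (with `p` set to the keep probability).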

You can find the implementation here. In particular, if you are only computing some *function* of the subsequence, you might want to adapt the implementation of `randsubseq` to compute your function directly, without actually allocating/constructing the subsequence array.
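A minimal sketch of that idea, using the same geometric-skip trick that makes `randsubseq` fast for small `p` (the function name `foreach_selected` is mine, and this assumes `0 < p < 1`):

```julia
# Visit the Bernoulli(p)-selected indices of 1:N without materializing
# the subsequence, by jumping ahead with geometrically distributed skips.
# Sketch only; assumes 0 < p < 1.
function foreach_selected(f, N, p)
    L = log1p(-p)                           # log(1 - p) < 0
    i = 0
    while true
        skip = floor(Int, log(rand()) / L)  # gap ~ Geometric(p): P(k) = (1-p)^k * p
        i += skip + 1
        i > N && break
        f(i)                                # process only the selected index i
    end
end

# Example: accumulate a sum over the selected indices, no array allocated.
total = Ref(0)
foreach_selected(i -> total[] += i, 10^6, 0.01)
```

Each index is visited with probability `p`, since the gap between consecutive selected indices of an i.i.d. Bernoulli(p) sequence is geometrically distributed.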

How can the computation of `vec` take 100 seconds? Even for `N = 10^9`, on my laptop this computation takes only 3 seconds. Remember not to benchmark in global scope — put all of your performance-critical code into functions.
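A minimal illustration of that advice (the function name is hypothetical):

```julia
# Wrapping the generation in a function lets Julia compile a specialized
# method, instead of running slow, untyped global-scope code.
gen_realization(N, p) = [rand() < p for _ in 1:N]

v = gen_realization(10^6, 0.3)   # fast after the first (compilation) call
```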

I am sorry I wrote my post in a confusing way.

For each realization of the lattice I solve dynamical equations of motion and extract correlation functions. Then I need to average these correlation functions over all realizations. Integrating the system of equations for a particular realization takes about 100 seconds, which is much larger than the time I spend generating a lattice realization and setting up the equations of motion. So I don’t care how much time I spend generating a vector.

The problem is how many realizations I need to average over so that the result is very close to the one obtained by averaging over an infinite number of realizations. So by efficiency I mean the speed with which the average (of correlation functions) over a finite set of random vectors converges, as the set grows, to the infinite-set value. Is it possible to speed up this convergence by a suitable choice of sampling algorithm? (Maybe I need some analogue of the Metropolis algorithm for a discrete distribution.)

You could try a low-discrepancy sequence, e.g. with Sobol.jl, instead of random numbers. You could also try one of the importance-sampled Monte Carlo methods in Cuba.jl.

But essentially you are doing very high-dimensional integration (I’m assuming `N` is large?), and there are no universal solutions for this — high-dimensional integration is intrinsically hard to do accurately via brute-force methods. There are a lot of specialized techniques that people have used on specific problems, but it requires some problem-specific analysis. For example, you say your problem involves modeling disorder — if the disorder is small, and you can use a perturbation-theory technique to linearize the problem as a function of the disorder, then there are a lot of analytical techniques you can employ to simplify matters (often by many orders of magnitude).
