Efficient binary vector

The random number generator is, as far as I know, already buffering its output. This is necessary because Julia uses the Mersenne Twister, which has an unholy large internal state.

So, as far as I gathered, you do not need random access to your random vector? Then you should generate as you go, and possibly add your own buffer in front.
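For instance, a buffered front-end could look roughly like the following sketch (the names BufferedRand and nextrand! are made up for illustration; this is not an existing API):

```julia
using Random

# Sketch: consume random draws through a small refillable buffer,
# instead of materializing the full random vector up front.
mutable struct BufferedRand
    rng::MersenneTwister
    buf::Vector{Float64}
    pos::Int
end

# Start with pos past the end so the first call triggers a refill.
BufferedRand(seed::Integer, len::Integer = 2^16) =
    BufferedRand(MersenneTwister(seed), Vector{Float64}(undef, len), len + 1)

function nextrand!(b::BufferedRand)
    if b.pos > length(b.buf)
        rand!(b.rng, b.buf)   # refill in one bulk call
        b.pos = 1
    end
    x = b.buf[b.pos]
    b.pos += 1
    return x
end
```

The bulk `rand!` call is what makes this cheap: the generator amortizes its state updates over the whole buffer.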

Secondly, how many random bits do you generate? Well, each symbol has an entropy of -(p*log(p)+(p-1)*log(p-1)), or in other words, 1e10 random coin-tosses with probability 1/3 : 2/3 only need 6.3e9 random bits. Generating 64 bits for each coin-toss is wasteful.

If you do need random access, I would also recommend not storing the random array. Instead store a seed, and use a PRNG with random access to the stream; for example randomstream[n] = aes_encrypt(key, n).
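A sketch of the idea, with splitmix64 standing in for aes_encrypt as the mixing function (the names and the key⊻counter combination are illustrative, not a hardened construction):

```julia
# Sketch: O(1) random access into a pseudorandom stream via a
# counter-based generator. splitmix64 is a stand-in for
# aes_encrypt(key, n); any good bijective mixer of (key, counter)
# plays the same role. UInt64 arithmetic in Julia wraps, as required.
function splitmix64(x::UInt64)
    x += 0x9e3779b97f4a7c15
    x = (x ⊻ (x >> 30)) * 0xbf58476d1ce4e5b9
    x = (x ⊻ (x >> 27)) * 0x94d049bb133111eb
    return x ⊻ (x >> 31)
end

# The n-th word of the stream for a given key — no stored array needed.
stream_word(key::UInt64, n::Integer) = splitmix64(key ⊻ UInt64(n))
```

Any index can be queried in any order, and the whole "array" is reproducible from the key alone.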

So, now your question should be: How do you transform a uniform random bitstream into a bernoulli-random bitstream? Preferably using only very few cycles / bit.

Unfortunately I don't know. Something like asymmetric numeral systems should do the job; does anyone else here know?


AES as RNG:

My system takes (on Julia 0.7) 0.84ns for mem-copying one word (8 bytes), 2.4ns for generating and storing one random word (via Random.rand! on a large Vector{UInt64}), and ca. 3ns for aes-encrypting one word (6ns ~ 12 cycles per encryption via @btime, without touching memory), using llvmcall and the aesni hardware instructions.

Amusingly this lines up almost perfectly with Agner Fog's tables (11 aes instructions; these have a throughput of 1 and latency of 7 on Broadwell; this tells me that @btime is tight enough to fit into the reorder buffer, i.e. subsequent @btime iterations are actually done simultaneously. AES as a PRNG should probably do at least some buffering because we expect ~80 cycles latency for the encryption).

Possibly yes, but given that e.g.

sump(N, p) = sum(rand() < p for _ in 1:N)

@time sump(10^10, 1/3)

takes about 20 seconds on my (not very powerful) laptop, spending time on some sophisticated algorithm is also wasteful, in terms of developer time.


Secondly, how many random bits do you generate? Well, each symbol has an entropy of -(p*log(p)+(p-1)*log(p-1)), or in other words, 1e10 random coin-tosses with probability 1/3 : 2/3 only need 6.3e9 random bits. Generating 64 bits for each coin-toss is wasteful.

Your expression for information entropy has a sign error and should use log2, not log (if it represents code rather than math). Here is the correct formula:

julia> information_entropy(p) = -p*log2(p) - (1-p)*log2(1-p);
julia> information_entropy(1/2)
1.0
julia> information_entropy(1/3)
0.918295834054489

So a random string of Boolean samples with p=1/3 should be encodable with about 0.91 bits per bit in the string.

But this is not the task. The task is to generate Booleans with p=1/3 from a string with p=1/2.
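One exact construction for this direction is simple rejection on pairs of fair bits; a minimal sketch (bernoulli_third is a made-up name, and drawing a full UInt8 per attempt is deliberately wasteful for clarity):

```julia
using Random

# Sketch: exact Bernoulli(1/3) from fair bits by rejection.
# Two fair bits give v uniform on {0,1,2,3}; redraw on v == 3, so v
# becomes uniform on {0,1,2}, and v == 0 then has probability exactly 1/3.
# Expected cost: 2 / (3/4) = 8/3 fair bits per sample.
function bernoulli_third(rng::AbstractRNG)
    while true
        v = rand(rng, UInt8) & 0x03       # keep two fair bits
        v != 0x03 && return v == 0x00
    end
end
```

This is exact but uses about 8/3 fair bits per output bit, well above the 0.918-bit entropy bound; the fancier coding schemes discussed above are what close that gap.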

(I deleted the previous post because Discourse sent it while I was trying to insert LaTeX)

Summing works for ordinary random walks, but not for reflected random walks. Also, I'm interested not in the final position of the walker, but rather in its maximum distance from the origin (over the entire duration of the walk). The code I have is:

function rrw(N, p)
    X = BitArray(rand() < p for t in 1:N)
    s = 0
    m = 0
    for t = 1:N
        s = max(0, s + 2*X[t] - 1)
        m = max(m, s)
    end
    return m
end

and suggestions for improving this would be welcome. The faster, the better. Thanks!

Using X = [rand()<p for t in 1:N] gives a Vector{Bool}. Your code runs about 10–20% faster this way. Not a fantastic improvement.

function rrw2(N, p)
    s = 0
    m = 0
    for t = 1:N
        X = rand() < p
        s = max(0, s + 2*X - 1)
        m = max(m, s)
    end
    return m
end

then

julia> @time rrw(10^9, 1/3)
  4.117548 seconds (10 allocations: 119.214 MiB, 0.83% gc time)
27

julia> @time rrw2(10^9, 1/3)
  2.457705 seconds (6 allocations: 192 bytes)
28

This may not be relevant, but if p happens to be of the form k/2^n for integer k you could perform fast bitwise operations on n binary vectors to get the desired distribution. For example, for p = 1/4 take the AND of two uniformly distributed bit sequences.
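A sketch of that trick for p = 1/4, producing 64 Bernoulli samples per word (bernoulli_quarter_words is an illustrative name):

```julia
using Random

# Sketch of the dyadic trick: the bitwise AND of n independent fair
# bit-words yields Bernoulli(1/2^n) bits. Here n = 2, so each bit of
# the result is 1 with probability 1/4, at 64 samples per word.
function bernoulli_quarter_words(rng::AbstractRNG, nwords::Integer)
    a = rand(rng, UInt64, nwords)
    b = rand(rng, UInt64, nwords)
    return a .& b      # bit i is 1 iff both a's and b's bit i are 1
end
```

More generally, k/2^n probabilities come from short AND/OR circuits on n fair words, which is about as cheap per sample as this can get.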

This should probably be marked as the solution. The main question in the thread is "is it better or worse to generate an array of random numbers, rather than generating them as needed?" The answer is "worse". At least in the vast majority of use cases.


Oops. Thanks.

A p=1/2 random string decompresses into a p=1/3 random string :wink:

Alas, Huffman needs to discretize probabilities to powers of 2, so one needs a fancy arithmetic/ANS/FSE decompressor. And I must admit that I can't recall the details well enough to say whether this can be done chunkwise for the special case of two symbols. You could chunk 8 output bits, each symbol having inverse probability 3^k*1.5^(8-k), where k is the number of ones, and see whether some library permits setting the table explicitly and with high precision (e.g. each probability as k/2^24). Your generator would be something like cat /dev/urandom | zstd --decompress --no-levzimpel --table=... which should be pretty fast (the command names are made up; I know that zstd uses such a compression internally).
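Short of a full ANS/FSE decompressor, a simpler near-optimal construction is interval refinement (a Knuth–Yao-style sampler). A sketch, assuming fair input bits and using a made-up name:

```julia
using Random

# Sketch: turn fair bits into one Bernoulli(p) draw by refining a binary
# interval [lo, lo + width) until it lies entirely on one side of p.
# The consumed bits define a uniform u in [0,1); we answer u < p as soon
# as the interval decides it. On average this costs only about 2 fair
# bits per draw, independent of p (though still above the entropy bound).
function bernoulli_interval(rng::AbstractRNG, p::Float64)
    lo, width = 0.0, 1.0
    while true
        width /= 2
        rand(rng, Bool) && (lo += width)   # consume one fair bit
        lo + width <= p && return true     # interval entirely below p
        lo >= p && return false            # interval at or above p
    end
end
```

Amortized/batched variants of the same interval idea are essentially what arithmetic decoders do, which is how they approach the 0.918 bits-per-sample bound.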

If you really care about speed, you can also probably parallelize your code by computing the max for different segments of the random walk and then connecting/aggregating the results at the end.

How do you know where the walk will be in the different segments until you run the simulation? You can certainly parallelize over independent runs, though.

You can solve for the relative change from beginning to end of each segment as well as the max on each segment. You can then paste the endpoints together (the key step) to get the absolute levels of the walk at each of the endpoints and the absolute max on each segment. The absolute max over the whole walk is the absolute max over the segments.
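For the ordinary (non-reflected) walk, that pasting step can be sketched as follows; segment_summary and global_max are hypothetical names, steps are ±1, and segmax is measured relative to the segment's start (including the start level itself, mirroring the m = 0 initialization in rrw):

```julia
# Sketch: summarize each segment by its net displacement and its
# running max relative to the segment start; then a sequential pass
# over the summaries recovers the global maximum.
function segment_summary(steps)
    s = 0
    m = 0
    for x in steps
        s += x
        m = max(m, s)
    end
    return (shift = s, segmax = m)
end

function global_max(summaries)
    offset = 0   # absolute level at the current segment's start
    m = 0
    for seg in summaries
        m = max(m, offset + seg.segmax)   # absolute max on this segment
        offset += seg.shift               # absolute level at segment end
    end
    return m
end
```

The segment summaries are independent, so they can be computed in parallel (e.g. with Threads.@threads over chunks); only the cheap pasting pass is sequential. As noted below, this breaks down once a reflecting barrier makes the segments depend on their absolute starting level.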

What I was trying to say is that you don't know what the segments look like if e.g. there's a reflecting barrier; you only know if the system is translationally invariant and it doesn't matter where you start.


Good point. The presence of a reflecting barrier does make straightforward parallelism impossible.