The random number generator is, as far as I know, already buffering its output. This is necessary because Julia uses the Mersenne Twister, which has an unholy amount of internal state.
So, as far as I gathered, you do not need random access to your random vector? Then you should generate as you go, and possibly add your own buffer in front, something like the sketch below.
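A minimal sketch of the buffering idea (the function name and buffer size are my own choices, not from any package):

using Random

# "Generate as you go" with a small reusable buffer: allocated once,
# refilled in place, handed to the caller one word at a time.
function foreach_random_word(f, nwords::Integer; bufsize::Int = 1024)
    buf = Vector{UInt64}(undef, bufsize)
    remaining = nwords
    while remaining > 0
        n = min(remaining, bufsize)
        rand!(view(buf, 1:n))      # bulk refill, no fresh allocation
        for i in 1:n
            f(buf[i])              # consume one word
        end
        remaining -= n
    end
end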
Secondly, how many random bits do you generate? Well, each symbol has an entropy of -(p*log(p)+(p-1)*log(p-1)), or in other words, 1e10 random coin-tosses with probability 1/3 : 2/3 only need 6.3e9 random bits. Generating 64 bits for each coin-toss is wasteful.
If you do need random access, I would also recommend not storing the random array. Instead, store a seed and use a PRNG with random access to the stream; for example randomstream[n] = aes_encrypt(key, n).
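aes_encrypt here is hypothetical; Julia has no built-in AES. As a stand-in, here is a sketch of the same counter-based idea using the standard SplitMix64 finalizer:

# Counter-based stream with O(1) random access: element n is a pure
# function of (key, n). SplitMix64 mixer as a cheap AES stand-in.
function splitmix64(x::UInt64)
    x += 0x9e3779b97f4a7c15
    x = (x ⊻ (x >> 30)) * 0xbf58476d1ce4e5b9
    x = (x ⊻ (x >> 27)) * 0x94d049bb133111eb
    return x ⊻ (x >> 31)
end

randomstream(key::UInt64, n::Integer) = splitmix64(key ⊻ UInt64(n))

No array is stored; randomstream(key, n) recomputes position n on demand.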
So, now your question should be: how do you transform a uniform random bitstream into a Bernoulli random bitstream, preferably using only a few cycles per bit?
Unfortunately I don't know. Something like asymmetric numeral systems (ANS) should do the job; does anyone else here know?
AES as RNG:
My system takes (on Julia 0.7) 0.84 ns for mem-copying one word (8 bytes), 2.4 ns for generating and storing one random word (via Random.rand! on a large Vector{UInt64}), and ca. 3 ns for AES-encrypting one word (6 ns ≈ 12 cycles per encryption of one 128-bit block, i.e. two words, measured via @btime without touching memory), using llvmcall and the AES-NI hardware instructions.
Amusingly this lines up almost perfectly with Agner Fog's tables (11 AES instructions; these have a throughput of 1 and a latency of 7 on Broadwell). This tells me that @btime is tight enough to fit into the reorder buffer, i.e. subsequent @btime iterations actually run simultaneously. AES as a PRNG should probably do at least some buffering, because we expect ~80 cycles of latency for the encryption.
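For the curious, calling the AES-NI round instruction from Julia looks roughly like this (x86-64 with AES-NI only; a single round for illustration, since a full AES encryption chains 10+ such rounds with expanded round keys):

const V2 = NTuple{2, VecElement{UInt64}}   # maps to LLVM <2 x i64>, one 128-bit block

# One hardware AES encryption round on block `a` with round key `k`.
aesenc(a::V2, k::V2) = Base.llvmcall(
    ("declare <2 x i64> @llvm.x86.aesni.aesenc(<2 x i64>, <2 x i64>)",
     """
     %r = call <2 x i64> @llvm.x86.aesni.aesenc(<2 x i64> %0, <2 x i64> %1)
     ret <2 x i64> %r
     """),
    V2, Tuple{V2, V2}, a, k)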
> Secondly, how many random bits do you generate? Well, each symbol has an entropy of -(p*log(p)+(p-1)*log(p-1)), or in other words, 1e10 random coin-tosses with probability 1/3 : 2/3 only need 6.3e9 random bits. Generating 64 bits for each coin-toss is wasteful.
Your expression for information entropy has a sign error and should use log2, not log (if it represents code rather than math). The correct formula is H(p) = -(p*log2(p) + (1-p)*log2(1-p)).
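Checking the numbers in Julia with the corrected formula (H is a throwaway name):

H(p) = -(p * log2(p) + (1 - p) * log2(1 - p))   # binary entropy, bits per symbol

H(1/3)          # ≈ 0.918 bits per toss
1e10 * H(1/3)   # ≈ 9.18e9 bits for 1e10 tosses

(The 6.3e9 figure upthread is the same quantity measured in nats, i.e. computed with the natural log.)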
Summing works for ordinary random walks, but not for reflected random walks. Also, I'm interested not in the final position of the walker, but rather in its maximum distance from the origin (over the entire duration of the walk). The code I have is:
function rrw(N, p)
    # Pre-generate all N steps: X[t] is true (a +1 step) with probability p.
    X = BitArray(rand() < p for t in 1:N)
    s = 0   # current position, reflected at the origin
    m = 0   # maximum position reached so far
    for t in 1:N
        s = max(0, s + 2*X[t] - 1)   # step +1 with prob p, else -1; reflect at 0
        m = max(m, s)
    end
    return m
end
and suggestions for improving this would be welcome. The faster, the better. Thanks!
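One immediate improvement, per the generate-as-you-go advice upthread: skip the BitArray entirely and draw each step inside the loop (rrw_streaming is a made-up name):

# Same walk without materializing X; saves the allocation and memory traffic.
function rrw_streaming(N, p)
    s = 0   # current position, reflected at the origin
    m = 0   # maximum position reached so far
    for t in 1:N
        s = rand() < p ? s + 1 : max(0, s - 1)   # +1 with prob p, else -1; reflect at 0
        m = max(m, s)
    end
    return m
end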
This may not be relevant, but if p happens to be of the form k/2^n for an integer k, you could perform fast bitwise operations on n binary vectors to get the desired distribution. For example, for p = 1/4 take the AND of two uniformly distributed bit sequences.
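For example, with p = 1/4:

using Random

a = bitrand(10^8)   # uniform bits, P(1) = 1/2
b = bitrand(10^8)
x = a .& b          # each bit is 1 with probability 1/4

BitArray operations like .& work 64 bits per word, so this runs close to memory bandwidth.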
This should probably be marked as the solution. The main question in the thread is "is it better or worse to generate an array of random numbers, rather than generating them as needed?" The answer is "worse". At least in the vast majority of use cases.
A p=1/2 random string decompresses into a p=1/3 random string
Alas, Huffman needs to discretize probabilities to powers of 2, so one needs a fancy arithmetic/ANS/FSE decompressor. And I must admit that I can't recall the details well enough to say whether this can be done chunkwise for the special case of two symbols. You could chunk 8 output bits, each symbol having inverse probability 3^k * 1.5^(8-k), where k is the number of ones, and see whether some library permits setting the table explicitly and with high precision (e.g. each probability as k/2^24). Your generator would be something like cat /dev/urandom | zstd --decompress --no-levzimpel --table=... which should be pretty fast (the command names are made up; I know that zstd uses such a compression internally).
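Made-up flags aside, here is a self-contained sketch of the underlying trick in Julia: run a binary range decoder on uniform random bytes, so it emits bits with the model probability p while consuming roughly H(p) input bits per output bit. All names and constants are my own illustration, not any library's API:

# Feed uniform random bytes through a binary range decoder; the decoded
# symbols come out Bernoulli(p), consuming ~H(p) input bits per output bit.
function bernoulli_bits(n::Integer, p::Float64)
    TOP = UInt64(1) << 32      # 32-bit coding range
    BOT = UInt64(1) << 24      # renormalization threshold
    range = TOP
    code  = UInt64(0)
    for _ in 1:4               # preload 4 random bytes, so 0 <= code < range
        code = (code << 8) | rand(UInt8)
    end
    out = BitVector(undef, n)
    for i in 1:n
        threshold = floor(UInt64, range * p)   # split point: P(1) ≈ threshold/range
        if code < threshold
            out[i] = true
            range = threshold
        else
            out[i] = false
            code  -= threshold
            range -= threshold
        end
        while range < BOT      # renormalize: shift in fresh uniform bytes
            range <<= 8
            code = (code << 8) | rand(UInt8)
        end
    end
    return out
end

The per-symbol probability is quantized to threshold/range, so with the 24-bit renormalization floor the bias is at most about 2^-24; e.g. sum(bernoulli_bits(10^6, 1/3)) should come out near 333,000.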
If you really care about speed, you can also probably parallelize your code by computing the max for different segments of the random walk and then connecting/aggregating the results at the end.
How do you know where the walk will be in the different segments until you run the simulation? You can certainly parallelize over independent runs, though.
You can solve for the relative change from beginning to end of each segment as well as the max on each segment. You can then paste the endpoints together (the key step) to get the absolute levels of the walk at each of the endpoints and the absolute max on each segment. The absolute max over the whole walk is the absolute max over the segments.
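For an ordinary (non-reflected) walk, that pasting step looks like the sketch below (names are mine). The reflected case is exactly where the objections above and below apply, since a segment's trajectory then depends on its absolute starting level.

# Each segment is summarized by (net displacement, local running max).
function segment_summary(steps)   # steps: iterable of +1/-1 increments
    s = 0; m = 0
    for x in steps
        s += x
        m = max(m, s)
    end
    return (s, m)
end

# Paste: offset each local max by the absolute level at the segment start.
function paste(summaries)
    offset = 0; m = 0
    for (s, mseg) in summaries
        m = max(m, offset + mseg)
        offset += s
    end
    return m
end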
What I was trying to say is that you don't know what the segments look like if e.g. there's a reflecting barrier; you only know if the system is translationally invariant and it doesn't matter where you start.