Efficient binary vector

question

#1

I wish to create a random vector X of length N with entries equal to either 1 or -1. For example,

using Distributions
X = 2.*rand(Bernoulli(p),N)-1

would work, where p=1/3. The catch, however, is that N=10^10. X, being Array{Int64}, consumes too much memory. But a trick like:

X = convert(Array{Int8}, 2.*rand(Bernoulli(p),N)-1)

also does not seem to help. Does anyone know a way to generate a Bernoulli§ N-vector efficiently, i.e., as a binary array (not as a 64-byte integer array)? Thank you!


#2

How about BitArray(rand() < p for x in 1:N)?


#3

Or if you really do want the +1 or -1 entries, you can use Int8[2*rand(Bernoulli(p))-1 for i in 1:N] but that won’t be as space efficient.


#4

Definitely! convert(Array{Int8}, 2.*rand(Bernoulli(p),N)-1) probably isn’t helpful because it needs to first construct 2.*rand(Bernoulli(p),N)-1 before passing it to convert. Furthermore, rand(Bernoulli(p),N) is also allocating a temporary before doing the multiplication. When you’re dealing with things this large, you want to be very careful about those temporaries. You can use broadcast fusion, but note that 10^10 bytes is still 10GB.

julia> using Compat

julia> function f(p, N)
           A = Array{Int8}(undef, N)
           A .= 2.*rand.(Bernoulli(p)) .- 1
           return A
       end
f (generic function with 1 method)

julia> @time f(.4, 10^7);
  0.087543 seconds (7 allocations: 9.537 MiB)

julia> @time f(.4, 10^8);
  0.840060 seconds (7 allocations: 95.368 MiB, 17.02% gc time)

julia> @time f(.4, 10^9);
  7.077847 seconds (7 allocations: 953.675 MiB, 1.56% gc time)

It’s still gonna take some time for 10^10, and the vast majority of the time is being spent in rand. You can cut the space down by a factor of 8 if you use a BitArray like @dawbarton suggested, but the bigger savings come from using bitrand (which directly generates that BitArray with a p=0.5 distribution). A mapped array can make it still behave like an array of Int without any temporaries:

julia> using MappedArrays

julia> @time mappedarray(x->2*x-1, bitrand(10^10))
  0.936435 seconds (1.07 k allocations: 1.164 GiB, 8.79% gc time)
10000000000-element MappedArrays.ReadonlyMappedArray{Int64,1,BitArray{1},##7#8}:
  1
 -1
…

#5

bitrand certainly is appealing, but I need to be able to set p=1/3 (or some value other than 1/2). Has anyone generalized bitrand in this manner? Grateful for your help…


#6

Looking at the implementation of bitrand it looks like it is tied to p=1/2. It generates a series of uniformally distributed UInt64 values that are (essentially) reinterpreted as a binary string giving p=1/2. I personally can’t think of any way you can change that since most changes to the underlying random number generation will mean that the probability of each bit being true or false will change depending on the bit position.

My solution of BitArray(rand() < p for x in 1:N) is the same in terms of memory consumption but requires a call of the random number generator for each bit; this is the only (easy) way I can think of getting a variable probability but it is more costly computationally. In contrast, bitrand is exploiting the p=1/2 constraint to extract as much randomness from each call of the RNG as possible.

That said, if you only need to generate the BitVector once then that’s not so bad…


#7

Consider an mmapped array, like this:

import Mmap

function random_bit_array(N, p)
    _, io = mktemp()
    A = Mmap.mmap(io, BitVector, N)
    i = 0
    @inbounds while i < N
        ix = (i+1):min(i+8, N)
        A[ix] .= rand(length(ix)) .< p
        i += 8
    end
    A
end

@time A = random_bit_array(10^10, 1/3);

Will depend on your hard disk speed, for me this is around 100s.


#8
  • rand(Bernoulli(p),N) is slow because it’s using 52 bits of randomness to generate each element.
  • bitrand(N) is fast because it’s using 1 bit of randomness per element. Even better: it’s operating 64 bits at a time.

Of course, if you want p!=0.5, you’ll need more than one bit of randomness per element. But if you’re working with small fractions — particularly combinations of fractions with powers of two in the denominator — you can construct it yourself with a simple logical truth table:

bitrand(N) .& bitrand(N) # p = 1//4
bitrand(N) .& bitrand(N) .& bitrand(N) # p = 1//8

I imagine there’s some trick to efficiently get a 1//3, but my probability theory is too rusty to see it right now.


#9

Out of curiosity, in what context would one need to generate 10^10 bernoulli random variables? Also, if you really want to save time, you could generate, say 10^5 bernoulli random draws and then append that to itself 10^5 times. (Shuffle with each append if order really matters.) Technically, the distribution will not be a bernoulli, but it will be near enough to not matter in any application that I know of.


#10

I’m interested in simulating 1D asymmetric simple random walks with reflection at the origin; see sections 1 & 2 of

https://arxiv.org/abs/1802.04615

The low-memory approach (computing incremental steps one-by-one in sequence) is slow when N=10^10. If we instead efficiently store a vector X (computed before actually executing the walk), then a time savings should be possible.


#11

Why would it be faster to first store a vector?


#12

Are you sure about this? I see no intrinsic reason for this (unless, of course, the simulated quantity depends on the path globally and there is no online statistic). Perhaps some example code would make these things more concrete.

Random number generation is deterministic once you set the seed, so it effectively “stores” the whole thing in a (relatively) small number of bytes.


#13

Maybe the poster is following habits from Python or Matlab, which train people into thinking that “vector” operations are always fast and loops are always slow. I agree that it seems unlikely to be the case here.


#14

Maybe the RNG code is evicted from a cache if it is called only when a number is needed. But, I’ve never looked into this. Several years ago, I tested whether it was worth generating an array of random numbers for some Monte Carlo code for which the cpu time consumed by the RNG was significant. It made no difference. Still, I always used an array with the Mersenne Twister. I don’t recall if it was in the original code, or if I borrowed it from elsewhere.

EDIT: I definitely saw this method at least once in a prominent source. Maybe code from Knuth. You call a routine to get a sample. The routine uses a static array and an index and refills when necessary. The idea that this is the correct way may have propagated from there.


#15

It’s easy to write routines that use 8/3 bits, on average, per sample (using rejection). I tried it last night and was not able to beat rand() < 1/3 in speed. But, it still may be possible.

Maybe there is something analogous to the algorithm that generates two normally distributed samples.


#16

That sort of technique is common when instruction & data caches come into play, and it can be faster in that case, greatly increasing locality of reference in both caches. If you have a large number of cores sharing L2/L3 caches on a physical CPU, and have a lot of tasks doing different things, it helps.


#17

Even if chunking the random generation were beneficial, surely you wouldn’t use such a gigantic cache that it would swallow that much memory? Why not use a reasonably sized cache that you periodically refill?


#18

@semi-colon, that sounds interesting. I agree with everyone else that there should be no performance improvement to pre-computing the N=10^10 random sequence. (Unless you’re pre-computing to ensure that you have repeated access to the identical sequence.)


#19

Instructions are often stored in a separate cache. So its the RNG that should stay in cache.

EDIT: I just re-read. What I mean by “cache” is not a user allocated array, but cache in the CPU/memory system.


#20

If somebody is running a large number of workers all running the same code, using the same static tables, then it won’t matter so much, but if you have a mixed workload, then the RNG will tend to get pushed out of cache, which is why a (reasonably sized) buffer, can help performance (also when dealing with multiple threads).
We had the same issue for getting unique (sequential) ids, each process would get a batch (range) of ids, that it could hand out locally, instead of each process having to single thread every time a new id was required.