What's the fastest way to fill an array with random bits without using private Random.jl internals?

Suppose I have an array of type e.g. Array{Float64} and I want to fill it with random bits as fast as possible. Note that that’s different from filling it with random floats between 0 and 1. A straightforward way to do it is

rand!(reinterpret(UInt64, a))

which is fine, but the fastest way I’ve found to do it is

GC.@preserve a rand!(Random.UnsafeView{UInt64}(pointer(a), length(a)))

I’m a little torn, because I want my package to have the best performance possible, but I also want my package to be reliable without the maintenance burden that comes with using private internals.

Is it possible to get the performance of the second method using only the public interface?

Out of curiosity, why do you want that?

unsafe_wrap seems to work:

julia> N = 1000; a = rand(N, N);

julia> b = reinterpret(UInt64, a);

julia> c = Random.UnsafeView{UInt64}(pointer(a), length(a));

julia> d = unsafe_wrap(Vector{UInt}, Ptr{UInt}(pointer(a)), length(a));

julia> @b rand!($b), rand!($c), rand!($d)
(979.646 μs, 412.827 μs, 412.214 μs)

I’m working on a package for the ziggurat algorithm, Ziggurats.jl. It’s the same algorithm used by randn and randexp, but Ziggurats.jl allows user-specified probability density functions. Without going into too many algorithmic details: to make one sample, you generate a UInt64, then do some table lookups, math, etc., and 99% of the time you get a result; otherwise you regenerate the UInt64 and retry. When producing values in bulk, rather than generating one UInt at a time, it can be faster to just fill the array with random bits and use those as the first UInt tried for each sample. So, in Random.jl, randn and randexp have lines something like this:

GC.@preserve A rand!(rng, UnsafeView{UInt64}(pointer(A), length(A)))
for i in eachindex(A)
    @inbounds A[i] = _ziggurat_rand(rng, reinterpret(UInt64, A[i]))
end

where, roughly 1% of the time, _ziggurat_rand calls itself recursively with a new UInt: _ziggurat_rand(rng, rand(UInt64)).
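The real _ziggurat_rand and its tables live in Ziggurats.jl; a minimal runnable sketch of this bulk fill-then-refine pattern, with a placeholder kernel standing in for the real one, might look like:

```julia
using Random

# Illustrative stand-in for _ziggurat_rand: consumes one UInt64 and, on the
# (rare) rejection path, would retry with fresh bits from the rng.
# The real kernel does table lookups; this placeholder just makes a uniform float.
function sample_from_bits(rng::AbstractRNG, u::UInt64)
    # real code: return accepted ? value : sample_from_bits(rng, rand(rng, UInt64))
    return Float64(u >> 11) * 0x1p-53  # uniform in [0, 1), placeholder only
end

# The bulk pattern: one fast fill of raw bits, then an in-place refinement pass.
function bulk_sample!(rng::AbstractRNG, A::Vector{Float64})
    GC.@preserve A rand!(rng, unsafe_wrap(Vector{UInt64}, Ptr{UInt64}(pointer(A)), length(A)))
    for i in eachindex(A)
        @inbounds A[i] = sample_from_bits(rng, reinterpret(UInt64, A[i]))
    end
    return A
end
```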

I could do the same thing, but I don’t want Random.jl to change the interface out from under me.


Note that the aliasing Vector and its backing Memory do contribute two small allocations.

julia> @btime rand!(reinterpret(UInt64, $a));
  29.800 μs (0 allocations: 0 bytes)

julia> @btime GC.@preserve $a rand!(Random.UnsafeView{UInt64}(pointer($a), length($a)));
  5.400 μs (0 allocations: 0 bytes)

julia> @btime GC.@preserve $a rand!(unsafe_wrap(Vector{UInt}, Ptr{UInt}(pointer($a)), length($a)));
  5.433 μs (2 allocations: 64 bytes)

Wonder what the ~24 μs overhead is for.

Unsafe wrapping a Julia array pointer is UB and will cause segfaults.


What? The only things you need to do to stay safe from segfaults are:

  1. Only do this with bitstypes.
  2. Don’t forget that the unsafe_wrap thingy doesn’t keep the underlying allocation alive. You may need to GC.@preserve the underlying thing.
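Concretely, both rules can be satisfied with a small helper (the function name here is illustrative, not from any package):

```julia
using Random

# Illustrative helper: Float64 and UInt64 are both 8-byte bitstypes (rule 1),
# and GC.@preserve keeps `a` rooted while the wrapped view is live (rule 2).
function fill_random_bits!(a::Array{Float64})
    GC.@preserve a rand!(unsafe_wrap(Vector{UInt64}, Ptr{UInt64}(pointer(a)), length(a)))
    return a
end
```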

It seems to work.

A small issue is that Julia’s aliasing rules used to say that Memory{UInt64} and Memory{Float64} were forbidden from aliasing, i.e. sharing memory.

Since you are lying to the compiler about the TBAA class of the underlying memory, the optimizer could theoretically miscompile your code.

You have a potential aliasing issue when you have two accesses to the same address through two differently typed Memory, and at least one of them is a write.

You can fix the aliasing issue by telling the compiler that you’re doing something fishy, i.e. by inserting a barrier between any two such accesses. The barrier can be emitted as

julia> membarrier() = Base.llvmcall("""call void asm ";","~{memory}"()
           ret void""", Cvoid, Tuple{})

This is the same kind of barrier that you need if you play games with virtual memory, like mapping the same physical memory multiple times with different virtual addresses, and the same barrier you would need in C, C++, or Rust.
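As a sketch, the barrier sits between the differently typed write and read (repeating the membarrier definition so the snippet is self-contained):

```julia
# Compiler-only fence: tells LLVM that memory may have changed, so it cannot
# reorder or fold accesses across it based on type-based aliasing assumptions.
membarrier() = Base.llvmcall("""call void asm ";","~{memory}"()
    ret void""", Cvoid, Tuple{})

a = zeros(Float64, 4)
GC.@preserve a begin
    b = unsafe_wrap(Vector{UInt64}, Ptr{UInt64}(pointer(a)), length(a))
    b[1] = reinterpret(UInt64, 1.0)  # write through the UInt64 view
    membarrier()                     # barrier between the aliased accesses
    x = a[1]                         # read through the Float64 view
end
```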

…all that being said, I currently cannot reproduce TBAA-based optimizations.

For example

julia> function foo(a::Vector{Int}, b::Vector{UInt32}, n)
           @inbounds for i = 1:n
               a[1] += 1
               b[1] += UInt32(1)
           end
           nothing
       end

should run in O(1), i.e., LLVM should remove the loop. But LLVM fails to figure out that a and b cannot alias, and it also fails to emit a runtime check for whether they alias.


The fast path in Random dispatches via const MutableDenseArray = Union{Base.MutableDenseArrayType{T}, UnsafeView{T}} where {T}, which misses ReinterpretArray.

Maybe that should be a trait instead, catching the case where sizeof(T) == sizeof(S) && a.writable && !IsReshaped && check_ptr_indexable(a)?

Could the generic version be improved by generating the random numbers in blocks and copying, similar to what we do for generic CRC32c in julia/stdlib/CRC32c/src/CRC32c.jl at 4969b6a31ddd4ee3e02b2fc2a477f5875967dc7a · JuliaLang/julia · GitHub?

Probably, but I would expect it to come at the cost of intermediate allocations.

Related package that might be useful to future readers interested in fast random sampling:

If I understand https://github.com/JuliaLang/julia/pull/53687 correctly, Julia doesn’t promise stability for llvmcall users. Also, Random.UnsafeView is planned to be removed in the future (re-evaluate/remove `Random.UnsafeView` · Issue #36718 · JuliaLang/julia · GitHub).

This reminds me of some stuff I was looking at a few years ago as part of a discussion on the distribution of rand. In that version, I would SIMD-compute a batch of values at a time, then refine any of them that needed further work (but still in parallel, using a mask to prevent overwriting finalized values). So I did everything in one pass rather than two and entirely with SIMD.

The thing Random.jl is missing to make that nice is a fast function for generating a block (i.e., tuple) of random bits (which should use an immutable version of the random state to avoid frequent memory writebacks). There’s space for an internal (and maybe eventually public) function to do this. Ideally, a lot of the existing vectorized random generators would change to use it as well (they already do this, but inline in a way that’s difficult to use generically).

Think about what such an interface would need to do to support your use case. Such an interface might be worth adding eventually.

I don’t know enough to comment on all that.

However, right now, for my use case, I don’t think there’s any need for a new implementation or a new function or a new interface. There already is a pretty fast way to fill an array with random bits. rand!(::Array{UInt64}) is fast. My problem is that rand!(reinterpret(UInt64, ::Array{Float64})) is slow. The workaround is GC.@preserve a rand!(Random.UnsafeView{UInt64}(pointer(a), length(a))) which is exactly as fast as rand!(::Array{UInt64}), BUT it uses private internals.

So if a new interface or new code is faster, then that would be great, but really what I want right now is for rand!(reinterpret(UInt64, ::Array{Float64})) to dispatch to the fast code that already exists in Random.jl. Earlier in the thread @adienes said something similar. Maybe they’re on to something?

Can you do it the other way around, making use of the fast method on Array{UInt64} and reinterpreting to floating point afterwards?
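For example, something along these lines (a minimal sketch):

```julia
using Random

u = Vector{UInt64}(undef, 10^6)
rand!(u)                     # hits the fast Array{UInt64} method
f = reinterpret(Float64, u)  # zero-copy Float64 view of the same bits
```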

That’s genius, I love it. If I were making an application or standalone simulation or something like that, I might try it, but Ziggurats.jl is a library, not an application. So the array in question is provided either to the user or by the user. It’s supposed to go something like this:

using Ziggurats, Random
z = ziggurat(...)
rand(z, 10^6)
# Or
a = Vector{Float64}(undef, 10^6)
rand!(a, z)

If I did what you suggest, in the first case I would have to return a ReinterpretArray, and in the second case I’d have to ask the user to provide a Vector{UInt64} and return a ReinterpretArray that aliases that same memory. Either way, it’s a leaky abstraction.

We might be missing a specialized implementation over reinterpreted arrays, because in theory it could work just fine: something like a random-bits method that accepts a ReinterpretArray of UIntx.
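A hypothetical method along those lines (the name rand_bits! and the dispatch signature are guesses, not actual Random.jl API) could unwrap the reinterpret and fill the parent’s memory directly:

```julia
using Random

# Hypothetical sketch, not an actual Random.jl method: for a same-element-size,
# non-reshaped ReinterpretArray backed by an Array, fill the parent's memory
# with random bits directly via the fast Array{UInt64} path.
function rand_bits!(rng::AbstractRNG,
                    a::Base.ReinterpretArray{UInt64,N,Float64,<:Array{Float64},false}) where {N}
    p = parent(a)
    GC.@preserve p rand!(rng, unsafe_wrap(Vector{UInt64}, Ptr{UInt64}(pointer(p)), length(p)))
    return a
end
```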