This reminds me of some stuff I was looking at a few years ago as part of a discussion on the distribution of rand. In that version, I would SIMD-compute a batch of values at a time, then refine any of them that needed further work (but still in parallel, using a mask to prevent overwriting finalized values). So I did everything in one pass rather than two and entirely with SIMD.
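To make the masked-refinement idea concrete, here is a minimal sketch, not the code from that discussion. It assumes a different and simpler problem (drawing integers below a bound by rejection), and the name `batch_below` is hypothetical. Tuples with `map` and `ifelse` stand in for explicit SIMD lanes and masks; whether the compiler actually vectorizes this depends on the RNG and the Julia version.

```julia
using Random

# Hypothetical helper (not part of Random.jl): draw N candidates at once, then
# redraw only the lanes that failed the acceptance test. The `done` mask keeps
# already-accepted lanes from being overwritten.
function batch_below(rng::AbstractRNG, bound::UInt64, ::Val{N}) where {N}
    bound > 0 || throw(ArgumentError("bound must be positive"))
    # Smallest all-ones bit mask that covers bound - 1
    bitmask = typemax(UInt64) >> leading_zeros(bound - one(UInt64))
    vals = ntuple(_ -> rand(rng, UInt64) & bitmask, Val(N))
    done = map(v -> v < bound, vals)
    while !all(done)
        # Fresh candidates are drawn for every lane, but only unfinished lanes take them
        fresh = ntuple(_ -> rand(rng, UInt64) & bitmask, Val(N))
        vals = map((d, v, f) -> ifelse(d, v, f), done, vals, fresh)
        done = map(v -> v < bound, vals)
    end
    return vals
end
```

Calling `batch_below(MersenneTwister(1), UInt64(10), Val(8))` returns an 8-tuple of values in `0:9`. Drawing fresh candidates for all lanes wastes some bits, but it keeps the loop branch-free per lane, which is the point of the mask.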
The thing Random.jl is missing to make that nice is a fast function for generating a block (i.e., a tuple) of random bits. It should work on an immutable version of the random state, so the state can stay in registers instead of being written back to memory on every call. There’s space for an internal (and maybe eventually public) function to do this. Ideally, a lot of the existing vectorized random generators would change to use it as well (they already do something like this, but inlined in a way that’s difficult to reuse generically).
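As a rough sketch of what such a function could look like: the names `SplitMixState`, `mix64`, and `rand_bits_block` below are all hypothetical, nothing here exists in Random.jl, and SplitMix64 is used only because it is short (it is not the generator Random.jl uses by default). The key point is the shape of the interface: the state is an immutable value, and the function returns both the block of bits and the advanced state.

```julia
# Hypothetical immutable generator state (illustration only)
struct SplitMixState
    s::UInt64
end

# SplitMix64 increment and output mixing function
const GAMMA = 0x9e3779b97f4a7c15

@inline function mix64(z::UInt64)
    z = (z ⊻ (z >> 30)) * 0xbf58476d1ce4e5b9
    z = (z ⊻ (z >> 27)) * 0x94d049bb133111eb
    return z ⊻ (z >> 31)
end

# Return an N-tuple of random 64-bit words plus the advanced state.
# Nothing is mutated, so the caller threads the new state through explicitly.
function rand_bits_block(st::SplitMixState, ::Val{N}) where {N}
    bits = ntuple(i -> mix64(st.s + (i % UInt64) * GAMMA), Val(N))
    return bits, SplitMixState(st.s + (N % UInt64) * GAMMA)
end
```

Usage would look like `bits, st = rand_bits_block(st, Val(8))` inside a hot loop. Because each lane depends only on the starting state and its index, one block of 4 yields the same words as two consecutive blocks of 2.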
Think about what such an interface would need to do to support your use case; it might be worth adding eventually.