Linear relationship between Xoshiro tasks

Cool, I didn’t see that! Quite sweet addition to my initial unprincipled PR.

However, I fear that your construction is somewhat broken.

The problem is that you can easily construct task trees that imply linear relations between xoshiro states. That would not be a problem if you advanced the internal xoshiro state in between.

However, xoshiro is designed such that extracting a single number does almost no output mixing; it basically dumps one of its internal states, and then updates the internal state.

This is very good for superscalar CPUs: the current rand output is typically needed right away, before the next rand call, so the CPU can schedule such that the output is available immediately for further computations, and it can update the internal state in the background, in parallel to whatever the user wants to do with the rand output.
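
For reference, a sketch of the xoshiro256++ step, transcribed from Blackman & Vigna's public-domain reference code: the output is just s0 plus a rotated sum of two registers, while the real mixing happens in the state update.

rotl(x::UInt64, k) = (x << k) | (x >> (64 - k))

function xoshiro256pp_step(s0::UInt64, s1::UInt64, s2::UInt64, s3::UInt64)
    out = rotl(s0 + s3, 23) + s0   # near-trivial output "mixing"
    t = s1 << 17                   # the state update does the real work
    s2 ⊻= s0; s3 ⊻= s1; s1 ⊻= s2; s0 ⊻= s3
    s2 ⊻= t
    s3 = rotl(s3, 45)
    return out, (s0, s1, s2, s3)
end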

So, first, to demonstrate (on 1.9) that linear relations are bad for the first output: Suppose we have 4 xoshiro instances with the linear relation A+B=C+D. Then we get:

julia> xoro0 = Random.Xoshiro(); xoro1=Random.Xoshiro();  xoro2=Random.Xoshiro();  xoro3 = Random.Xoshiro(xoro0.s0 + xoro1.s0 - xoro2.s0, xoro0.s1 + xoro1.s1 - xoro2.s1, xoro0.s2 + xoro1.s2 - xoro2.s2, xoro0.s3 + xoro1.s3 - xoro2.s3);

julia> r0=rand(xoro0, UInt64); r1=rand(xoro1, UInt64); r2=rand(xoro2, UInt64); r3=rand(xoro3, UInt64);  r0 + r1 - r2 - r3
0xffffffffff800000

julia> r0=rand(xoro0, UInt64); r1=rand(xoro1, UInt64); r2=rand(xoro2, UInt64); r3=rand(xoro3, UInt64);  r0 + r1 - r2 - r3
0x5464e464eb865237

julia> r0=rand(xoro0, UInt64); r1=rand(xoro1, UInt64); r2=rand(xoro2, UInt64); r3=rand(xoro3, UInt64);  r0 + r1 - r2 - r3
0xc9ec2b54e02cb66d

We can repeat that:

julia> xoro0 = Random.Xoshiro(); xoro1=Random.Xoshiro();  xoro2=Random.Xoshiro();  xoro3 = Random.Xoshiro(xoro0.s0 + xoro1.s0 - xoro2.s0, xoro0.s1 + xoro1.s1 - xoro2.s1, xoro0.s2 + xoro1.s2 - xoro2.s2, xoro0.s3 + xoro1.s3 - xoro2.s3);

julia> r0=rand(xoro0, UInt64); r1=rand(xoro1, UInt64); r2=rand(xoro2, UInt64); r3=rand(xoro3, UInt64);  r0 + r1 - r2 - r3
0x0000000000000000

julia> r0=rand(xoro0, UInt64); r1=rand(xoro1, UInt64); r2=rand(xoro2, UInt64); r3=rand(xoro3, UInt64);  r0 + r1 - r2 - r3
0xf7d6406723374553

julia> r0=rand(xoro0, UInt64); r1=rand(xoro1, UInt64); r2=rand(xoro2, UInt64); r3=rand(xoro3, UInt64);  r0 + r1 - r2 - r3
0xc4b51c8b2237bb0e

Ok, how do we build a task tree that has linear relations? Quite easily, thanks to your excellent documentation – we simply need to use the pedigree vectors satisfying (Base + (0,0)) + (Base + (1,1)) = (Base + (1,0)) + (Base + (0,1)).

So, now running on 1.10 (ran an extra update just for this demo):

julia> function badXoro()
           fut = Threads.@spawn begin
               f11 = Threads.@spawn rand(UInt64)
               _r10 = rand(UInt64)
               (fetch(f11), _r10)
           end
           r11, r10 = fetch(fut)
           f01 = Threads.@spawn rand(UInt64)
           r01 = fetch(f01)
           r00 = rand(UInt64)
           r01 + r10 - r00 - r11
       end
badXoro (generic function with 1 method)

julia> badXoro()
0x0000000000000000

Do we want to split this off into a new thread? Or move to github?

PS. What I would conceptually do is:

  1. When forking a task, increment rngState[4] of the parent.
  2. When forking a task, copy its parent’s rngState[0:4] and then apply a random permutation or random function on rngState[0:4], interpreted as a 320 bit unsigned integer.

This should be reasonably easy to prove good in the random oracle model.

Now, we have a mere Turing machine, no random oracle available. So we approximate the random oracle by e.g. a cryptographic function. Maybe a reduced-round ChaCha, SHA, or AES variant. This only gets called on task forking, so we can spare some cycles (AES is super fast if we use the hardware support that almost all modern CPUs have; but it is a royal pain on build systems to do that for all targets).
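
For illustration only, a sketch of step 2 with full SHA-256 standing in for the random function (the helper name and the byte plumbing are mine, not a concrete proposal):

using SHA

# Treat the five 64-bit state words as 40 bytes and push them through
# SHA-256, extracting 320 bits back out. A hash gives a "random function"
# rather than a permutation, which step 2 above also allows.
function fork_child_state(parent::NTuple{5,UInt64})
    bytes = collect(reinterpret(UInt8, collect(parent)))    # 40 bytes
    d = vcat(sha256(bytes), sha256(vcat(bytes, 0x01)))      # 64 bytes
    return ntuple(i -> reinterpret(UInt64, d)[i], 5)
end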

If you want a minimally invasive unprincipled quickfix, then keep everything, but advance the child RNG. Once should (?) be enough, but my intuition says to advance 4 times.
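
A sketch of that quickfix (warmup! is a hypothetical helper, not existing API):

using Random

# Discard the first few outputs of a freshly forked child RNG so that
# xoshiro's nonlinear state update mixes away linear relations in the seed.
function warmup!(rng::Random.Xoshiro, n::Integer = 4)
    for _ in 1:n
        rand(rng, UInt64)   # output discarded; only the state advance matters
    end
    return rng
end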

Btw, good choice to add the PCG output instead of xoring it. Looks much better with respect to interactions with xoshiro.

PPS. I think we should also take a look at our testing infra. This kind of relation should have been caught by tests – it is a very obvious linear relation in the output. The fact that the tests didn’t catch this implies that they are lacking somewhere.

PPPS. I took a look at the papers you linked. Unfortunately, this fish stinks from the head, so to speak.

The standard definition of “reasonable PRNG” is a map from finite words in SEED(u) EXTRACT+ to e.g. UInt64. That is, given a seed u and N>0 extractions, we get a number (the last number we extracted). This is a good PRNG if it is practically indistinguishable from a random function.

For the original model where parent state is mutated on forking, we had such a map from SEED(u) (PARENT | CHILD | EXTRACT)* EXTRACT.

You added the requirement that forking a task does not change the random stream; so we instead have a map from

SEED(u) ((PARENT | CHILD | EXTRACT)* CHILD | (CHILD | EXTRACT)*) EXTRACT+

i.e. we exclude words that end in PARENT EXTRACT+ because for these the last PARENT call should not influence the random stream.

These maps should now be indistinguishable from a random oracle.

This property breaks with my above example, because of the linear relation

rand(S E) + rand(S C C E) == rand(S P C E) + rand(S C E)

The authors of the paper linked in the PR, Deterministic Parallel Random-Number Generation for Dynamic-Multithreading Platforms, only consider pairwise correlations between random streams.

Their proof may or may not work, I don’t care – the problem is a catastrophic 4-way correlation. I must admit that I am quite shocked that they did not even mention the possibility of higher order correlations, especially since they acknowledge input from Rivest & Shor, and these two should know better than to stop at 2-way correlations (or should at least know that this is a super important limitation to mention!).

10 Likes

To understand your first example better, I’ve reformatted it and wrapped it in a function:

using Random

function correlatedXoshiros()
    x0 = Random.Xoshiro()
    x1 = Random.Xoshiro()
    x2 = Random.Xoshiro()
    x3 = Random.Xoshiro(
        x0.s0 + x1.s0 - x2.s0,
        x0.s1 + x1.s1 - x2.s1,
        x0.s2 + x1.s2 - x2.s2,
        x0.s3 + x1.s3 - x2.s3,
    )
    r0 = rand(x0, UInt64)
    r1 = rand(x1, UInt64)
    r2 = rand(x2, UInt64)
    r3 = rand(x3, UInt64)
    return r0 + r1 - r2 - r3
end

Sample outputs:

julia> correlatedXoshiros()
0x0000000000000000

julia> correlatedXoshiros()
0xffffffffff800000

julia> correlatedXoshiros()
0x0000000000000000

julia> correlatedXoshiros()
0x0000000000000001

There seem to be nine possible outputs altogether:

julia> sort!(unique(correlatedXoshiros() for _ = 1:1000))
9-element Vector{UInt64}:
 0x0000000000000000
 0x0000000000000001
 0x00000000007fffff
 0x0000000000800000
 0x0000000000800001
 0xffffffffff7fffff
 0xffffffffff800000
 0xffffffffff800001
 0xffffffffffffffff

So far this doesn’t have anything to do with task forking; it just shows that if you do end up in a state where there is a linear relationship between the internal states of four Xoshiro256++ RNGs, then there is also an approximate linear relationship between the outputs of those RNGs. As you explain, this is because the output function is pretty much trivial for Xoshiro256++. This is not a problem on its own; it just demonstrates that if you construct a bad situation with multiple Xoshiro256++ RNGs, you can get problems. Of course, we seed our RNGs using strong methods like calling RandomDevice or hashing seeds with SHA256, so by itself this isn’t a problem.

The second example shows how our current task splitting mechanism can lead to such a situation where RNGs states have a linear relationship.

using .Threads

macro fetch(ex) :(fetch(@spawn($(esc(ex))))) end

function taskCorrelatedXoshiros()
    r11, r10 = @fetch (@fetch(rand(UInt64)), rand(UInt64))
    r01, r00 = (@fetch(rand(UInt64)), rand(UInt64))
    r01 + r10 - r00 - r11
end

Calling this function gives the same set of nine not-very-random values:

julia> sort!(unique(taskCorrelatedXoshiros() for _ = 1:1000))
9-element Vector{UInt64}:
 0x0000000000000000
 0x0000000000000001
 0x00000000007fffff
 0x0000000000800000
 0x0000000000800001
 0xffffffffff7fffff
 0xffffffffff800000
 0xffffffffff800001
 0xffffffffffffffff

This clearly establishes that it’s possible to create a situation where four tasks have linearly correlated RNG states without messing around with the RNG internals, just by creating the right task structure, with a fairly simple construction.


That pretty clearly establishes that there’s a problem. Now, let’s turn to what to do about it.

Yes, this would certainly do it. It relies quite heavily on the cascade properties of the cryptographic function, but that’s what they’re designed for, so it should work. I would agree that we have some cycles to spare if we knew that the RNG was actually going to be used in each task, but that’s not the case, and I’d argue that we need this to be as fast as possible because it happens on every task spawn, whether we’re using RNGs or not.

Don’t you end up in the same situation if you manually advance the child RNG then?

Another fix would be to perturb the child state with an operation that commutes even less with Xoshiro’s internal operations. For example, multiply each register by the perturbation value (with the last bit flipped on to make it odd). Multiplication has a quite different algebraic structure from anything that Xoshiro uses.
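
Concretely, that perturbation might look like this (a sketch; perturb is a hypothetical name):

# Multiply each xoshiro register by the perturbation value, forced odd so
# the multiplication is invertible mod 2^64 and no state entropy is lost.
perturb(register::UInt64, w::UInt64) = register * (w | one(UInt64))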

We don’t have any tests for this kind of thing. Adding some would be good.

I think fundamentally the issue here is that the dot product construction that SplitMix and DotMix (the predecessor to SplitMix) rely on inherently produces seed values that are linearly related—it’s a dot product, after all, that’s literally the construction! That linearity always bugged me a bit, but SplitMix is peer reviewed, highly regarded, used in Java, and even recommended by other RNG researchers.


The core issue here is that the perturbation values produced by splitmix-like constructions inherently have linear relationships. If we use those linearly related perturbation values in a linear fashion, the result is linear relationships in the states of different tasks’ RNGs. We can either not use a dot-product-based approach—e.g. use AES or SHA or something like that—or we can try to make the perturbation less linear—e.g. change the addition to a multiplication.

7 Likes

No. After the problematic forking structure, we end up with a linear relation xoro_00 + xoro_11 == xoro_01 + xoro_10.
This linear relation leads to a bad linear relation
rand(xoro_00) + rand(xoro_11) \approx rand(xoro_01) + rand(xoro_10).

However, the bad linear relation goes away after a couple of extractions (xoshiro has a quite nonlinear and cascading update function for its internal state!). It is not obvious to me whether any bad relations remain in e.g.
[rand(advance(xoro_00)), rand(advance(xoro_01)), rand(advance(xoro_10)), rand(advance(xoro_11))]

I would be willing to bet one drink (but no more!) that no practically problematic relations remain after 4 extractions. I am unwilling to bet on results after a single extraction.

In other words, we need to make sure that the linear structure from the dot product gets mixed away in the child before we expose the tainted (correlated!) numbers to the user. (we cannot mix it away in the parent, because we’re not allowed to touch its random stream)

I agree on the aesthetics: “discard the first 4 outputs of the rng” is a very old canard and a huge red flag. Well, :person_shrugging:

A full AES is about 8*11 = 88 cycles last time I looked. It’s less than a cache miss. We can use reduced round versions if we want.

I’m pretty sure we can also delay the expensive aes mixing, i.e. push it into the scheduler: when mounting a task, first check whether its rng is mixed, and do so if necessary. This can run once the Task struct is loaded, ideally in parallel (instruction-level parallelism) to the CPU stalling on some cache miss from pointers in the Task.

PS.

I agree on that. My original variant used multiply on each register as well, for that reason.

But I would fear that this algebraically interacts with the PCG output function? Maybe xor some constants into the PCG output, for good measure?

But even then, we should do the old canard “discard first N output values” on the child, in order to allow xoshiros internal mixing to do its job, same as my original variant did.

The “parent stream is undisturbed” is a pretty cool invariant, though!

2 Likes

I tried out changing the perturbation to multiplication here: sk/mulsplit by StefanKarpinski · Pull Request #53408 · JuliaLang/julia · GitHub. This change certainly fixes this specific issue. I’m not convinced that xoring with random output constants after PCG is necessary. I also worked back through the proof from the DotMix paper using a multiplicative “dot product” instead of the additive one and it does go through. Of course, there are half as many multiplicative weights (odd only) as additive weights, but we have so many bits it really doesn’t matter.

This is not what I find objectionable in that paper.

My objection is more fundamental: They construct a family (X_i)_{i\in I} of random number generators, and implicitly claim something like practical independence; but they prove that X_i and X_j are pairwise independent for i \neq j.

Yeah, no cigar. This is like constructing a bunch of vectors, proving pairwise independence, and jumping to the conclusion that the family is independent. Nope, that would require something like rank((X_i)_{i\in M}) = |M| for all finite subsets M\subseteq I. Or the same for independence of random events in statistics.

But my objection still stands: The thing their paper proves (pairwise independence) is completely insufficient for what we want (independent family).

1 Like

It’s a practical one. You don’t want the fact that some code spawned a child task or not to change the RNG stream—if you call some function and it happens to fork a task, why should that change your RNG stream? That would be—and was in previous Julia versions—not great.

5 Likes

Yes, I definitely see what you mean, but I did want to at least ensure that the multiplicative version was no worse in terms of collision resistance than the additive version. I’d be interested in directly numerically evaluating whether there are any linear relationships in rng states for a large tree of tasks with the multiplicative approach. Measuring linear relationships in the RNG state directly seems better since even though Xoshiro256++ has a fairly trivial output function, it’s not nothing; if there’s no relationship in the state, then there won’t be one in the output.

1 Like

FWIW, with your patch you of course need to test something like r01*r10 - r00*r11 (or xor or /).

I’m not convinced that it is necessary either. Unfortunately, the burden of proof goes the other way: we need to be convinced that the resulting construction is good despite the obvious algebraic interaction between the PCG output and multiplication.

I really want to advocate for discarding the first 1-4 outputs of the child RNG: this is very cheap to do, and it ensures that the state vector gets mixed (in the sense that the update acts as a function on 256-bit numbers, not componentwise on four 64-bit numbers).

1 Like

Yes, that’s the second thing I check. Before that I check the version with addition just to be sure, but of course we don’t expect any relationship there.

I’ve spent some time working on and thinking about this. First, whatever we do, we really do want some analogue of the dotmix / splitmix proof to work. Yes, it’s not sufficient, but it is necessary, and it’s something we can actually prove, whereas proving linear independence of arbitrarily many task states seems quite challenging. I believe I’ve managed to whittle down the requirements on a “dot product” accumulation function to the bare minimum.

One of the major problems with the dotmix proof (see p.5 section 3 here) is that it seems, on the face of it, to require associativity in order to make the algebraic tricks work. But when it comes to RNGs, a sequence of associative operations is pretty suspect. Fortunately, I believe you can generalize the dotmix construction to any accumulator function with some very general requirements that I’ll give next.

We’ll use the same domain for our state and our weights, call it S. The state here is an accumulator for the “dotmix” construction, but in our application, it also happens to be the state of a single 64-bit register of the main Xoshiro256 RNG. We want to apply an accumulation function to the parent state (which doesn’t change) to combine it with a pseudo-random weight, and get a new accumulator state (which is also used as part of the main RNG state). In our simplified binary dotmix/splitmix, the accumulator function is just +, which makes the state a dot product. As @foobar_lv2 has pointed out, this causes problems with correlated states when the result is also used as Xoshiro state. I’m going to show that we can change the accumulator to be something non-commutative and non-associative.

Let’s write u: S \times S \to S for the update function and we’ll write u(s, w) for the arguments. The requirement is that u must be a bijection with respect to each argument for any fixed value of the other argument. Or, written out:

  • For all w \in S: s \mapsto u(s, w) is a bijection on S
  • For all s \in S: w \mapsto u(s, w) is a bijection on S.

I’ll write the “pedigree” or task coordinates of a task as an infinite binary sequence where all but finitely many coordinates are zero, written as t = (t_1, t_2, ...) \in 2^\mathbb{N}; denote the subset of 2^\mathbb{N} with finitely many ones as T. The root task is (0, 0, 0, ...), its first child is (1, 0, 0, ...), its second child is (0, 1, 0, ...), and that task’s first child is (0, 1, 1, ...). A child’s coordinates differ from its parent’s coordinates in only one place—at the coordinate that indicates which child of the parent it is. The number of coordinates two tasks differ by is how far apart they are in the task tree (as a graph).

In order to define the “compression function” there are a few more notations. We have pseudo-random weights w = (w_1, w_2, ...) \in S^\mathbb{N}, which are common across tasks. We also have a function s_0 : 2^\mathbb{N} \to S assigning an initial state to each task. Unexpectedly, this can be completely arbitrary. Define C: 2^\mathbb{N} \to S as:

  • C_0(t) = s_0(t)
  • C_i(t) = \mathrm{ifelse}(t_i = 0, C_{i-1}(t), u(C_{i-1}(t), w_i))
  • C(t) = \lim_{i \to \infty} C_i(t)

This gives the state assigned to each task. In words:

  • Each task starts with its own arbitrary state value
  • For each i whether we apply the update function or not depends on t_i:
    • if t_i = 0 then we leave the state alone
    • if t_i = 1 then we apply the update function with w_i as second argument
  • Since tasks only have finitely many non-zero coordinates, this becomes constant after the last non-zero coordinate, so the limit is well-defined.
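
Here is a direct transcription of that definition (a sketch; the pedigree t is passed as a BitVector with the implicit trailing zeros dropped, and s0 is the task’s arbitrary initial state):

function C(u, w::Vector{UInt64}, t::BitVector, s0::UInt64)
    s = s0                          # arbitrary per-task initial state
    for i in eachindex(t)
        t[i] && (s = u(s, w[i]))    # t_i = 1: apply the update; else skip
    end
    return s    # constant after the last 1-coordinate, hence the limit
end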

Ok, we’re done with notation stuff. Now to the meat. We want to show that for two distinct tasks t \ne t', the chance of C(t) = C(t') is 1/|S|. Let i be the last coordinate where t and t' differ. If C(t) = C(t') then we can conclude that C_i(t) = C_i(t') based on the fact that u(s, w) = u(s', w) implies s = s' since s \mapsto u(s, w) is a bijection. That leaves us with C_i(t) = C_i(t') and t_i \ne t'_i. Without loss of generality, let t_i = 0 and t'_i = 1. Applying the definition of C_i we have

C_{i-1}(t) = u(C_{i-1}(t'), w_i).

Now, disregard how C_{i-1} is produced and just let s = C_{i-1}(t) and s' = C_{i-1}(t'). This simplifies the above equation to

s = u(s', w_i).

We want this to have a 1/|S| chance, which is precisely what we get since w \mapsto u(s, w) is a bijection, meaning that each output value, s, happens for exactly one input value, w, which has a 1/|S| chance of being the value of w_i. This probability is the same regardless of the first argument, s', although which value is required changes if s' does.

What’s somewhat surprising about this result is that most of the computation of C(t) doesn’t matter at all. It starts with a completely arbitrary initial value, and all we really care about is that at the last point where they differ, the two tasks have some state values—same, different, doesn’t matter!—and then if we leave one alone but apply the update function to the other one with a random weight, it’s unlikely that they’ll end up the same after that, only happening if the random weight happens to be one specific value that depends on the two incoming state values, whatever they may be.

With that proof out of the way, what would be a good update function? We are no longer burdened with any need for commutativity, associativity, etc. Only that it’s a two-argument function S \times S \to S that is a bijection with respect to each argument. At this point I’m just going to say what I think is a good function and then justify it:

function update(s::UInt64, w::UInt64)
    s = bswap(s)    # cheap byte swap; commutes with none of the arithmetic below
    2s*w + s + w    # doubly bijective combine: ((2s+1)(2w+1)) >> 1 mod 2^64
end

The first step is to show that this function is a bijection with respect to both s and w. Note:

(2s + 1)(2w + 1) \div 2 = \\ (4sw + 2s + 2w + 1) \div 2 = \\ 2sw + s + w

Odd numbers like 2s+1 and 2w+1 have multiplicative inverses, modulo a power of two like 2^{64}, and the product is guaranteed to be odd, so shifting the last bit away doesn’t discard any information. Explicitly, the inverse of s \mapsto 2sw + s + w is

s' \mapsto (2s'+1)(2w+1)^{-1} \div 2

Substituting s' = u(s, w) to check:

(2\,u(s, w) + 1)(2w + 1)^{-1} \div 2 = \\ (2(2sw + s + w) + 1)(2w + 1)^{-1} \div 2 = \\ (4sw + 2s + 2w + 1)(2w + 1)^{-1} \div 2 = \\ (2s + 1)(2w + 1)(2w + 1)^{-1} \div 2 = \\ (2s + 1) \div 2 = s

Similarly for w. Of course, this is worth testing with a reduced state space:

julia> f(a::UInt8, b::UInt8) = 2a*b + a + b
f (generic function with 1 method)

julia> F = [f(a, b) for a=0x0:0xff, b=0x0:0xff];

julia> all(allunique(c) for c in eachcol(F))
true
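
The same table also lets us check bijectivity in the other argument, which should hold by the same (2a+1)(2b+1) argument:

julia> all(allunique(r) for r in eachrow(F))
true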

Why include the bswap in update? It’s cheap and doesn’t commute nicely with any other algebraic operations, so it’s likely to help trip up any clean algebraic relationships. Multiplication also tends to have more chaotic effects on the high bits, so bswapping between “rounds” mixes around the bits that are most affected by the update function. It’s also applied to the state in parallel to the PCG output computation of the weight before combining them, so it’s likely to be free.

I’ve implemented this update function here (a commit in this PR):

Yes, none of this proves linear independence of N task states, but I rather doubt we can prove that analytically. We should probably test that for up to 256 tasks, we get 256 linearly independent states. Why 256? Because there are only 256 bits in Xoshiro256 states, so after that you’re guaranteed that some subset has a linear relationship.

5 Likes

@foobar_lv2, what are your thoughts on how to empirically test for correlation of task states? We can generate a tree of tasks and have each one either generate a number or dump its RNG state and then see if there’s any relationships among some subset, but I’m not sure what the right test would be. For any non-trivial number of tasks, there are too many potential interactions. Any ideas?

Here’s some code I wrote that generates weights in the manner of our task RNG system, simulates spawning a full binary tree of tasks, initializing their states as the task system would, and then computes the GF(2) rank of those state values, interpreted as a binary matrix:

function gen_weights(n::Integer; init::UInt64 = rand(UInt64))
    a = 0x214c146c88e47cb7
    m = 0xaef17502108ef2d9
    # PCG-style non-linear output function applied to the raw LCG state
    function pcg(lcg::UInt64)
        p = lcg + a
        p ⊻= p >> ((p >> 59) + 5)
        p *= m
        p ⊻= p >> 43
    end
    lcg = init
    weights = zeros(UInt64, n)
    for i = 1:n
        weights[i] = pcg(lcg)
        lcg = lcg*0xd1342543de82ef95 + 1   # advance the internal LCG
    end
    return weights
end

function gen_states(
    update  :: Function,
    weights :: Vector{UInt64};
    init    :: UInt64 = rand(UInt64),
)
    n = length(weights)
    # State for the task whose pedigree is the bit pattern of b: apply the
    # update function for every set bit, mirroring the definition of C(t).
    function state(b::Int)
        s = init
        for i = 1:n
            if isodd(b >> (i-1))
                s = update(s, weights[i])
            end
        end
        return s
    end
    [state(b) for b = 0:2^n-1]
end

function update(s::UInt64, w::UInt64)
    s = bswap(s)
    2s*w + s + w
end

# Gaussian elimination over GF(2), using each row's lowest set bit as pivot.
function gf2_rank!(rows::Vector{UInt64})
    rank = 0
    while !isempty(rows)
        pivot_row = pop!(rows)
        if pivot_row != 0
            rank += 1
            lsb = pivot_row & -pivot_row
            for (index, row) in enumerate(rows)
                if row & lsb != 0
                    rows[index] = row ⊻ pivot_row
                end
            end
        end
    end
    return rank
end

gf2_rank(rows::Vector{UInt64}) = gf2_rank!(copy(rows))

function test_rank(update::Function, n::Integer=6)
    weights = gen_weights(n)
    states = gen_states(update, weights)
    gf2_rank!(states)
end

Calling test_rank repeatedly gives a sampling of the resulting matrix ranks, which we want to be close to 64. The update function I proposed does well, but then again so does +. If you pass ⊻ as the update function then it does abysmally—rank 7 every time—as expected. The trouble here is that computing the GF(2) rank of the matrix only measures exact linear correlations, whereas what we actually care about is near-linear correlations between far fewer states. After all, with enough states there will always be correlations, so we don’t really care about that.
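
For reference, that sampling can be done like so (a usage sketch; exact ranks vary from run to run):

ranks_mul = [test_rank(update) for _ = 1:100]  # proposed update: near 64
ranks_add = [test_rank(+)      for _ = 1:100]  # plain +: also near 64 here
ranks_xor = [test_rank(⊻)      for _ = 1:100]  # xor: rank 7 every time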

1 Like

I think I’d prefer something that works in typical crypto-engineering style.

That is: We have a standard stream-producing PRNG. Alice and Eve play the following game:

Alice seeds the PRNG with true random data. Then she hands over two streams to Eve: One is the PRNG and the other is true random. Eve now looks at these streams, and tries to distinguish them, i.e. guess which is which.

If there is an algorithm for Eve to guess with better than even chance, given typical constraints on computation (no trying 2^128 different guesses!), then this is a failure of the PRNG.

====================
With this definition, we can take a look at the old construction we had.

The old model was the following: We have the forking structure of tasks with local PRNGs. Forking is permitted to update the internal state of the PRNG.

The game now is the following: Eve gets oracle access to two forkable random streams (really now trees, not streams), and tries to distinguish which is true random and which is from the PRNG.

The universal construction we had was the following: On forking, extract output from the PRNG and use that to seed the child.

Now we have the following trivial theorem:

If there is a distinguisher for this tree-PRNG, then there exists a distinguisher for the original stream-PRNG this was based on.

Proof idea: The construction never accesses internal state. Hence Eve can use her oracle access to the stream-PRNG to construct oracle access to the tree-PRNG; if she can distinguish the tree-PRNG then this also distinguishes the stream-PRNG.

In other words: The construction of the tree-PRNG does not introduce new weaknesses. The proof is simple because it only uses oracle access.

It is for this reason that I insisted so strongly that the child PRNG must be seeded by the output of the parent, not by a scrambled version of its internal state.

====================

We now have a more difficult job: We are not permitted to extract from the parent PRNG stream to seed the child. Cool new invariant, yay!

I think the thing boils down to the following universal construction:

We have two independent PRNG states, one for OUTPUT and one for FORK. When forking, we use (and advance!) the FORK rng of the parent, using its output to seed the child’s two RNGs.

The same kind of proof applies: Because this construction only uses oracle access, it cannot spoil.

====================
Now your construction uses PCG for the FORK rng and xoshiro for the OUTPUT rng.

A small problem is that the FORK rng has too small a state. We want to mix in the xoshiro state, in a way that cannot spoil the construction.

An easy way would be the following: when forking, we peek at the next 5 output values of the parent’s OUTPUT stream, and we advance the parent’s FORK rng 5 times. We then mix these 5 values with e.g. your update function (if this were crypto, we’d just xor them), and that’s the seed for the child’s OUTPUT and FORK rngs.
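
In pseudocode, using the update function from above (peek_output and next_fork! are hypothetical stand-ins for “peek at the parent’s OUTPUT stream without consuming it” and “advance the parent’s FORK rng”):

function fork_seed!(parent)
    outs  = ntuple(i -> peek_output(parent, i), 5)  # parent stream untouched
    forks = ntuple(_ -> next_fork!(parent), 5)      # FORK rng does advance
    mixed = map(update, outs, forks)                # were this crypto: xor
    return mixed[1:4], mixed[5]   # child's OUTPUT registers and FORK state
end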

If the FORK rng is good, then it masks the otherwise forbidden reuse of the parent’s OUTPUT.

Regardless, we do not access internal state, ever.

Btw, I would still discard the first 4 values of the child rng. Really, we want to make use of the xoshiro avalanche / mixing.

====================

This kind of property is not sufficient for us. The issue is that xoshiro and PCG are very much not cryptographically secure. Hence algebraic interactions are possible, and we need to work to avoid them.

However, I think that our construction should still be universal and secure, with certain semi-principled tweaks to get rid of interactions.

1 Like

There’s an issue with this construction. You take the output of one FORK rng and use that output to seed the child’s FORK rng, which is then used to seed the FORK of that task’s child, and so on. If you sample one value from each of those tasks, that should be a valid sequence of RNG samples. But is feeding the output of an RNG into the state of the same RNG repeatedly a good RNG? Who knows! We don’t have any good reason to expect that to be a good construction with decent statistical behavior. It might have a very short period or totally obvious statistical artifacts.
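
To make the concern concrete, here is the kind of chain in question, sketched with Xoshiro playing both roles (a sketch, not the actual task machinery):

using Random

# Each "task" samples one value, then its child is seeded entirely from
# the parent's own outputs. Is the resulting stream of samples any good?
function chain(n::Integer)
    rng = Random.Xoshiro()   # root, strongly seeded
    samples = UInt64[]
    for _ in 1:n
        push!(samples, rand(rng, UInt64))
        rng = Random.Xoshiro(rand(rng, UInt64), rand(rng, UInt64),
                             rand(rng, UInt64), rand(rng, UInt64))
    end
    return samples
end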

3 Likes

This is not an issue if FORK is indistinguishable from a random oracle:

If Eve were able to distinguish the output of iteratively seeded FORKs from true random, then she would be able to use this technique to distinguish FORK itself from true random.

This of course doesn’t invalidate your concern – our FORK and OUTPUT rngs are definitely not cryptographically secure.

What I am saying is: We should aim for a construction that:

  1. Is universal / blackbox, in the sense that it can be applied / parametrized by arbitrary component RNGs.
  2. Our construction should at the very least work if our component RNGs are cryptographically secure. This means: If our component RNGs have no discernible algebraic interactions, then our construction should not introduce any.

So yes, if FORK is not cryptographically secure, then we may need tweaks to stymie algebraic interactions.

The pairwise linear independence proof from the cited paper gives you a good feeling – even though it is completely insufficient to rule out e.g. 4-way correlations.

Similarly, a proof of cryptographic security when parametrized with cryptographically secure PRNGs does give me a good feeling – even though this is completely insufficient to rule out statistic anomalies when parametrized with statistically good but cryptographically insecure component RNGs.

2 Likes

I think that accessing internal states is a big no-no. We should only look at the output.

Regarding the testing: It appears clear that the first few outputs after forking are the most dangerous ones.

One test I would definitely do is the following: For all 128 forking sequences in {C,P}^7, extract 4 UInt64 numbers. So this generates 512 numbers. Do it again and again, yielding a random stream.

Feed a couple hundred GB of that into standard test suites.
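
A harness for this could look roughly like the following (a sketch under one reading of the forking words: each letter forks once, :C continues in the child, :P continues in the parent; the output stream would then go to standard suites such as PractRand):

using Base.Threads

# Walk one forking word, then extract 4 UInt64s from wherever we end up.
function walk(word, i::Int = 1)
    i > length(word) && return rand(UInt64, 4)
    if word[i] === :C
        return fetch(Threads.@spawn walk(word, i + 1))  # continue in child
    else
        wait(Threads.@spawn nothing)    # fork a child we never use
        return walk(word, i + 1)        # continue in parent
    end
end

# All 128 words in {C,P}^7, 4 numbers each: 512 UInt64s per pass.
words = vec(collect(Iterators.product(ntuple(_ -> (:C, :P), 7)...)))
pass() = reduce(vcat, walk(w) for w in words)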

An obvious tweak to make this more realistic is to use flatter pedigree trees: For example we could instead take all 130 pedigree vectors of length 9 that contain at most 3 P entries.

(If I say “tweak”, I mean: run both, don’t choose one of them!)

This test is unfortunately insufficient to detect collisions in e.g. P E C E and C E. I am unsure how to properly test that – e.g. collisions in P E C E and P C E would be acceptable, because it is impossible to observe both in the same run.

This would become simpler if we decided that such unobservable collisions are still unacceptable – then we could do the same as above, i.e. exhaustively sample all small pedigree words.

2 Likes

Regarding our splittable RNG, I looked at the SplitMix paper and, of course, they apply a non-linear finalizer (MurmurHash3) to their dot product construction. The SplitMix authors are not dumb and they realized what @foobar_lv2 pointed out would be a problem. They did not, however, as far as I found, make much of a point of why it’s so important, possibly—consciously or unconsciously—because it highlights such a deep weakness in the core of the RNG: it’s inherently a linear combination and therefore produces tasks with linear relationships between their RNG states.

Why did I not apply the MurmurHash3 finalizer? Firstly, because I didn’t realize its significance. Secondly, because it requires another field in the task to store the incrementally accumulated dot product, and we really want to keep our tasks as small as possible, so I’ve made the xoshiro256 registers each do double duty as both the state of the main RNG and as the accumulator for four independent dot products. Having four dot products massively increases our collision resistance. If you’re going to compute the dot product and then apply a non-linear finalizer to it, you have to store the dot product somewhere. This would increase our task size and reduce our direct collision resistance. Thirdly, because that would mean applying two non-linear finalizers at task creation time, the steps being:

  1. Advance the internal LCG
  2. Apply the PCG finalizer to the LCG state to get a dot product weight
  3. Add the PCG-generated weight to the dot-product field in the child
  4. Apply the MurmurHash3 finalizer to the dot product
  5. Use that output to perturb the xoshiro256 state

Those two non-linear finalizer steps (2 & 4) are relatively expensive; it would be nice to do only one of them on task construction instead of two. Also, with this construction we only get one 64-bit output to perturb with, although I’m sure different MurmurHash3 variations could be concocted.

I’ve solved the problem a different way in the final version of this PR that preserves the benefits of my original approach while being massively less linear than SplitMix. It’s based on the observation I made above: the DotMix/SplitMix proof relies heavily on addition, the most linear of all operations, and specifically its associativity; as outlined above, I figured out that we can make a similar proof work as long as the function that combines mix state with mixing weights is doubly bijective. In particular, it’s totally fine if it’s highly non-linear and non-associative. If we do that, we get a much better construction that still has the original collision resistance property.

What’s the specific mixing function? It’s this:

  1. Xor the LCG state with a different random constant for each xoshiro256 register.
  2. Multiplicatively combine the mix state (xoshiro register) with the xored LCG state using this doubly bijective operation: 2*c*w + c + w.
  3. Apply the PCG non-linear output function to that, using a different multiplicative value for each different xoshiro256 register.

The first and last steps are obviously bijective. It’s a little non-obvious that the middle step is a bijection in each of c and w, but it is, which is easier to see if you write it as ((2*widen(c)+1)*(2*widen(w)+1) >> 1) % UInt64.
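
This equivalence is easy to spot-check (the low 65 bits of the widened product are exact even though the UInt128 arithmetic can wrap):

c, w = rand(UInt64), rand(UInt64)
lhs = 2c*w + c + w                                           # mod 2^64
rhs = (((2*widen(c) + 1) * (2*widen(w) + 1)) >> 1) % UInt64
@assert lhs == rhs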

Anyway, this means that instead of a highly linear dot-product “compression function” like DotMix and SplitMix, we have a highly non-linear compression function that’s “pre-finalized” and doesn’t need additional storage or an additional non-linear finalization step. That lets us use the main RNG state for accumulation, which gives much better collision resistance at no additional cost to task size, and the task creation time is the same as it was in the previous (broken) version.

7 Likes

Oh, one other thing that differs from SplitMix: they use a pre-generated table of 1024 random values as their weights. I don’t do that and instead use an LCG to generate weights. Why? Lugging around 1024 weights in your runtime isn’t great, but not terrible (1024 × 8 bytes is only 8KB). The bigger issue is what happens after you’ve used more than 1024 weights? You have to reuse them. In SplitMix a new weight is used for each level of the task tree, so this only happens if you have a task tree more than 1024 deep, i.e. a task is spawned with 1024 ancestors. That does seem uncommon (but not impossible). In my construction, we use a new weight every time a child task is forked, which means we’d start reusing weights after a single task forked 1024 children, which is clearly not acceptable.

Using an internal RNG to generate weights on the fly seemed like a reasonable alternative, and an LCG is so simple and efficient; the main issue is that it would be madness to use a raw LCG to generate weights for a linear construction like a dot product—you’re basically guaranteed poor linear independence. My solution previously was to apply the PCG finalizer to delinearize, which worked well enough in my reduced size simulations. The new solution is to make the mix construction itself non-linear, which means that the “weights” having a somewhat simple linear structure seems ok. Like in PCG, the LCG state is just a semi-random input to a more complex non-linear output process.

I would feel much safer if we could just use a CSPRNG for seeding child RNGs and be done with it.

That gives an additional invariant: S E C E == S C E, i.e. a task’s RNG stream depends only on task structure, not on extractions in the parent.

A simple small-state CSPRNG is AES in counter mode. The key can be globally fixed, and ideally we’d have 16 bytes for the counter. I see:

julia> using Random123, BenchmarkTools

julia> function foo(keyref, src, dst)
           @inbounds for i = 1:3
               dst[i] = Random123.aesni(keyref[], (src[i],))[1]
           end
       end

julia> src = rand(UInt128, 3); dst = rand(UInt128, 3); k = Ref(ntuple(i->rand(UInt128), 11));

julia> @btime foo($k, $src, $dst)
  25.503 ns (0 allocations: 0 bytes)

julia> f() = Threads.@spawn 1+1;

julia> @btime f()
  151.597 ns (4 allocations: 480 bytes)
Task (done) @0x00007fee7b798010

julia> g() = fetch(f());

julia> @btime g()
  1.243 μs (4 allocations: 480 bytes)
2

If we defer the initialization of the rng-state, i.e. roll it into the thread that eventually mounts the task, then a small cryptographic function has negligible runtime cost (the parent thread only needs to increment a counter).

For comparison

julia> baz() = rand(UInt64)+rand(UInt64)+rand(UInt64)+rand(UInt64)+rand(UInt64);

julia> @btime baz()
  8.320 ns (0 allocations: 0 bytes)
0x4b6211c18a614077

The goal of making multithreaded, pseudo-randomized algorithms reproducible is a good one. How about having a method of setting a different counter for different thread-spawning code?

This would create invariants such as: S E C1 C2 C1 E == S E C1 C1 E. It would allow separating random threading operations from the meat of the reproducible calculation.