Linear relationship between Xoshiro tasks

So, one of the things that irritate me in your construction is the following:

Your fork structure has the general shape childXoro[i] = func(i, parentXoro[i], parentLCG). In other words, there is absolutely no mixing across the 4 different state numbers of the parent.

Second: A valid output sequence is n -> S C^n E, i.e. we always look at the first output of the child RNG.

The quality of this sequence is mostly equivalent to the quality of your LCG + mixing.

Now, I propose to consider the following model RNG:

function advance(s0,s1,s2,s3)
  output = s0
  s0 = s1
  ctr = (UInt128(s2)<<64) + s3
  s1 = aes_enc(KEY, ctr) %UInt64
  ctr += 1
  s2 = (ctr>>64) %UInt64
  s3 = ctr % UInt64
  (output, (s0,s1,s2,s3))
end

This is a very good (cryptographic!) PRNG. But it is clear that the first two outputs just dump parts of its internal state.

This model maybe makes analysis clearer, by abstracting away implementation details of the base (xoshiro) RNG.

==============
I took another look at xoshiro. The internet drama between xoshiro and PCG is somewhat ironic because both use a similar structure:

The state update is linear (\mathbb{Z}_2-linear for xoshiro, \mathbb{Z}_64-linear for PCG); and a final nonlinear output mixing function is applied.

So this linear structure allows us to quickly check how fast xoshiro mixes:

julia> function adv!(dst::Vector{UInt64}, s::Vector{UInt64})
       s0, s1, s2, s3 = s
       tmp = s0+s3
       res = (tmp<<23) | (tmp>>41)
       res += s0
       t = s1<<17
       s2 = xor(s2, s0)
       s3 = xor(s3, s1)
       s1 = xor(s1, s2)
       s0 = xor(s0, s3)
       s2 = xor(s0, t)
       s3 = (s3 <<45) | (s3>>19)
       dst[1] = s0
       dst[2] = s1
       dst[3] = s2
       dst[4] = s3
       res
       end
julia> function adv(s::Vector{UInt64})
       res = zeros(UInt64, 4)
       adv!(res, s)
       res
       end
julia> a=rand(UInt64, 4); b=rand(UInt64, 4);

julia> xor.(adv(a), adv(b)) == adv(xor.(a,b)) #verify linearity
true
julia> function onehot(i)
       r=falses(64*4)
       r[i] = true
       r.chunks
       end
julia> function nweight(a::Vector{UInt64})
       w = sum(count_ones(x) for x in a)
       n = 64*length(a)
       min(w, n-w)
       end
julia> v0=onehot(17);for i =0:40
       @show i, nweight(v0)
       adv!(v0, v0);
       end
(i, nweight(v0)) = (0, 1)
(i, nweight(v0)) = (1, 3)
(i, nweight(v0)) = (2, 3)
(i, nweight(v0)) = (3, 9)
(i, nweight(v0)) = (4, 9)
(i, nweight(v0)) = (5, 17)
(i, nweight(v0)) = (6, 20)
(i, nweight(v0)) = (7, 27)
(i, nweight(v0)) = (8, 27)
(i, nweight(v0)) = (9, 35)
(i, nweight(v0)) = (10, 38)
(i, nweight(v0)) = (11, 45)
(i, nweight(v0)) = (12, 45)
(i, nweight(v0)) = (13, 53)
(i, nweight(v0)) = (14, 56)
(i, nweight(v0)) = (15, 57)
(i, nweight(v0)) = (16, 61)
(i, nweight(v0)) = (17, 61)
(i, nweight(v0)) = (18, 60)
(i, nweight(v0)) = (19, 61)
(i, nweight(v0)) = (20, 65)
(i, nweight(v0)) = (21, 65)
(i, nweight(v0)) = (22, 64)
(i, nweight(v0)) = (23, 65)
(i, nweight(v0)) = (24, 69)
(i, nweight(v0)) = (25, 69)
(i, nweight(v0)) = (26, 68)
(i, nweight(v0)) = (27, 69)
(i, nweight(v0)) = (28, 71)
(i, nweight(v0)) = (29, 77)
(i, nweight(v0)) = (30, 76)
(i, nweight(v0)) = (31, 77)
(i, nweight(v0)) = (32, 79)
(i, nweight(v0)) = (33, 85)
(i, nweight(v0)) = (34, 84)
(i, nweight(v0)) = (35, 85)
(i, nweight(v0)) = (36, 87)
(i, nweight(v0)) = (37, 93)
(i, nweight(v0)) = (38, 92)
(i, nweight(v0)) = (39, 93)
(i, nweight(v0)) = (40, 95)

Ok, I made a testing suite. I grabbed pracrand (pre0.95) from sourceforge (seriously? who uses that still?).

Then I used the following script:

import Random
function E((s0,s1,s2,s3,s4))
    tmp = s0+s3
    res = (tmp<<23) | (tmp>>41)
    res += s0
    t = s1<<17
    s2 = xor(s2, s0)
    s3 = xor(s3, s1)
    s1 = xor(s1, s2)
    s0 = xor(s0, s3)
    s2 = xor(s0, t)
    s3 = (s3 <<45) | (s3>>19)
    (res, (s0,s1,s2,s3,s4))
end

function split(s)
    mul =  0xd1342543de82ef95
    a  = (0x214c146c88e47cb7, 0xa66d8cc21285aafa, 0x68c7ef2d7b1a54d4, 0xb053a7d7aa238c61)
    m = (0xaef17502108ef2d9, 0xf34026eeb86766af, 0x38fd70ad58dd9fbb, 0x6677f9b93ab0c04d)
    
    x = s[5] * mul + 1
    child = ntuple(4) do idx
        c = s[idx]
        w = xor(x , a[idx])
        c += w*(2*c + 1)
        c = xor(c, c >> ((c>>59) + 5))
        c *= m[idx]
        c = xor(c, c>>43)
    end
    parent = ntuple(i->s[i], 4)

    ((parent..., x),
    (child..., x))
end

function emit_recurse(buf, state, depth)::NTuple{5, UInt64}
    if depth == 0
        r, state = E(state)
        write(buf, r)
        r, state = E(state)
        write(buf, r)
        state
    else
        parent, child = split(state)
        emit_recurse(buf, child, depth-1)
        emit_recurse(buf, parent, depth-1)
    end
end


function emit_recurse_loop(buf, state, depth, N)
    sz = 0
    while sz < N
        state = emit_recurse(buf, state, depth)
        sz += 1 << (depth+1)
    end
end

function emit_recurse_loop_reseeded(buf, state, depth, N)
    state = ntuple(i->rand(Random.RandomDevice(), UInt64), 5)
    sz = 0
    while sz < N
        emit_recurse(buf, state, depth)
        sz += 1 << (depth+1)
        state = ntuple(i->rand(Random.RandomDevice(), UInt64), 5)
    end
end

function emit_deep_loop(buf, state, N)
    sz = 0
    while sz < N
        parent, state = split(state)
        r, parent = E(parent)
        write(buf, r)
        r, parent = E(parent)
        write(buf, r)
        sz += 2
    end
end

function emit_shallow_loop(buf, state, N)
    sz = 0
    while sz< N
      r, state = E(state)
      write(buf, r)
      sz += 1
    end
end


Base.buffer_writes(stdout)
#julia> ntuple(i->rand(Random.RandomDevice(),UInt64), 5)
#(0xe6bb09b2c315d3da, 0x477b96132c09a3fc, 0xf543c48fbd2bb03f, 0x97becdc7ae6734cb, 0xa56101f92d1777ba)
#
init_state = (0xe6bb09b2c315d3da, 0x477b96132c09a3fc, 0xf543c48fbd2bb03f, 0x97becdc7ae6734cb, 0xa56101f92d1777ba)

And now the tests. First, congrats, you removed the problem:

$ julia -L rngScript.jl -e "emit_recurse_loop_reseeded(stdout, init_state, 7, 1<<30)" | ./RNG_test stdin
RNG_test using PractRand version 0.95
RNG = RNG_stdin, seed = unknown
test set = core, folding = standard(unknown format)

rng=RNG_stdin, seed=unknown
length= 256 megabytes (2^28 bytes), time= 3.1 seconds
  no anomalies in 217 test result(s)

rng=RNG_stdin, seed=unknown
length= 512 megabytes (2^29 bytes), time= 7.2 seconds
  Test Name                         Raw       Processed     Evaluation
  FPF-14+6/16:all                   R=  +4.8  p =  5.3e-4   unusual          
  ...and 231 test result(s) without anomalies

rng=RNG_stdin, seed=unknown
length= 1 gigabyte (2^30 bytes), time= 14.5 seconds
  no anomalies in 251 test result(s)

rng=RNG_stdin, seed=unknown
length= 2 gigabytes (2^31 bytes), time= 27.7 seconds
  no anomalies in 269 test result(s)

rng=RNG_stdin, seed=unknown
length= 4 gigabytes (2^32 bytes), time= 52.6 seconds
  no anomalies in 283 test result(s)

error reading standard input

Second, congrats, the degraded LCG + output-mix mode (where xoshiro is mostly ignored,S C^n E) is also good enough:

$ julia -L rngScript.jl -e "emit_deep_loop(stdout, init_state, 1<<30)" | ./RNG_test stdin
RNG_test using PractRand version 0.95
RNG = RNG_stdin, seed = unknown
test set = core, folding = standard(unknown format)

rng=RNG_stdin, seed=unknown
length= 256 megabytes (2^28 bytes), time= 2.8 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/32]DC6-9x1Bytes-1           R=  -4.4  p =1-3.6e-3   unusual          
  ...and 216 test result(s) without anomalies

rng=RNG_stdin, seed=unknown
length= 512 megabytes (2^29 bytes), time= 6.7 seconds
  no anomalies in 232 test result(s)

rng=RNG_stdin, seed=unknown
length= 1 gigabyte (2^30 bytes), time= 13.4 seconds
  no anomalies in 251 test result(s)

rng=RNG_stdin, seed=unknown
length= 2 gigabytes (2^31 bytes), time= 25.6 seconds
  no anomalies in 269 test result(s)

rng=RNG_stdin, seed=unknown
length= 4 gigabytes (2^32 bytes), time= 48.3 seconds
  no anomalies in 283 test result(s)

error reading standard input

However, xoshiro itself seems to fail the test battery:

$ julia -L rngScript.jl -e "emit_shallow_loop(stdout, init_state, 1<<30)" | ./RNG_test stdin
RNG_test using PractRand version 0.95
RNG = RNG_stdin, seed = unknown
test set = core, folding = standard(unknown format)

rng=RNG_stdin, seed=unknown
length= 256 megabytes (2^28 bytes), time= 2.5 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN(2+0,13-2,T)                  R>+99999  p = 0           FAIL !!!!!!!!  
  BCFN(2+1,13-2,T)                  R>+99999  p = 0           FAIL !!!!!!!!  
  BCFN(2+2,13-3,T)                  R>+99999  p = 0           FAIL !!!!!!!!  
  BCFN(2+3,13-3,T)                  R>+99999  p = 0           FAIL !!!!!!!!  
  BCFN(2+4,13-3,T)                  R>+99999  p = 0           FAIL !!!!!!!!  
  BCFN(2+5,13-4,T)                  R>+99999  p = 0           FAIL !!!!!!!!  
  BCFN(2+6,13-5,T)                  R>+99999  p = 0           FAIL !!!!!!!!  
  BCFN(2+7,13-5,T)                  R>+99999  p = 0           FAIL !!!!!!!!  
  BCFN(2+8,13-6,T)                  R>+99999  p = 0           FAIL !!!!!!!!  
  BCFN(2+9,13-6,T)                  R>+99999  p = 0           FAIL !!!!!!!!  
  BCFN(2+10,13-7,T)                 R=+68222  p = 0           FAIL !!!!!!!!  
  BCFN(2+11,13-8,T)                 R=+40590  p = 0           FAIL !!!!!!!!  
  BCFN(2+12,13-8,T)                 R=+20312  p =  8e-5156    FAIL !!!!!!!!  
  BCFN(2+13,13-9,T)                 R=+11682  p =  5e-2626    FAIL !!!!!!!!  
  BCFN(2+14,13-9,T)                 R= +5834  p =  4e-1312    FAIL !!!!!!!!  
  DC6-9x1Bytes-1                    R>+99999  p = 0           FAIL !!!!!!!!  
  Gap-16:A                          R>+99999  p = 0           FAIL !!!!!!!!  
  Gap-16:B                          R>+99999  p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:(0,14-0)              R>+99999  p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:(1,14-0)              R>+99999  p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:(2,14-0)              R>+99999  p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:(3,14-0)              R>+99999  p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:(4,14-1)              R>+99999  p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:(5,14-2)              R>+99999  p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:(6,14-2)              R>+99999  p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:(8,14-4)              R>+99999  p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:all                   R>+99999  p = 0           FAIL !!!!!!!!  
  FPF-14+6/16:cross                 R>+99999  p = 0           FAIL !!!!!!!!  
  BRank(12):128(4)                  R= +2544  p~=  4e-1354    FAIL !!!!!!!!  
  BRank(12):256(4)                  R= +9433  p~=  7e-5018    FAIL !!!!!!!!  
  BRank(12):384(1)                  R= +6783  p~=  5e-2043    FAIL !!!!!!!!  
  BRank(12):512(2)                  R=+14951  p~=  1e-4501    FAIL !!!!!!!!  
  BRank(12):768(1)                  R=+15738  p~=  8e-4739    FAIL !!!!!!!!  
  BRank(12):1K(2)                   R=+30782  p~=  2e-9267    FAIL !!!!!!!!  
  BRank(12):1536(1)                 R=+32616  p~=  2e-9819    FAIL !!!!!!!!  
  BRank(12):2K(1)                   R=+43896  p~= 0           FAIL !!!!!!!!  
  mod3n(5):(0,9-6)                  R=+33374  p = 0           FAIL !!!!!!!!  
  mod3n(5):(1,9-6)                  R=+13537  p =  2e-4623    FAIL !!!!!!!!  
  mod3n(5):(2,9-6)                  R=+37827  p = 0           FAIL !!!!!!!!  
  mod3n(5):(3,9-6)                  R=+47509  p = 0           FAIL !!!!!!!!  
  mod3n(5):(4,9-6)                  R=+48857  p = 0           FAIL !!!!!!!!  
  mod3n(5):(5,9-6)                  R=+64538  p = 0           FAIL !!!!!!!!  
  mod3n(5):(6,9-6)                  R=+52694  p = 0           FAIL !!!!!!!!  
  mod3n(5):(7,9-6)                  R=+31328  p = 0           FAIL !!!!!!!!  
  mod3n(5):(8,9-6)                  R=+24550  p =  5e-8384    FAIL !!!!!!!!  
  mod3n(5):(9,9-6)                  R=+16680  p =  1e-5696    FAIL !!!!!!!!  
  mod3n(5):(10,9-6)                 R=+10502  p =  3e-3587    FAIL !!!!!!!!  
  mod3n(5):(11,9-6)                 R= +5197  p =  1e-1775    FAIL !!!!!!!!  
  mod3n(5):(12,9-6)                 R= +2549  p =  2.1e-871   FAIL !!!!!!!   
  mod3n(5):(13,9-6)                 R= +1236  p =  6.9e-423   FAIL !!!!!!!   
  mod3n(5):(14,9-6)                 R=+584.4  p =  2.1e-200   FAIL !!!!!!    
  mod3n(5):(15,9-6)                 R=+263.8  p =  6.4e-91    FAIL !!!!!     
  TMFn(2+0):wl                      R>+99999  p~=  0          FAIL !!!!!!!!  
  TMFn(2+1):wl                      R=+72244  p~=  0          FAIL !!!!!!!!  
  TMFn(2+2):wl                      R=+36073  p~=  0          FAIL !!!!!!!!  
  TMFn(2+3):wl                      R=+17987  p~=  0          FAIL !!!!!!!!  
  [Low1/8]BCFN(2+0,13-4,T)          R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+1,13-4,T)          R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+2,13-5,T)          R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+3,13-5,T)          R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+4,13-5,T)          R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+5,13-6,T)          R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+6,13-6,T)          R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+7,13-7,T)          R=+69900  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+8,13-8,T)          R=+41294  p = 0           FAIL !!!!!!!!  
  [Low1/8]BCFN(2+9,13-8,T)          R=+20560  p =  6e-5219    FAIL !!!!!!!!  
  [Low1/8]BCFN(2+10,13-9,T)         R=+11783  p =  1e-2648    FAIL !!!!!!!!  
  [Low1/8]BCFN(2+11,13-9,T)         R= +5870  p =  4e-1320    FAIL !!!!!!!!  
  [Low1/8]DC6-9x1Bytes-1            R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]Gap-16:A                  R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]Gap-16:B                  R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]FPF-14+6/16:(0,14-0)      R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]FPF-14+6/16:(1,14-1)      R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]FPF-14+6/16:(2,14-2)      R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]FPF-14+6/16:(3,14-2)      R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]FPF-14+6/16:(4,14-3)      R=+88309  p = 0           FAIL !!!!!!!!  
  [Low1/8]FPF-14+6/16:(5,14-4)      R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]FPF-14+6/16:(7,14-5)      R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]FPF-14+6/16:all           R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]FPF-14+6/16:cross         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/8]BRank(12):128(4)          R= +4955  p~=  2e-2636    FAIL !!!!!!!!  
  [Low1/8]BRank(12):256(2)          R= +7523  p~=  1e-2265    FAIL !!!!!!!!  
  [Low1/8]BRank(12):384(1)          R= +7989  p~=  7e-2406    FAIL !!!!!!!!  
  [Low1/8]BRank(12):512(2)          R=+15377  p~=  5e-4630    FAIL !!!!!!!!  
  [Low1/8]BRank(12):768(1)          R=+16341  p~=  3e-4920    FAIL !!!!!!!!  
  [Low1/8]BRank(12):1K(1)           R=+21917  p~=  1e-6598    FAIL !!!!!!!!  
  [Low1/8]mod3n(5):(0,9-6)          R=+35339  p = 0           FAIL !!!!!!!!  
  [Low1/8]mod3n(5):(1,9-6)          R=+40584  p = 0           FAIL !!!!!!!!  
  [Low1/8]mod3n(5):(2,9-6)          R=+56010  p = 0           FAIL !!!!!!!!  
  [Low1/8]mod3n(5):(3,9-6)          R=+55194  p = 0           FAIL !!!!!!!!  
  [Low1/8]mod3n(5):(4,9-6)          R=+32167  p = 0           FAIL !!!!!!!!  
  [Low1/8]mod3n(5):(5,9-6)          R=+34369  p = 0           FAIL !!!!!!!!  
  [Low1/8]mod3n(5):(6,9-6)          R=+21702  p =  9e-7412    FAIL !!!!!!!!  
  [Low1/8]mod3n(5):(7,9-6)          R=+10792  p =  5e-3686    FAIL !!!!!!!!  
  [Low1/8]mod3n(5):(8,9-6)          R= +5342  p =  6e-1825    FAIL !!!!!!!!  
  [Low1/8]mod3n(5):(9,9-6)          R= +2622  p =  4.1e-896   FAIL !!!!!!!   
  [Low1/8]mod3n(5):(10,9-6)         R= +1267  p =  1.7e-433   FAIL !!!!!!!   
  [Low1/8]mod3n(5):(11,9-6)         R=+594.7  p =  6.2e-204   FAIL !!!!!!    
  [Low1/8]mod3n(5):(12,9-6)         R=+263.8  p =  6.4e-91    FAIL !!!!!     
  [Low1/8]TMFn(2+0):wl              R=+17987  p~=  0          FAIL !!!!!!!!  
  [Low4/32]BCFN(2+0,13-4,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+1,13-4,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+2,13-5,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+3,13-5,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+4,13-5,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+5,13-6,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+6,13-6,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+7,13-7,T)         R=+67430  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+8,13-8,T)         R=+40257  p = 0           FAIL !!!!!!!!  
  [Low4/32]BCFN(2+9,13-8,T)         R=+20194  p =  7e-5126    FAIL !!!!!!!!  
  [Low4/32]BCFN(2+10,13-9,T)        R=+11634  p =  3e-2615    FAIL !!!!!!!!  
  [Low4/32]BCFN(2+11,13-9,T)        R= +5817  p =  3e-1308    FAIL !!!!!!!!  
  [Low4/32]DC6-9x1Bytes-1           R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]Gap-16:A                 R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]Gap-16:B                 R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]FPF-14+6/16:(0,14-0)     R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]FPF-14+6/16:(1,14-1)     R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]FPF-14+6/16:(2,14-2)     R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]FPF-14+6/16:(3,14-2)     R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]FPF-14+6/16:(4,14-3)     R=+88309  p = 0           FAIL !!!!!!!!  
  [Low4/32]FPF-14+6/16:all          R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]FPF-14+6/16:cross        R>+99999  p = 0           FAIL !!!!!!!!  
  [Low4/32]BRank(12):128(4)         R= +4955  p~=  2e-2636    FAIL !!!!!!!!  
  [Low4/32]BRank(12):256(2)         R= +7523  p~=  1e-2265    FAIL !!!!!!!!  
  [Low4/32]BRank(12):384(1)         R= +7989  p~=  7e-2406    FAIL !!!!!!!!  
  [Low4/32]BRank(12):512(2)         R=+15377  p~=  5e-4630    FAIL !!!!!!!!  
  [Low4/32]BRank(12):768(1)         R=+16341  p~=  3e-4920    FAIL !!!!!!!!  
  [Low4/32]BRank(12):1K(1)          R=+21917  p~=  1e-6598    FAIL !!!!!!!!  
  [Low4/32]mod3n(5):(0,9-6)         R=+63820  p = 0           FAIL !!!!!!!!  
  [Low4/32]mod3n(5):(1,9-6)         R=+45790  p = 0           FAIL !!!!!!!!  
  [Low4/32]mod3n(5):(2,9-6)         R=+59474  p = 0           FAIL !!!!!!!!  
  [Low4/32]mod3n(5):(3,9-6)         R=+54080  p = 0           FAIL !!!!!!!!  
  [Low4/32]mod3n(5):(4,9-6)         R=+36751  p = 0           FAIL !!!!!!!!  
  [Low4/32]mod3n(5):(5,9-6)         R=+25215  p =  4e-8611    FAIL !!!!!!!!  
  [Low4/32]mod3n(5):(6,9-6)         R=+17137  p =  8e-5853    FAIL !!!!!!!!  
  [Low4/32]mod3n(5):(7,9-6)         R=+10792  p =  5e-3686    FAIL !!!!!!!!  
  [Low4/32]mod3n(5):(8,9-6)         R= +5342  p =  6e-1825    FAIL !!!!!!!!  
  [Low4/32]mod3n(5):(9,9-6)         R= +2622  p =  4.1e-896   FAIL !!!!!!!   
  [Low4/32]mod3n(5):(10,9-6)        R= +1267  p =  1.7e-433   FAIL !!!!!!!   
  [Low4/32]mod3n(5):(11,9-6)        R=+594.7  p =  6.2e-204   FAIL !!!!!!    
  [Low4/32]mod3n(5):(12,9-6)        R=+263.8  p =  6.4e-91    FAIL !!!!!     
  [Low4/32]TMFn(2+0):wl             R=+17987  p~=  0          FAIL !!!!!!!!  
  [Low1/32]BCFN(2+0,13-5,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+1,13-5,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+2,13-6,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+3,13-6,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+4,13-6,T)         R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+5,13-7,T)         R=+71158  p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+6,13-8,T)         R=+41820  p = 0           FAIL !!!!!!!!  
  [Low1/32]BCFN(2+7,13-8,T)         R=+20745  p =  7e-5266    FAIL !!!!!!!!  
  [Low1/32]BCFN(2+8,13-9,T)         R=+11858  p =  1e-2665    FAIL !!!!!!!!  
  [Low1/32]BCFN(2+9,13-9,T)         R= +5896  p =  5e-1326    FAIL !!!!!!!!  
  [Low1/32]DC6-9x1Bytes-1           R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]Gap-16:A                 R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]Gap-16:B                 R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:(0,14-2)     R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:(1,14-2)     R=+68083  p = 0           FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:(2,14-3)     R=+88309  p = 0           FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:(5,14-5)     R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:all          R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:cross        R>+99999  p = 0           FAIL !!!!!!!!  
  [Low1/32]BRank(12):128(2)         R= +3687  p~=  7e-1111    FAIL !!!!!!!!  
  [Low1/32]BRank(12):256(2)         R= +7614  p~=  4e-2293    FAIL !!!!!!!!  
  [Low1/32]BRank(12):384(1)         R= +8118  p~=  9e-2445    FAIL !!!!!!!!  
  [Low1/32]BRank(12):512(1)         R=+10895  p~=  1e-3280    FAIL !!!!!!!!  
  [Low1/32]mod3n(5):(0,9-6)         R=+44055  p = 0           FAIL !!!!!!!!  
  [Low1/32]mod3n(5):(1,9-6)         R=+35130  p = 0           FAIL !!!!!!!!  
  [Low1/32]mod3n(5):(2,9-6)         R=+40004  p = 0           FAIL !!!!!!!!  
  [Low1/32]mod3n(5):(3,9-6)         R=+27456  p =  2e-9376    FAIL !!!!!!!!  
  [Low1/32]mod3n(5):(4,9-6)         R=+18673  p =  4e-6377    FAIL !!!!!!!!  
  [Low1/32]mod3n(5):(5,9-6)         R=+11774  p =  1e-4021    FAIL !!!!!!!!  
  [Low1/32]mod3n(5):(6,9-6)         R= +5838  p =  2e-1994    FAIL !!!!!!!!  
  [Low1/32]mod3n(5):(7,9-6)         R= +2870  p =  7.3e-981   FAIL !!!!!!!   
  [Low1/32]mod3n(5):(8,9-6)         R= +1391  p =  7.4e-476   FAIL !!!!!!!   
  [Low1/32]mod3n(5):(9,9-6)         R=+656.8  p =  4.0e-225   FAIL !!!!!!    
  [Low1/32]mod3n(5):(10,9-6)        R=+294.8  p =  1.6e-101   FAIL !!!!!     

ERROR: IOError: write: broken pipe (EPIPE)
Stacktrace:
 [1] uv_write(s::Base.PipeEndpoint, p::Ptr{UInt8}, n::UInt64)

This obviously implies that the recursive variant without reseeding must also fail.

5 Likes

FWIW, I also implemented the old version, in order to verify that practrand finds the problem (who tests the tests?).

Snippet:

function split(s) #oldsplit
    mul =  0xd1342543de82ef95
    a  = (0xe5f8fa077b92a8a8, 0x7a0cd918958c124d, 0x86222f7d388588d4, 0xd30cbd35f2b64f52)
    m = (0xaef17502108ef2d9, 0xf34026eeb86766af, 0x38fd70ad58dd9fbb, 0x6677f9b93ab0c04d)
    x = s[5] * mul + 1
    child = ntuple(4) do idx
        c = x + a[idx]
        c = xor(c, c >> ((c>>59) + 5))
        c *= m[idx]
        c = xor(c, c>>43)
        s[idx] + c
    end
    parent = ntuple(i->s[i], 4)

    ((parent..., x),
    (child..., x))
end

result:

$ julia -L rngScript.jl -e "emit_recurse_loop_reseeded(stdout, init_state, 7, 1<<30)" | ./RNG_test stdin
RNG_test using PractRand version 0.95
RNG = RNG_stdin, seed = unknown
test set = core, folding = standard(unknown format)

rng=RNG_stdin, seed=unknown
length= 256 megabytes (2^28 bytes), time= 3.0 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]BCFN(2+0,13-4,T)          R= +67.9  p =  1.3e-29    FAIL !!!       
  [Low1/8]BCFN(2+1,13-4,T)          R= +67.7  p =  1.5e-29    FAIL !!!       
  [Low1/8]BCFN(2+2,13-5,T)          R= +59.3  p =  2.0e-23    FAIL !!        
  [Low1/8]BCFN(2+3,13-5,T)          R= +28.7  p =  2.1e-11   VERY SUSPICIOUS 
  [Low1/8]DC6-9x1Bytes-1            R= +14.7  p =  4.3e-9    VERY SUSPICIOUS 
  [Low4/32]BCFN(2+0,13-4,T)         R= +84.4  p =  8.2e-37    FAIL !!!       
  [Low4/32]BCFN(2+1,13-4,T)         R= +72.9  p =  8.4e-32    FAIL !!!       
  [Low4/32]BCFN(2+2,13-5,T)         R= +64.6  p =  1.8e-25    FAIL !!        
  [Low4/32]BCFN(2+3,13-5,T)         R= +21.0  p =  2.1e-8   very suspicious  
  [Low4/32]DC6-9x1Bytes-1           R= +31.2  p =  3.1e-20    FAIL !!        
  [Low1/32]BCFN(2+0,13-5,T)         R=+544.7  p =  2.0e-213   FAIL !!!!!!    
  [Low1/32]BCFN(2+1,13-5,T)         R=+163.5  p =  3.6e-64    FAIL !!!!      
  [Low1/32]BCFN(2+2,13-6,T)         R= +24.4  p =  8.5e-9   very suspicious  
  [Low1/32]DC6-9x1Bytes-1           R=+513.1  p =  7.1e-271   FAIL !!!!!!    
  [Low1/32]Gap-16:A                 R=+710.6  p =  1.5e-610   FAIL !!!!!!!   
  [Low1/32]Gap-16:B                 R= +4166  p =  1e-3380    FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:(0,14-2)     R= +3772  p =  6e-3299    FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:(1,14-2)     R= +3733  p =  1e-3264    FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:(2,14-3)     R= +2654  p =  5e-2326    FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:(3,14-4)     R= +1885  p =  2e-1540    FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:(4,14-5)     R= +1346  p =  8e-1116    FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:(5,14-5)     R=+661.4  p =  3.7e-548   FAIL !!!!!!!   
  [Low1/32]FPF-14+6/16:(6,14-6)     R=+461.4  p =  3.7e-353   FAIL !!!!!!!   
  [Low1/32]FPF-14+6/16:(7,14-7)     R=+333.4  p =  3.6e-265   FAIL !!!!!!    
  [Low1/32]FPF-14+6/16:(8,14-8)     R=+247.7  p =  2.7e-178   FAIL !!!!!!    
  [Low1/32]FPF-14+6/16:(9,14-8)     R=+114.8  p =  1.2e-82    FAIL !!!!      
  [Low1/32]FPF-14+6/16:(10,14-9)    R= +90.8  p =  1.7e-57    FAIL !!!!      
  [Low1/32]FPF-14+6/16:(11,14-10)   R= +49.5  p =  6.3e-27    FAIL !!        
  [Low1/32]FPF-14+6/16:(12,14-11)   R= +20.5  p =  8.9e-10  very suspicious  
  [Low1/32]FPF-14+6/16:all          R= +6354  p =  7e-5746    FAIL !!!!!!!!  
  [Low1/32]FPF-14+6/16:cross        R= +40.0  p =  7.8e-33    FAIL !!!       
  ...and 186 test result(s) without anomalies

ERROR: IOError: write: broken pipe (EPIPE)

I am somewhat irritated that practrand complains about vanilla xoshiro and the bad 4-way correlations in the same voice. I’m unqualified to judge which failures out of the test battery are actually fatal.

2 Likes

I have to say that my initial reaction was that using a CSPRNG would be way too slow, but I timed your foo function on my machine and it is amazingly fast—foo clocks in a 7.291 ns! Of course, my hardware has dedicated AES instructions, which are seen in the code native for foo:

	aese.16b	v17, v19
	aesmc.16b	v17, v17
	aese.16b	v17, v18
	aesmc.16b	v17, v17
	aese.16b	v17, v0
	aesmc.16b	v17, v17
	aese.16b	v17, v1
	aesmc.16b	v17, v17
	aese.16b	v17, v2
	aesmc.16b	v17, v17
	aese.16b	v17, v3
	aesmc.16b	v17, v17
	aese.16b	v17, v4
	aesmc.16b	v17, v17
	aese.16b	v17, v5
	aesmc.16b	v17, v17
	aese.16b	v17, v6
	aesmc.16b	v17, v17

Unfortunately, I suspect that this won’t be nearly so fast on other hardware. Indeed, your timing of 151.597 ns is unacceptably slow for task forking.

Moving the initialization from the parent task into the child task doesn’t reduce the amount of work, it just moves it around—the overall throughput will still take a hit in code that’s making heavy use of threading. We could make the initialization lazy and only do it when and if the child task actually uses its RNG, but that requires checking an initialization flag on every RNG usage, which could potentially slow down all RNG usage, but could be worth exploring. If the check can be branch predicted and hoisted out of loops, then it may be ok. I don’t, however, know a way to convey to Julia’s compiler that a bit can only be changed from zero to one and once it’s been shown to be one it can assumed to be so in all following code.

That’s a lot of bytes and makes the task even bigger. If it’s a straight child counter that starts at zero, you’ll never get to most of those bits anyway, so they don’t do any good. 16 bytes is only useful if you initialize the counter with random data at fork time based on the parent’s RNG state. It’s possible that’s what you’re suggesting. In that case the design would be to initialize by applying the CSPRNG to the parent’s counter and filling the child’s RNG state and fork counter with the CSPRNG output. However, you still wouldn’t have as much collision resistance as if you used a smaller counter and included the parent’s main RNG state in your CSPRNG input, which has the additional benefit of not requiring extra storage. We currently have 32 bytes of collision resistance for example, as compared to only 16 bytes you’re proposing (still plenty).

What I can see being compelling—if we can establish that AES is fast enough on hardware that we care about, which is a big “if”—is using a 64-bit fork counter and cryptographically hashing/encrypting all or some of the main RNG state as well as the fork counter when seeding the child task. That would have very high collision resistance without requiring any more state than we currently have. Of course, this design is very similar to what we’re currently doing anyway. The CSPRNG design makes the fork state simple (a counter) and the output function very good (AES). The current design uses a moderately good fork state (LCG) with a moderately good output function (PCG-RXS-M-XS-64).

I don’t understand this notation, which seems to have just been introduced in this thread without any explanation. Can you explain what this means?

The bottom line is that if you’re going to make the child task’s initial RNG state depend only on parent’s fork state and not on the main RNG state, then the amount of fork state is an entropic bottleneck and you can only resist collisions up to that amount. @foobar_lv2 has suggested making the counter 16 bytes. I’m not convinced it’s actually that helpful for child tasks to not be affected by the parent’s RNG state. Is it a nice property if it can be done for free? I guess, although I would actually find it somewhat surprising. Of course, the issue is that it’s not free and it forces the fork state to be larger since 64 bits is insufficient. I don’t think it’s worth adding 64 bits to the task size.

Full mixing across main RNG registers is a non-goal. The only goal of the task split is to make sure that there are no observable correlations between related tasks. As long as a collision in one register doesn’t increase the chance of a collision in another register, they’re doing their job.

Indeed, the drama is really silly. They are both quite good, each with their issues. If you need 64-bit output with only 64-bit state, PCG-RXS-M-XS-64 seems to be the best option. If you can afford more state, you really should and for a general purpose PRNG, it’s a must: otherwise the PRNG being full cycle (which it should be) implies that you’ll never get repeat values, which is something you would expect to see after around 2^32 values (Birthday Paradox again), which is well within reach of observability. With a larger state, multiple states project to the same output, which means you do see the same output multiple times without the entire sequence repeating.

When you have larger state then Xoshiro128/256** seem like the best options. The issue with larger versions of PCG is that they require wide multiplies, which can be slow on some hardware, whereas the xors, shifts and rotates that Xoshiro uses are universally available and fast. As far as I can tell, that’s really the main issue with the larger PCG variants.

On the other hand, Xoshiro128/256 have severe problems with some parts of their state space: when they cross regions with very few bits set (they must never be in the all-zeros state), they behave quite badly—and there are a lot of these regions. In practice, this is mitigated by the large state space (2^128 or 2^256) making it vastly unlikely to hit these regions given good seeding. Of course, it would be even better not to have these issues in the first place.

I’m happy to hear that my task splitting approach is passing tests.

This is concerning. Your initial state seems like a good one, so it would be good to figure out what’s going on here. If this is indeed just Xoshiro being observably bad, I’m very open to changing PRNGs, although it’s a pain in the butt because of all the specialized manually vectorized versions of the RNG that have to be implemented in the Random stdlib to be competitive in arrays.

2 Likes

Glad that you mention that, because 150ns is the time for task forking (f()=Threads.@spawn 1+1) on my laptop, compared to 25 ns for the aes-ctr initialization. So if we don’t defer rand initialization, then task forking gets more expensive from 150ns → 175ns.

If we think that this is bad (task forking is likely on the thread doing critical work!), then we can also defer it to the mounting thread.

The entire roundtrip currently costs 1250 ns. Roundtrip consists of: Spawn a task, get it mounted on a thread, await its result, continue running. (g() = fetch(f())). Including random initialization would make that slower from 1250ns → 1275ns. That’s a 2% overhead in the very worst case (spawning lots of tasks that literally do nothing).

All modern CPU (and GPU!) have dedicated AES instructions. Raspberry Pi users might get hosed, though.

And I don’t envy the person who would have to maintain the inline-assembly for e.g. POWER8 and POWER9. And I 100% don’t envy the persons who would need to make the build system emit the aes stuff with function multi-versioning. When building for 32bit windows or something insane.

It possibly moves the work from the critical (sequential bottleneck!) thread into a background task.

More to the point, it changes the denominator for measuring overhead: Is it 25ns / 150ns (only consider parent!) or 25ns / 1250ns (consider work done to mount child).

8 bytes are already taken by LCG. I spot 6 bytes of structure padding. 14 bytes are enough collision resistance, we can yolo it and not increase task size at all :wink:

But seriously, adding an extra 8 bytes to a several 100 byte structure is not too bad.

Yeah. In view of the positive test results on your construction, and the predictable implementation issues (annoying build and packaging issues with FMV), I think this would only be worth it if we could reap the following main benefit:

With a CSPRNG, the entire construction is beyond reasonable doubt. It is easy to reason about, it is easy to explain to people, it is easy to audit. No literature references or deep thought needed. Caesar’s wife must be above suspicion.


My big dream for RNG would be a reduced-round aes-based CSPRNG that replaces xoshiro. This would squash entire bug-classes (misuse of rand for cryptographic purposes), and elevate our entire random system above suspicion.

Unfortunately we’re not there. I am unqualified to design a CSPRNG, and aes has pretty large latency on current cpu :frowning:

(afaiu @btime allows the superscalar CPU to run multiple foo() invocations in parallel, no speculation fence in the middle. So @btime measures throughput in sequential code, not latency. This is exacerbated by the giant monster reorder buffer of your presumed apple silicon macbook)

2 Likes

This really feels like a shell game. 25ns is 25ns. Time spent in a child task isn’t free somehow. Consider the case where there’s only one thread of execution and the work is immediately done on the same thread… then this doesn’t help at all.

It’s also unclear to me how to defer work to when the child task is run, but only the first time. It’s easy to do work done only once when the task is created—you’re creating the task, so you know the initialization work has to be done and you do it and then put the task, ready to run, in the work pool. If you put an uninitialized task in a work queue then you need to check if it needs initialization every time it’s going to do some work. Maybe that’s trivial enough, maybe not. I truly don’t know. The other option would be having a separate place to put uninitialized tasks where you know they have to be initialized before they can do any work.

The current task_t size is way too big though and badly needs to be slimmed down. I’m trying not to add irremovable space to it. Our current task fork and fetch times are also too big and need to be cut down massively if our multithreading is going to really fly. The consideration here isn’t just for what we do now but that we don’t want to add work that we cannot optimize/eliminate later. It would especially be bad to promise that the child task RNG is unaffected by parent task RNG if we later decide that we don’t want to keep that promise. It’s easy to go one way and hard to go the other way.

Regarding Xoshiro256++ failing tests… it seems like it’s expected? Which seems bad. It seems like Xoshiro256** is the one that passes tests. Between this and the bad zero regions, I really feel like we may need something better.

2 Likes

(S - seed, E - extract, C - child)

The notation means, the program can spawn threads with counter 1 (C1) and counter 2 (C2). Spawning threads with one counter does not influence the other. Useful for when you have a big calculation using threads which requires reproducibility, and some other activity (say, logging / reporting) which is also multithreaded but you don’t want it to interfere with random numbers of big calculation.

Perhaps this is redundant and a different RNG should be used for second work load. So requirement is CPRNG counter be specific to RNG (each RNG has its own counter).

Apologies. I somewhat explained it in the initial post.

I am describing the operations that lead to current state of the RNG. Instead of binary trees, I’m borrowing finitely represented groups, i.e. I describe the operations in a word.

The operations are S for seeding, C for forking and continuing as the child, P for forking and continuing as the parent, and E for extracting. So when I write that S E C E == S C E, then I mean that Random.seed!(u); rand(UInt64); fetch(Threads.@spawn rand(UInt64)) == Random.seed!(u); fetch(Threads.@spawn rand(UInt64)).

True. It is a suspicious red flag, but practrand seems happy enough with your construction. I really want to stress that it is not the only goal of task split to avoid observable correlations – it is also a goal to obviously avoid observable correlations. Caesars wife and so on.

That is not too bad, maybe one day of work, now that we know the general framework. If we need fancy aes instructions then I expect much more wasted time (oh my god, the llvm intrinsics for x86 aes instructions are terribly designed and documented. And I have not even looked at the arm and power intrinsics. Oh, and there are other vectorized aes intrinsics. argh).

25ns of extra overhead can be a catastrophe when put onto something that costs 12.5ns (three times as slow!), and it can be close to negligible when put onto something that costs 5000 ns (0.5% slower). The denominator is important for judging the 25ns.

Depending on where we put the initialization, the 25ns need to be judged against either 150ns or against 1250ns. This is a big difference!

What about istaskstarted?

No worries, I missed it. Thanks for explaining. This is a good notation now that I understand it.

Yes, I definitely think that’s a case for having separate RNG objects that are explicitly passed around. Having a default RNG per task is already a luxury that most languages with serious multithreading wouldn’t afford you; Julia does it because random number generation and simulation is such a key application area. Having support for two independent per-task RNG streams is too far. How do you indicate which one you want? Do we need two different versions of each RNG-less random method? Or is there a keyword with a number? At that point you might as well pass an RNG argument instead.

The purpose of perturbing the main RNG state is to land in a different and hopefully distant part of the main RNG’s state cycle. It’s not necessarily better if you can get from any state to any other state. For example, it might be better to leave part of the RNG state untouched entirely and only perturb the other part of the state. E.g. if you have a construction like LXM you might want to scramble just the LCG state or just the XBG state: since the two generators have coprime periods, leaving one generator the same and changing the other guarantees a jumping a large distance in the cycle of the full RNG. We’re not using that kind of construction (although I think we maybe should), but it shows that the aim here isn’t the same as hashing or encryption.

The way I’m scrambling the main RNG state is intentionally dissimilar to the way the main RNG operates so that it is unlikely that you end up anywhere related in the main RNG’s cycle. Another approach would be to intentionally use the structure of the main RNG to jump somewhere we know to be far away, and it is possible to jump Xoshiro RNGs; but you still have the problem of making sure different tasks jump by different amounts and the consensus of the state of the art seems to be to seed the child’s main RNG randomly and rely on the sheer size of the RNG state space.

Aside: LXM has an additive offset field that doesn’t participate in RNG updates but which effectively parameterizes the RNG, giving different RNG cycles. However, I don’t see the benefit of this: it takes up space in the RNG instance and if you are going to spend that space in the RNG you might as well increase the RNG period with that additional state—having a 2^64 times longer period is just as effective in terms of avoiding task collisions as having 2^64 different RNG cycles.

3 Likes