Trivial port from C++/Rust to Julia

I’m trying to port a very trivial MC simulation from C++ / Rust to Julia. Here are the 3 implementations:
C++: (clang OSX) https://pastebin.com/HWEUxa6W
Rust: https://pastebin.com/7d31ynJE
Julia: https://pastebin.com/NCxnhWt1

Commands:

hyperfine --show-output -M 5 'julia --optimize=3 -O3 trnbias.jl 0 1000 0.01 1000'

(I assume this was already JIT-compiled, since I ran it a hundred times.)
Time (mean ± σ): 116.358 s ± 22.229 s [User: 110.601 s, System: 5.023 s]
Range (min … max): 91.366 s … 150.880 s 5 runs

hyperfine --show-output -M 5 'cargo run --release -q --bin trnbias -- 0 1000 0.01 1000'

(Pre-built, so cargo run includes the build step, but it's essentially a no-op.)
Time (mean ± σ): 70.857 s ± 10.696 s [User: 69.616 s, System: 0.379 s]
Range (min … max): 56.655 s … 83.764 s 5 runs

clang++ -Wall -O3 -g -lcurses -std=c++11 TrnBias.CPP -o bin/TRNBIAS
hyperfine --show-output -M 5 'bin/TRNBIAS 0 1000 0.01 1000'

Mean IS=0.0453 OOS=-0.0004 Bias=0.0457
Time (mean ± σ): 43.332 s ± 0.633 s [User: 42.676 s, System: 0.237 s]
Range (min … max): 42.837 s … 44.376 s 5 runs

Both Julia and Rust have bounds checking and overflow/underflow checks; C++ does not. I assume Rust and Julia should be basically on par in this test, so what am I missing? Also, there are quite a few GC triggers and allocations in the Julia code. Does it all come from the print statements? That seems like a lot, and I can't see what else might allocate. (I'm timing with hyperfine.)

2 Likes

My understanding is that it is JIT-compiled each time because the compiled code is discarded when a Julia process ends. That's why command-line workflows are avoided, and why BenchmarkTools.jl is used to measure runtime performance in a persistent process. Compiled code can be cached, but that involves finer control via precompilation in packages or PackageCompiler.jl executables.
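For example, in a persistent session (a minimal sketch; the first call pays the compilation cost, and @btime then reports the best of repeated runs):

julia> include("trnbias.jl")

julia> using BenchmarkTools

julia> @btime optimize(0, 1000, 0.01, 1000)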

5 Likes

Welcome to the discourse!

If you are new to Julia, chances are high that you don't yet have a good model of what is and isn't performant in Julia. A good starting point is the performance tips section in the manual.

Looking briefly at your code, I don't see any obvious mistakes though. So I think it might just be the way you time things. As @Benny mentioned, Julia will compile your code every time the session is restarted. The compiler of course allocates quite a bit, so perhaps those are the allocations you mentioned. I suggest profiling your code to see where it spends most of its time and where the allocations occur. This is best done from an interactive session and not via a cmd/script approach. If you use VSCode you can simply type into the REPL

julia> @profview optimize(which, ncases, save_trend, nreps)

to get a flamegraph that tells you where your code spends its time.
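Outside VSCode, the built-in sampling profiler gives a rough text equivalent:

julia> using Profile

julia> @profile optimize(which, ncases, save_trend, nreps)

julia> Profile.print()  # prints a tree of where the samples landed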

3 Likes

I started a Julia session, included your source code, and ran @benchmark optimize(0, 1000, 0.01, 1000) just to see the performance of the function call (as others pointed out, timing the whole Julia process always has some compilation overhead).

I get

julia> include("trnbias.jl")

julia> @benchmark optimize(0, 1000, 0.01, 1000)
...
...
...
Mean IS=0.04735720711353852  OOS=0.009247068550281812  Bias=0.03811013856325671
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 81.855 s (0.69% GC) to evaluate,
 with a memory estimate of 33.89 GiB, over 41301809 allocations.

This was without any optimisation flags. I get the same running time with --optimize=3 -O3. The C++ version with -O3 takes 55 seconds on my machine (M1 Mac).
This method is a fair comparison and yields 82s Julia vs 55s C++.

I think that most of the allocations come from the random number generation, but that’s just a guess. Profiling will help a lot :wink:

I’d recommend using fewer iterations for benchmarking and getting rid of the prints:

julia> include("trnbias.jl")
Usage: julia script.jl <which> <ncases> <trend> <nreps>

julia> @benchmark optimize(0, 100, 0.01, 100)
BenchmarkTools.Trial: 26 samples with 1 evaluation.
 Range (min … max):  192.422 ms … 195.290 ms  ┊ GC (min … max): 2.84% … 2.60%
 Time  (median):     193.281 ms               ┊ GC (median):    3.11%
 Time  (mean ± σ):   193.323 ms ± 461.032 μs  ┊ GC (mean ± σ):  3.08% ± 0.11%

                 ▂  ▅ █
  ▅▁▁▁▁▁▁▁▁▁▁▅▅▁███▅███▁█▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  192 ms           Histogram: frequency by time          195 ms <

 Memory estimate: 434.29 MiB, allocs estimate: 954288.

Reading the code a bit more carefully, I found some allocations in lines 91/92:

short_sum = sum(x[i - ishort + 1:i + 1])
long_sum = sum(x[i - ilong + 1:i + 1])

Slicing an array always makes a copy, but here you could use views instead. To do that, just prepend @views, e.g.:

short_sum = @views sum(x[i - ishort + 1:i + 1])
long_sum = @views sum(x[i - ilong + 1:i + 1])
3 Likes

Benchmarking code that prints so much is pointless in any language. You should first adjust all versions of your code to collect and return the results instead of printing them. You can print the results once at the end, or at least not so often.
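For instance, something of this shape (a hypothetical sketch, not your actual function):

# hypothetical: collect the per-rep numbers, print once at the end
function run_reps(nreps)
    results = Vector{Float64}(undef, nreps)
    for irep in 1:nreps
        results[irep] = irep / nreps  # stand-in for the per-rep performance value
    end
    return results                    # no printing inside the hot loop
end

results = run_reps(1000)
println("mean = ", sum(results) / length(results))  # one print at the very end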

FTR I notice you have a using Distributions in the beginning but don’t use the package, I think. This shouldn’t matter for performance on a recent release of Julia, just saying.

1 Like

By removing some redundant reads of x, I see around a 12% performance boost on my machine:

xi     = x[i]      # hoist the repeated reads into locals
x_next = x[i+1]
if i == ilong
    short_sum = sum(x[j] for j in i - ishort + 1:i + 1)
    long_sum = sum(x[j] for j in i - ilong + 1:i + 1)
else
    short_sum += x_next - x[i - ishort]
    long_sum += x_next - x[i - ilong]
end

short_mean = short_sum / ishort
long_mean = long_sum / ilong

ret = (short_mean > long_mean) ? x_next - xi : 
      (short_mean < long_mean) ? xi - x_next : 0.0
1 Like

Yeah, I tried it without the prints, obviously; I just shared the full code. FYI, the time difference was constant with and without printing, which was somewhat expected, since they most likely build the same buffer and make the same syscall.

Is there a way to save the JIT output and skip the Julia startup/load time, but still use trivial tools like time to measure? That is, can I run the Julia process with some parameters so that it warm-starts from a saved JIT state, and then run time as usual, instead of relying on internal libraries, tools, and annotations?

It is indeed very comparable with Rust when using @btime. Still slower but in the same realm.

The profiler doesn't help at all; it states some pretty obvious facts, no real allocation issues. The @views suggestion only affects code that runs once, in the first iteration, so it's kind of a drop in the bucket. Hoisting xi and x_next doesn't make a big difference on x86; maybe it did on ARM.

Thanks for all the great ideas.

You want to learn about Revise.jl and/or DaemonMode.jl: GitHub - dmolina/DaemonMode.jl: Client-Daemon workflow to run faster scripts in Julia
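With DaemonMode you keep one warm Julia process around and send scripts to it, so plain time mostly measures the run itself (a sketch following the package README):

# start the daemon once; it stays alive and keeps compiled code warm
julia --startup-file=no -e 'using DaemonMode; serve()' &

# each later invocation reuses the warm process and can be timed as usual
time julia --startup-file=no -e 'using DaemonMode; runargs()' trnbias.jl 0 1000 0.01 1000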

2 Likes

I just ran your script, changing:

  1. the last line to @time main()
  2. removing using Distributions (not used)

Then, just did:

time julia trnbias.jl 0 200 0.01 200

I get:

# lots of printing
  5.192750 seconds (7.82 M allocations: 6.508 GiB, 2.97% gc time, 0.09% compilation time)

real    0m6,319s
user    0m6,393s
sys     0m0,327s

Thus: 1) compilation time is not an issue; 2) Julia startup is a minor portion of the time (~1s).

Then I applied the @views fixes indicated above, with:

                    short_sum = sum(@view(x[i - ishort + 1:i + 1]))
                    long_sum = sum(@view(x[i - ilong + 1:i + 1]))
# and here
    for i in (long_term:ncases-2)
        short_mean = sum(@view(x[i - short_term + 1:i + 1])) / short_term
        long_mean = sum(@view(x[i - long_term + 1:i + 1])) / long_term

And I get:

  3.949312 seconds (3.67 k allocations: 289.875 KiB, 0.11% compilation time)

real    0m5,095s
user    0m5,180s
sys     0m0,318s

Thus gaining about 30% in performance for the small test (but not for the largest one, as I will show at the end).

The profiler (run @profview optimize(0, 200, 0.01, 200) in VSCode) says that most of the time is spent on these lines:

                    short_sum += x[i + 1] - x[i - ishort]
                    long_sum += x[i + 1] - x[i - ilong]
#and
 
                ret = (short_mean > long_mean) ? x[i + 1] - x[i] :
                      (short_mean < long_mean) ? x[i] - x[i + 1] : 0.0

where the cost in the first pair is in the getindex calls, and in the second in the > and < comparisons. Apart from adding @inbounds there, which improves performance a little, I don't see anything else that could be done to accelerate the code much, as it is.
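For reference, that change could look like this (only safe because the loop bounds guarantee the indices are in range; otherwise @inbounds is undefined behavior):

                    @inbounds begin
                        short_sum += x[i + 1] - x[i - ishort]
                        long_sum += x[i + 1] - x[i - ilong]
                    end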

Finally, I ran your original code (just removing using Distributions and adding @time main() at the end) with time julia trnbias.jl 0 500 0.01 500, and got:

 59.313881 seconds (20.13 M allocations: 16.728 GiB, 0.43% gc time, 0.01% compilation time)

real    1m0,626s
user    1m0,513s
sys     0m0,477s

Then, with the @views in place:

 58.977891 seconds (3.67 k allocations: 292.172 KiB, 0.01% compilation time)

real    1m0,339s
user    1m0,343s
sys     0m0,397s

My conclusions:

  • Startup and compilation time are not relevant for your tests, if you keep them taking at least a few seconds.
  • The use of views significantly reduces the allocations, but that's not the critical part for performance.
  • The execution seems to be memory-bound, as getindex is taking most of the time (here).

Can you explain in more detail how to compile and run the Rust and C++ codes, so we can test here more easily? Given the tests above, I find it surprising that such a big performance difference is found (I don't think bounds checking justifies the difference between C++ and Rust either; in Julia, disabling it gives a minor speedup).

(ps: I got a lot of errors trying to compile the Rust or C++ versions, so clearly I don’t know what I’m doing there)

3 Likes

Are you running it on OSX or Linux? It uses some OSX-specific headers, I think, but you can try:

clang++ -Wall -O3 -g -lcurses -std=c++11 TrnBias.CPP -o bin/TRNBIAS
./bin/TRNBIAS 0 1000 0.01 1000

For the rust:

cargo init trnbias
cd trnbias
cargo add rand
`copy pastebin into trnbias/src/main.rs`
cargo build --release
time cargo run --release -- 0 1000 0.01 1000

@lmiq thanks for the detailed feedback. Also, any clue why the views give you a 30% boost on the small test but nothing on the large one? Do we get the array in cache in the small case but not in the large one, so that the getindex memory fetch becomes the actual bottleneck and dwarfs the view improvement?

2 Likes

I still got a lot of errors:


% cargo build --release
   Compiling trnbias v0.1.0 (/home/leandro/Downloads/trivial/trnbias)
error[E0252]: the name `SeedableRng` is defined multiple times
 --> src/main.rs:3:21
  |
1 | use rand::SeedableRng;
  |     ----------------- previous import of the trait `SeedableRng` here
2 | use rand_distr::Distribution;
3 | use rand::{RngCore, SeedableRng};
  |                     ^^^^^^^^^^^ `SeedableRng` reimported here
  |
  = note: `SeedableRng` must be defined only once in the type namespace of this module

error[E0432]: unresolved import `rand_distr`
 --> src/main.rs:2:5
  |
2 | use rand_distr::Distribution;
  |     ^^^^^^^^^^ use of undeclared crate or module `rand_distr`

error[E0433]: failed to resolve: use of undeclared type `SystemTime`
  --> src/main.rs:28:29
   |
28 |             .duration_since(SystemTime::UNIX_EPOCH)
   |                             ^^^^^^^^^^ use of undeclared type `SystemTime`

error[E0412]: cannot find type `Wrapping` in this scope
  --> src/main.rs:13:8
   |
13 |     i: Wrapping<u8>,
   |        ^^^^^^^^ not found in this scope
   |
help: consider importing one of these items
   |
1  | use core::num::Wrapping;
   |
1  | use std::num::Wrapping;
   |

error[E0433]: failed to resolve: use of undeclared type `SystemTime`
  --> src/main.rs:27:36
   |
27 |         let duration_since_epoch = SystemTime::now()
   |                                    ^^^^^^^^^^ not found in this scope
   |
help: consider importing this struct
   |
1  | use std::time::SystemTime;
   |

error[E0425]: cannot find function, tuple struct or tuple variant `Wrapping` in this scope
  --> src/main.rs:45:25
   |
45 |             let mut j = Wrapping(self.mwc256_seed as u32);
   |                         ^^^^^^^^ not found in this scope
   |
help: consider importing one of these items
   |
1  | use core::num::Wrapping;
   |
1  | use std::num::Wrapping;
   |

error[E0425]: cannot find function, tuple struct or tuple variant `Wrapping` in this scope
  --> src/main.rs:46:22
   |
46 |             let c1 = Wrapping(69069);
   |                      ^^^^^^^^ not found in this scope
   |
help: consider importing one of these items
   |
1  | use core::num::Wrapping;
   |
1  | use std::num::Wrapping;
   |

error[E0425]: cannot find function, tuple struct or tuple variant `Wrapping` in this scope
  --> src/main.rs:47:22
   |
47 |             let c2 = Wrapping(12345);
   |                      ^^^^^^^^ not found in this scope
   |
help: consider importing one of these items
   |
1  | use core::num::Wrapping;
   |
1  | use std::num::Wrapping;
   |

error[E0425]: cannot find function, tuple struct or tuple variant `Wrapping` in this scope
  --> src/main.rs:85:16
   |
85 |             i: Wrapping(255u8),
   |                ^^^^^^^^ not found in this scope
   |
help: consider importing one of these items
   |
1  | use core::num::Wrapping;
   |
1  | use std::num::Wrapping;
   |

warning: unused import: `SeedableRng`
 --> src/main.rs:3:21
  |
3 | use rand::{RngCore, SeedableRng};
  |                     ^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

Some errors have detailed explanations: E0252, E0412, E0425, E0432, E0433.
For more information about an error, try `rustc --explain E0252`.
warning: `trnbias` (bin "trnbias") generated 1 warning
error: could not compile `trnbias` due to 9 previous errors; 1 warning emitted

Sorry for the delay, my laptop exploded. So now, running on a Linux machine, the Rust and Julia versions are roughly identical, within a small margin. I think I have a bug in the Julia version though, because the results with the seed of 33 don't match the Rust results, which are correct.

I fixed and built the package. Here is main.rs:

and here is Cargo.toml

[package]
name = "trnbias"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
rand = "0.8.5"
rand_distr = "0.4.3"

You should have the following folders:

trnbias
-- Cargo.toml
-- src
-- -- main.rs

Let me know if it works for you, and whether the perf is identical for you or not.

I get these results:

with Rust:

Mean IS=0.0444  OOS=-0.0014  Bias=0.0458

real	0m28,859s
user	0m28,851s
sys	0m0,008s

with Julia:

Mean IS=0.04735720711353852  OOS=0.009247068550281812  Bias=0.03811013856325671
 38.662043 seconds (3.67 k allocations: 294.984 KiB, 0.01% compilation time)

real	0m39,274s
user	0m39,476s
sys	0m0,940s

I don’t know if you consider those means correct, but clearly the codes are not doing exactly the same thing.

This is the Julia code I'm running (with some small edits):

Code
using Random
 
struct ParamsResult
    short_term::Int
    long_term::Int
    performance::Float64
end
 
@kwdef mutable struct MarsagliaRng
    q::Vector{UInt32} = Vector{UInt32}(undef, 256)
    carry::UInt32 = zero(UInt32)
    mwc256_initialized::Bool = false
    mwc256_seed::Int32 = zero(Int32)
    i::UInt8 = zero(UInt8)
end
 
function seed_mwc256!(rng::MarsagliaRng, seed::UInt32)
    #rng.q = Vector{UInt32}(undef, 256) # reallocates the rng.q array
    rng.carry = zero(UInt32)
    rng.mwc256_initialized = false
    rng.mwc256_seed = Int32(seed)
    rng.i = zero(UInt8)
    j = UInt32(seed)
    c1 = UInt32(69069)
    c2 = UInt32(12345)
    for k in 1:256
        j = (c1 * j + c2) & 0xFFFFFFFF  # Ensure the result fits within a UInt32
        rng.q[k] = j
    end
    return rng
end
 
function next_u32(rng::MarsagliaRng)
    a = UInt64(809430660)
    if !rng.mwc256_initialized
        j = UInt32(rng.mwc256_seed)
        c1 = UInt32(69069)
        c2 = UInt32(12345)
        rng.mwc256_initialized = true
        for k in 1:256
            j = (c1 * j + c2) & 0xFFFFFFFF  # Ensure the result fits within a UInt32
            rng.q[k] = j
        end
    end
    rng.i = UInt8(rng.i % 256) 
    t = a * UInt64(rng.q[rng.i+1]) + UInt64(rng.carry)
    rng.carry = UInt32(t >> 32)
    rng.q[rng.i+1] = UInt32(t & 0xFFFFFFFF)
    return rng.q[rng.i+1]
end
 
function seed!(rng::MarsagliaRng, seed::UInt32)
    seed_mwc256!(rng, seed)
    return rng
end
 
 
function Random.rand(rng::MarsagliaRng, ::Type{Float64})
    #mult = 1.0 / 0xFFFFFFFF
    mult = inv(0xFFFFFFFF)
    return mult * next_u32(rng)
end
 
@views function opt_params(which::Int, ncases::Int, x::Vector{Float64})
    best_perf = -Inf
    ibestshort = 0
    ibestlong = 0
 
    for ilong in 2:199
        for ishort in 1:(ilong - 1)
            total_return = 0.0
            win_sum = 1e-60
            lose_sum = 1e-60
            sum_squares = 1e-60
            short_sum = 0.0
            long_sum = 0.0
 
            for i in (ilong:ncases-2)
                if i == ilong
                    short_sum = sum(x[i - ishort + 1:i + 1])
                    long_sum = sum(x[i - ilong + 1:i + 1])
                else
                    short_sum += x[i + 1] - x[i - ishort]
                    long_sum += x[i + 1] - x[i - ilong]
                end
 
                short_mean = short_sum / ishort
                long_mean = long_sum / ilong
 
                ret = (short_mean > long_mean) ? x[i + 1] - x[i] :
                      (short_mean < long_mean) ? x[i] - x[i + 1] : 0.0
 
                total_return += ret
                sum_squares += ret^2
                if ret > 0.0
                    win_sum += ret
                else
                    lose_sum -= ret
                end
            end
 
            if which == 0
                total_return /= (ncases - ilong)
                if total_return > best_perf
                    best_perf = total_return
                    ibestshort = ishort
                    ibestlong = ilong
                end
            elseif which == 1
                pf = win_sum / lose_sum
                if pf > best_perf
                    best_perf = pf
                    ibestshort = ishort
                    ibestlong = ilong
                end
            elseif which == 2
                total_return /= (ncases - ilong)
                sum_squares /= (ncases - ilong)
                sum_squares -= total_return^2
                sr = total_return / (sqrt(sum_squares) + 1e-8)
                if sr > best_perf
                    best_perf = sr
                    ibestshort = ishort
                    ibestlong = ilong
                end
            end
        end
    end
 
    ParamsResult(ibestshort, ibestlong, best_perf)
end
 
@views function test_system(ncases::Int, x::Vector{Float64}, short_term::Int, long_term::Int)
    sum1 = 0.0
 
    for i in (long_term:ncases-2)
        short_mean = sum(x[i - short_term + 1:i + 1]) / short_term
        long_mean = sum(x[i - long_term + 1:i + 1]) / long_term
 
        if short_mean > long_mean
            sum1 += x[i + 1] - x[i]
        elseif short_mean < long_mean
            sum1 -= x[i + 1] - x[i]
        end
    end
 
    sum1 / (ncases - long_term)
end
 
function main()
    if length(ARGS) < 4
        println("Usage: julia script.jl <which> <ncases> <trend> <nreps>")
        return
    end
 
    println("sikeinanan")
 
    which = parse(Int, ARGS[1])
    ncases = parse(Int, ARGS[2])
    save_trend = parse(Float64, ARGS[3])
    nreps = parse(Int, ARGS[4])
    
    optimize(which, ncases, save_trend, nreps)
end
 
function optimize(which::Int, ncases::Int, save_trend::Float64, nreps::Int)
    x = zeros(Float64, ncases)
 
    is_mean = 0.0
    oos_mean = 0.0
 
    rng  = MarsagliaRng()
    seed_mwc256!(rng, UInt32(33))
 
    for irep in 1:nreps
        # Generate in-sample
        trend = save_trend
        x[1] = 0.0
        for i in 2:ncases
            if i % 50 == 0
                trend = -trend
            end
            x[i] = x[i - 1] + trend +rand(rng, Float64) +rand(rng, Float64) -rand(rng, Float64) -rand(rng, Float64)
        end
 
        params_result = opt_params(which, ncases, x)
 
        # Generate out-of-sample
        trend = save_trend
        x[1] = 0.0
        for i in 2:ncases
            if i % 50 == 0
                trend = -trend
            end
            x[i] = x[i - 1] + trend +rand(rng, Float64) +rand(rng, Float64) -rand(rng, Float64) -rand(rng, Float64)
        end
 
        oos_perf = test_system(ncases, x, params_result.short_term, params_result.long_term)
 
        is_mean += params_result.performance
        oos_mean += oos_perf
 
        #println("$irep: $(params_result.short_term) $(params_result.long_term)  $(params_result.performance) $oos_perf ($(params_result.performance - oos_perf))")
    end
 
    is_mean /= nreps
    oos_mean /= nreps
 
    println("Mean IS=$is_mean  OOS=$oos_mean  Bias=$(is_mean - oos_mean)")
    
end
 
@time main()

With @inbounds in the critical lines I’ve mentioned above, I get:

Mean IS=0.04735720711353852  OOS=0.009247068550281812  Bias=0.03811013856325671
 32.964694 seconds (3.67 k allocations: 294.984 KiB, 0.01% compilation time)

real	0m33,593s
user	0m33,860s
sys	0m0,876s

The Julia code is wrong; it can also be simplified to just use Julia's RNG, which is what I did in trnbias2.jl.

$ julia -O3 trnbias.jl 0 200 0.01 200
sikeinanan
Mean IS=0.33951025199478707  OOS=0.00715903341201321  Bias=0.33235121858277383
  1.911280 seconds (29 allocations: 4.680 KiB)

vs. a tiny bit faster, seemingly a real effect (despite @time showing otherwise):

$ julia -O3 trnbias2.jl 0 200 0.01 200
sikeinanan
Mean IS=0.3463935914030005  OOS=0.004606007118324256  Bias=0.34178758428467626
  1.915384 seconds (27 allocations: 3.586 KiB)

showing you’re missing some important @inbounds:

$ julia --check-bounds=no -O3 trnbias2.jl 0 200 0.01 200
sikeinanan
Mean IS=0.3406308594068922  OOS=0.03475379486225595  Bias=0.30587706454463626
  1.609972 seconds (27 allocations: 3.586 KiB)

$ hyperfine 'julia -O3 trnbias.jl 0 200 0.01 200'
Benchmark 1: julia -O3 trnbias.jl 0 200 0.01 200
  Time (mean ± σ):      2.652 s ±  0.049 s    [User: 3.052 s, System: 0.492 s]
  Range (min … max):    2.585 s …  2.745 s    10 runs

$ hyperfine 'julia -O3 trnbias2.jl 0 200 0.01 200'
Benchmark 1: julia -O3 trnbias2.jl 0 200 0.01 200
  Time (mean ± σ):      2.597 s ±  0.029 s    [User: 2.967 s, System: 0.522 s]
  Range (min … max):    2.550 s …  2.632 s    10 runs

The min is the meaningful number there; the mean isn't what's important.

No, Julia doesn't do overflow checks by default, but you can opt into them. It does bounds-check, yes, and that's at least potentially one of your problems.
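For instance, via the checked arithmetic functions in Base (packages like SaferIntegers.jl are another route):

julia> Base.checked_add(0xff, 0x01)  # throws OverflowError instead of silently wrapping to 0x00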

In Rust:
self.i += 1;

that is seemingly a wrapping 8-bit integer type, while in Julia:

rng.i = UInt8(rng.i % 256)

there is no increment (that alone should have made it faster…; I think that line is a no-op). I do see a reference to rng.i plus 1, but the incremented value is never stored.

    rng.q[rng.i+1] = UInt32(t & 0xFFFFFFFF)
    return rng.q[rng.i+1]

those lines are bounds-checked unless you add @inbounds, and Julia needs to add 1, since Julia arrays are 1-based, unlike Rust's, which are 0-based. You could have avoided that with a 0-based array, or simply a different RNG.
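A sketch of what the wrapping increment could look like in the Julia port (UInt8 arithmetic in Julia already wraps modulo 256, matching Rust's Wrapping<u8>):

    rng.i += 0x01  # wraps 0xff -> 0x00 automatically, like self.i += 1 in the Rust version
    t = UInt64(809430660) * UInt64(rng.q[rng.i + 1]) + UInt64(rng.carry)

A 0-based array (e.g. via OffsetArrays.jl) would remove the + 1 on the index.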

2 Likes

From the Rust book

When you’re compiling in debug mode, Rust includes checks for integer overflow […] When you’re compiling in release mode with the --release flag, Rust does not include checks for integer overflow that cause panics. Instead, if overflow occurs, Rust performs two’s complement wrapping.
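Julia's native integer arithmetic behaves the same way by default, so in release mode the two languages should agree; a quick check:

julia> 0xff + 0x01  # UInt8 wraps around, like Rust's release-mode two's complement wrapping
0x00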

1 Like

Can you please share your code? I need to use the provided RNG, but I couldn't see the issue you mentioned. i does get initialized and then changed, following the Rust code; where is the discrepancy? Is that the only issue you found? Thanks! @Palli

Note: I didn't locate where you need to add @inbounds in the original, nor in my changed code, but here's what I changed to get it:

$ diff trnbias2.jl trnbias.jl
1c1
< #using Random
---
> using Random
9,10c9,58
< #function Random.rand(rng::MarsagliaRng, ::Type{Float64})
< function rand2()
---
> @kwdef mutable struct MarsagliaRng

[dropped everything related to it ..]

> function Random.rand(rng::MarsagliaRng, ::Type{Float64})
13c61
<     return mult * rand(UInt32)
---
>     return mult * next_u32(rng)
124,125c172,173
< #    rng  = MarsagliaRng()
< #    seed_mwc256!(rng, UInt32(33))
---
>     rng  = MarsagliaRng()
>     seed_mwc256!(rng, UInt32(33))
135c183
<             x[i] = x[i - 1] + trend +rand2() +rand2() -rand2() -rand2()
---
>             x[i] = x[i - 1] + trend +rand(rng, Float64) +rand(rng, Float64) -rand(rng, Float64) -rand(rng, Float64)
147c195
<             x[i] = x[i - 1] + trend +rand2() +rand2() -rand2() -rand2()
---
>             x[i] = x[i - 1] + trend +rand(rng, Float64) +rand(rng, Float64) -rand(rng, Float64) -rand(rng, Float64)

I can't see the definition of rand2, do you mind pasting your full code in a pastebin or something? Thanks! Do you get results identical to Rust with the fix?