Trivial port from C++/Rust to Julia

Sorry, the diff was maybe hard to read (since I dropped irrelevant parts of it); it could still have worked with the patch command, I think:

function rand2()
    #mult = 1.0 / 0xFFFFFFFF
    mult = inv(0xFFFFFFFF)
    return mult * rand(UInt32)
end

You would never get bit-identical results with Rust using my changed code, even with an identical seed: the code you have there uses the Marsaglia RNG, and I rewrote it to use Julia’s RNG, which produces a different stream. But that’s OK; you hardly ever rely on bit-identical output, since the purpose is random numbers:

julia> seed_mwc256!(rng, UInt32(33))
MarsagliaRng(UInt32[0x0022f7a6, 0xda39ee27, 0x9fa2e074, 0xfb47d11d, 0xaf3e1d72, 0x99ae8e83, 0x7c87f620, 0x967fe7d9, 0xeb0bddfe, 0xb0c4da9f  …  0xb9dd9ef0, 0xb67aa669, 0x0113c74e, 0xa543bfaf, 0x87ab925c, 0xed2f0fe5, 0x62638b9a, 0x6367cc8b, 0xac0d0a88, 0x7a828721], 0x00000000, false, 33, 0x00)

julia> next_u32(rng)
0xae780598

julia> next_u32(rng)
0x65ecc967

As you see, the code generates something, so maybe I was wrong about the code being wrong; I thought I saw a bug and that you would always get the same number.

Julia’s RNG wasn’t faster, nor was the other (much) faster; you could go either way. The speedup came from less bounds-checking. I didn’t locate exactly where to selectively turn it off, so I just did it globally.
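For reference, a minimal sketch of the two options, assuming the global switch was the --check-bounds=no startup flag (the function below is illustrative, not from the ported code):

# Global: disable bounds checks for the whole program
#   julia --check-bounds=no script.jl

# Selective: annotate only the hot loop
function dot_inbounds(a::Vector{Float64}, b::Vector{Float64})
    s = 0.0
    @inbounds for i in eachindex(a, b)  # eachindex also checks the lengths match
        s += a[i] * b[i]
    end
    return s
end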

Yeah, there are more bugs I’m fixing one by one; the array indexing and how the original C++ worked were very tough to port. Also the Julia debugger is ultra slow :frowning: not sure why. The issues are all here in test_system and in this loop:

            for i in (ilong:ncases-2)
                if i == ilong 
                    short_sum = sum(x[(i-ishort+1):i])
                    long_sum = sum(x[(i-ilong+1):i])
                else
                    short_sum += x[i] - x[i-ishort]
                    long_sum += x[i] - x[i-ilong]
                end

                short_mean = short_sum / ishort
                long_mean = long_sum / ilong

                ret = (short_mean > long_mean) ? x[i+1] - x[i] :
                      (short_mean < long_mean) ? x[i] - x[i+1] : 0.0

                total_return += ret
                sum_squares += ret^2
                if ret > 0.0
                    win_sum += ret
                else
                    lose_sum -= ret
                end
            end

Are you manually porting the indexes? It would probably be better to use OffsetArrays.jl, I think. Plus, you are assuming the arrays start at 1.
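For illustration, a tiny sketch of the OffsetArrays idea (not from the ported code):

using OffsetArrays

x = OffsetArray(collect(1.0:10.0), 0:9)  # indices run 0:9, as in C++
x[0]  # 1.0, the element the C++ code calls x[0]
x[9]  # 10.0, the last element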

2 Likes

Another option is to use the arr[begin + i] syntax, where i is zero-based. So it corresponds to arr[i] in C++.
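For example (illustrative values):

a = [10.0, 20.0, 30.0]
i = 2             # a zero-based index, as in C++
a[begin + i]      # 30.0, i.e. what C++ would call a[2]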

3 Likes

I fixed all the indexes and finally we get:

Rust:
Time (mean ± σ): 50.810 s ± 6.286 s [User: 50.801 s, System: 0.005 s]

Julia:
Single result which took 53.725 s (2.13% GC) to evaluate,

I am assuming that without the printing and string allocations Julia might actually be faster. I still have no idea how fixing the indexes helped, as they were off-by-1 errors and such, but that closes the thread. Here’s the fixed code if anyone cares:

4 Likes

Don't forget the @view() where slices of x are created:

sum(@view(x[(i-ishort+2):i+1]))

(in 4 places)

It makes quite a difference for allocations (not so for time):

without:

Mean IS=0.0444250155324138  OOS=-0.0014158874853206235  Bias=0.04584090301773442
254.429745 seconds (41.79 M allocations: 33.820 GiB, 0.11% gc time, 0.14% compilation time)

real	4m15,677s
user	4m15,576s
sys	0m0,389s

with views:

Mean IS=0.0444250155324138  OOS=-0.0014158874853206235  Bias=0.04584090301773442
257.709245 seconds (545.44 k allocations: 38.141 MiB, 0.00% gc time, 0.14% compilation time)

real	4m18,967s
user	4m19,037s
sys	0m0,337s

And, if I may suggest, you can write the struct like this, which I think is cleaner:

@kwdef mutable struct MarsagliaRng
    q::Vector{UInt32} = zeros(UInt32, 256)
    carry::UInt32 = 362436
    mwc256_initialized::Bool = false
    mwc256_seed::Int32 = Base.rand(Int32)
    i::UInt8 = 255
end
MarsagliaRng(seed::Vector{UInt8}) = MarsagliaRng(mwc256_seed = first(reinterpret(Int32, seed)))
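Usage would then look like this (a sketch based on the struct above):

rng1 = MarsagliaRng()                         # all defaults
rng2 = MarsagliaRng(mwc256_seed = Int32(33))  # override one field by keyword
rng3 = MarsagliaRng(UInt8[33, 0, 0, 0])       # via the outer constructor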

I created a github repo including the Julia, Rust, and C++ versions.

I included the optimizations and corrections to the Julia code mentioned above as well as maybe one more. Also included a small number of stylistic changes.

I did not optimize the C++ or Rust further.

As it stands the Julia is significantly faster than the Rust and C++.

It would be interesting to know how much work it would take to make the Rust and C++ perform as well as the Julia code. The Julia code is fairly simple, and the optimizations are fairly standard.

For me the Rust version is slightly faster. I have not added @inbounds anywhere (I was under the impression that the Rust version checks bounds). Also, for me, running Julia with the default flags (that is, not with -O3) turns out to be better.

Also, I must note that the example takes ~50s on my machine (vs. apparently 7s on yours), so clearly we are in different worlds :slight_smile: .

I benchmarked with 200 trials rather than 1000; for comparison, this is noted in the READMEs in the repo I linked.

Yes, the Rust code does do bounds checking. There are ways around this, but I did not try them.

If I recall correctly there was some positive interaction between @view and @inbounds.

@feribg

Are we here under the “benchmark-game language shootout” rules? I.e. write whatever you believe you can justify as “idiomatic” but don’t change the algorithm?

Or are we under “solve this problem as fast as possible” rules where algorithmic improvements are admissible and we don’t have to haggle whether they change the alg?

If we want to keep the general algorithm, then:

  1. Make which a Val.
  2. Precompute cumsum, so that short_sum / long_sum have no loop-carried dependencies (see the sketch after this list). Also precompute sum_squares.
  3. Make the computation of ret branch-free; probably precompute x[i+2] - x[i+1].
  4. SIMD the innermost loop.
  5. Multithread the outermost loop.
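A minimal sketch of the prefix-sum idea in point 2 (illustrative names, not the final code): with xs = cumsum(x), every window sum becomes a difference of two prefix sums, so consecutive iterations no longer depend on each other.

x = randn(1000)
xs = cumsum(x)

# sum(x[lo:hi]) expressed as a difference of prefix sums
window_sum(xs, lo, hi) = xs[hi] - (lo > 1 ? xs[lo-1] : 0.0)

ishort, i = 5, 100
window_sum(xs, i - ishort + 1, i) ≈ sum(x[i-ishort+1:i])  # true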
1 Like

That takes 50s on my laptop :smile:

Initial timing:

julia> include("trnbias-orig.jl"); @time optimize(0, 1000, 0.01, 200); @time optimize(0, 1000, 0.01, 200);
Usage: julia script.jl <which> <ncases> <trend> <nreps>
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.0031496723033457423  Bias=0.047775315593823586
  6.505962 seconds (40 allocations: 11.656 KiB)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.0031496723033457423  Bias=0.047775315593823586
  6.507056 seconds (40 allocations: 11.656 KiB)

Make which a Val:

julia> include("trnbias.jl"); @time optimize(0, 1000, 0.01, 200); @time optimize(0, 1000, 0.01, 200);
Usage: julia script.jl <which> <ncases> <trend> <nreps>
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.0031496723033457423  Bias=0.047775315593823586
  5.198919 seconds (95.50 k allocations: 6.434 MiB, 0.46% compilation time)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.0031496723033457423  Bias=0.047775315593823586
  5.182652 seconds (441 allocations: 21.328 KiB)

Precompute cumsum, sprinkle @simd @fastmath, make it multithreaded (but run only on a single thread):

julia> include("trnbias.jl"); @time optimize(0, 1000, 0.01, 200); @time optimize(0, 1000, 0.01, 200);
Usage: julia script.jl <which> <ncases> <trend> <nreps>
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047785  OOS=-0.0031496723033457423  Bias=0.04777531559382359
  2.953527 seconds (125.84 k allocations: 12.977 MiB, 1.91% compilation time)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047785  OOS=-0.0031496723033457423  Bias=0.04777531559382359
  2.896009 seconds (3.84 k allocations: 4.840 MiB)

Run that code with julia -t auto:

julia> include("trnbias.jl"); @time optimize(0, 1000, 0.01, 200); @time optimize(0, 1000, 0.01, 200);
Usage: julia script.jl <which> <ncases> <trend> <nreps>
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047785  OOS=-0.0031496723033457423  Bias=0.04777531559382359
  0.452778 seconds (247.05 k allocations: 20.896 MiB, 257.25% compilation time)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047785  OOS=-0.0031496723033457423  Bias=0.04777531559382359
  0.369494 seconds (3.86 k allocations: 4.841 MiB)

I started from https://github.com/jlapeyre/DiscourseQuestion24Nov2023/blob/main/julia_trnbias/trnbias.jl and tried to make only minimal changes.

Note that I still use your weird RNG choice for reproducibility / same seeds. The tiny differences in the results are mostly because floating-point addition is not associative and @fastmath gives the compiler latitude to play with that.
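The classic demonstration of that non-associativity:

julia> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
false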

Code below the fold.
using Random
using Printf

struct ParamsResult
    short_term::Int
    long_term::Int
    performance::Float64
end

mutable struct MarsagliaRng
    q::Vector{UInt32}
    carry::UInt32
    mwc256_initialized::Bool
    mwc256_seed::Int32
    i::UInt8

    function MarsagliaRng(seed::Vector{UInt8})
        q = zeros(UInt32, 256)
        carry = 362436
        mwc256_initialized = false
        mwc256_seed = reinterpret(Int32, seed)[1]
        i = 255
        new(q, carry, mwc256_initialized, mwc256_seed, i)
    end
end

# Default constructor, seeded from the low 32 bits of the system clock
function MarsagliaRng()
    ts = time_ns() % UInt32
    seed = UInt8[ts % UInt8, (ts >>> 8) % UInt8, (ts >>> 16) % UInt8, (ts >>> 24) % UInt8]
    MarsagliaRng(seed)
end

# Random number generator functions
function next_u32(rng::MarsagliaRng)::UInt32
    a = UInt64(809430660)

    if !rng.mwc256_initialized
        j = UInt32(rng.mwc256_seed)
        c1 = UInt32(69069)
        c2 = UInt32(12345)
        rng.mwc256_initialized = true
        for k in 1:256
            j = (c1 * j + c2) % UInt32
            rng.q[k] = j
        end
    end

    rng.i = (rng.i + 1) % 256
    t = a * UInt64(rng.q[rng.i+1]) + UInt64(rng.carry)
    rng.carry = (t >> 32)
    rng.q[rng.i+1] = t % UInt32

    return rng.q[rng.i+1]
end

function next_u64(rng::MarsagliaRng)::UInt64
    UInt64(next_u32(rng))
end

# EDIT: extend Base.rand rather than shadowing it.
# Sample function for generating a Float64
function Base.rand(rng::MarsagliaRng)::Float64
    mult = 1.0 / typemax(UInt32)
    mult * next_u32(rng)
end


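# Search all (ishort, ilong) moving-average pairs and return the best pair
# according to the criterion selected by which (0: mean return,
# 1: profit factor, 2: Sharpe-like ratio).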
function opt_params(w::Val{which}, ncases::Integer, x::Vector{Float64}) where which
    FloatT = eltype(x)
    xs = cumsum(x)
    best_perf = typemin(FloatT)
    ibestshort = 0
    ibestlong = 0

    small_float = 1e-60
    @inbounds for ilong in 2:199
        for ishort in 1:(ilong-1)
            total_return = zero(FloatT)
            win_sum = small_float
            lose_sum = small_float
            sum_squares = small_float
            short_sum = zero(FloatT)
            long_sum = zero(FloatT)


            let i = ilong - 1
                short_sum = xs[i+1] - xs[i-ishort + 2 - 1] 
                long_sum = xs[i+1]

                short_mean = short_sum / ishort
                long_mean = long_sum / ilong

                ret = (short_mean > long_mean) ? x[i+2] - x[i+1] :
                      (short_mean < long_mean) ? x[i+1] - x[i+2] : zero(FloatT)

                total_return += ret
                sum_squares += ret^2
                if ret > zero(FloatT)
                    win_sum += ret
                else
                    lose_sum -= ret
                end
            end

            # General case; i != ilong - 1
            @fastmath @simd for i in (ilong:ncases-2)
                x1 = x[i+1]
                # short_sum += x1 - x[i-ishort+1]
                # long_sum += x1 - x[i-ilong+1]

                short_sum = xs[i+1] - xs[i-ishort+1]
                long_sum = xs[i+1] - xs[i-ilong+1]

                short_mean = short_sum / ishort
                long_mean = long_sum / ilong

                ret = (short_mean > long_mean) ? x[i+2] - x1 :
                      (short_mean < long_mean) ? x1 - x[i+2] : zero(FloatT)

                total_return += ret
                sum_squares += ret^2
                if ret > zero(FloatT)
                    win_sum += ret
                else
                    lose_sum -= ret
                end
            end

            if which == 0
                total_return /= (ncases - ilong)
                if total_return > best_perf
                    best_perf = total_return
                    ibestshort = ishort
                    ibestlong = ilong
                end
            elseif which == 1
                pf = win_sum / lose_sum
                if pf > best_perf
                    best_perf = pf
                    ibestshort = ishort
                    ibestlong = ilong
                end
            elseif which == 2
                total_return /= (ncases - ilong)
                sum_squares /= (ncases - ilong)
                sum_squares -= total_return^2
                sr = total_return / (sqrt(sum_squares) + 1e-8)
                if sr > best_perf
                    best_perf = sr
                    ibestshort = ishort
                    ibestlong = ilong
                end
            end
        end
    end 

    ParamsResult(ibestshort, ibestlong, best_perf)
end

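# Evaluate the chosen short/long crossover rule on a fresh series and
# return its mean per-step return.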
function test_system(ncases::Integer, x::Vector{Float64}, short_term::Integer, long_term::Integer)
    FloatT = eltype(x)
    sum1 = zero(FloatT)

    for i in (long_term-1:ncases-2)
        short_mean = sum(@view x[i-short_term+2:i+1]) / short_term
        long_mean = sum(@view x[i-long_term+2:i+1]) / long_term

        if short_mean > long_mean
            sum1 += x[i+2] - x[i+1]
        elseif short_mean < long_mean
            sum1 -= x[i+2] - x[i+1]
        end

    end
    sum1 / (ncases - long_term)
end

function main()
    if length(ARGS) < 4
        println("Usage: julia script.jl <which> <ncases> <trend> <nreps>")
        return
    end

    which = parse(Int, ARGS[1])
    ncases = parse(Int, ARGS[2])
    save_trend = parse(Float64, ARGS[3])
    nreps = parse(Int, ARGS[4])

    optimize(which, ncases, save_trend, nreps)
end

function optimize(which::Integer, ncases::Integer, save_trend::Float64, nreps::Integer, rng=MarsagliaRng(UInt8.([33, 0, 0, 0])))
    @show typeof(rng)
    _optimize(rng, which, ncases, save_trend, nreps)
end

function _optimize(rng, which::Integer, ncases::Integer, save_trend::Float64, nreps::Integer)
    print_results = false

    FloatT = typeof(save_trend)
    x = zeros(FloatT, ncases)

    is_mean = zero(FloatT)
    oos_mean = zero(FloatT)
    futures = Task[]

    @inbounds for irep in 1:nreps
        # Generate in-sample
        trend = save_trend
        x[1] = zero(FloatT)
        for i in 2:ncases
            if (i - 1) % 50 == 0
                trend = -trend
            end
            x[i] = x[i-1] + trend + rand(rng) + rand(rng) - rand(rng) - rand(rng)
        end
        x_optim = copy(x)

        # Generate out-of-sample
        trend = save_trend
        x[1] = zero(FloatT)
        for i in 2:ncases
            if (i - 1) % 50 == 0
                trend = -trend
            end
            x[i] = x[i-1] + trend + rand(rng) + rand(rng) - rand(rng) - rand(rng)
        end
        x_oos = copy(x)
        future = Threads.@spawn begin
        params_result = opt_params(Val(which), ncases, x_optim)

        oos_perf = test_system(ncases, x_oos, params_result.short_term, params_result.long_term)
        (params_result, oos_perf)
        end
        push!(futures, future)


    end
    for (irep, future) in enumerate(futures)
        params_result::ParamsResult, oos_perf::Float64 = fetch(future)
        is_mean += params_result.performance
        oos_mean += oos_perf
        if print_results
            @printf("%3d: %3d %3d  %8.4f %8.4f (%8.4f)\n",
                    irep, params_result.short_term, params_result.long_term, params_result.performance, oos_perf, params_result.performance - oos_perf)
        end
    end

    is_mean /= nreps
    oos_mean /= nreps

    println("Mean IS=$is_mean  OOS=$oos_mean  Bias=$(is_mean - oos_mean)")

end

main()

8 Likes

When I see so much talent and creativity jump at the chance of making a Julia program better, I am in awe. Every time… It is a privilege to be part of this community.

3 Likes

I’m surprised. I tried making which a global const earlier and it did not change the performance.

Nor for me:

julia> @time optimize(Val(0), 1000, 0.01, 200)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.0031496723033457423  Bias=0.047775315593823586
  6.714259 seconds (40 allocations: 11.656 KiB)

julia> @time optimize(Val(0), 1000, 0.01, 200)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.0031496723033457423  Bias=0.047775315593823586
  6.611550 seconds (40 allocations: 11.656 KiB)

julia> @time optimize(0, 1000, 0.01, 200)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.0031496723033457423  Bias=0.047775315593823586
  6.578942 seconds (40 allocations: 11.656 KiB)

julia> @time optimize(0, 1000, 0.01, 200)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.0031496723033457423  Bias=0.047775315593823586
  6.871638 seconds (40 allocations: 11.656 KiB)

@fastmath didn’t help for me, either.
I also tried -Ofast on the C++ code, and it made it slower.

@Elrod can you reproduce the good timings with my code?

The important thing about Val is the following:

function opt_params(...)
    ...
    for i in (ilong:ncases-2)
        ...
        if ret > zero(FloatT)
            win_sum += ret
        else
            lose_sum -= ret
        end
    end
    if which == 0
        total_return /= (ncases - ilong)
        if total_return > best_perf
            best_perf = total_return
            ibestshort = ishort
            ibestlong = ilong
        end
    elseif ...
end

If which is zero, then win_sum, lose_sum, sum_squares are all unused.

Hence, it is very profitable to specialize the function body on which – if it is known zero at compile time, then all the unused computations can be removed by the optimizer.

In C++ you do this by making it a template parameter instead of an integer; Rust has a similar approach. In C, you do this via #define. In Julia you do this via a Val parameter.
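For instance, a minimal self-contained sketch (hypothetical function, not from the thread):

function crit(::Val{which}, x) where {which}
    acc = 0.0
    extra = 0.0
    for v in x
        acc += v
        extra += v^2   # dead when which == 0, so the optimizer can delete it
    end
    return which == 0 ? acc : extra
end

crit(Val(0), rand(100))  # compiles a specialized method without the v^2 work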

An alternative is to annotate with @inline and have code like

if which == 0
    opt_params(0, ...)
elseif which == 1
    opt_params(1, ...)
else
    opt_params(2, ...)
end

and rely on constant propagation.

My day job involves Java; there is no good alternative to inlining + const-prop there, but it is quite unreliable and requires different coding patterns depending on the JVM config (Graal or C2? What config flags? Are you kidding me, Sun/Oracle? Is it so difficult to just fucking expose a runtime-retained annotation and then let the C2/Graal perf teams go ham with it?). There exist ways to force inlining in the JVM, but these are basically un-deployable in a business context. /rant

Oops – I made a mistake.
I actually get about 5.3s

julia> @time optimize(Val(0), 1000, 0.01, 200)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.003149672303345741  Bias=0.047775315593823586
  5.271919 seconds (39 allocations: 11.297 KiB)

julia> @time optimize(Val(0), 1000, 0.01, 200)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.003149672303345741  Bias=0.047775315593823586
  5.310404 seconds (39 allocations: 11.297 KiB)

@fastmath gets me down to around 5.2s.

I get approximately the same time with clang when using template <int which> instead of passing which as an argument to opt_params.

Yep, that’s mostly equivalent.

The cool thing in Julia is that you can do optimize(0, ...) and then use dynamic dispatch inside to select the proper specialized version. Type instability + dynamic dispatch is not always a sign of “low-quality scripting code”; it is a powerful tool to control code specialization!
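A sketch of that pattern (hypothetical names):

kernel(::Val{which}, x) where {which} = which == 0 ? sum(x) : sum(abs2, x)

# Deliberately type-unstable: Val(which) is a single dynamic dispatch that
# picks the fully specialized kernel once, before the hot work starts.
compute(which::Int, x) = kernel(Val(which), x)

compute(0, [1.0, 2.0, 3.0])  # 6.0
compute(1, [1.0, 2.0, 3.0])  # 14.0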

@fastmath becomes really important once you do the cumsum variant for the sliding window, plus maybe @simd. That brought me down to ~3 seconds. But the code is almost surely not properly vectorizing at 3 seconds.

I am wary of function barriers, as I’ve often seen them introduce 10+ seconds of inference time.

I made minimal modifications and haven’t really looked at the code, so I didn’t make any modifications like that or look at what it’d take to vectorize well.

I did observe that setting -Cnative,-prefer-256-bit so that it would use 512-bit vectors hurt performance, which did imply that some reasonably hot part of the code was getting vectorized.

I can reproduce your timing, but not the same order of importance of changes. For instance:

This version: https://github.com/jlapeyre/DiscourseQuestion24Nov2023/blob/main/julia_trnbias/trnbias.jl

Has which as a Val already, but takes 45 seconds for me, which is roughly the same as the version where which was not a type parameter.

Your version takes 3 seconds.

For me, what is impressive is that just adding @inbounds @simd to this loop:

            # General case; i != ilong - 1
            @inbounds @simd for i in (ilong:ncases-2)
                x1 = x[i+1]

takes time down from 43 to 5 seconds.

(I took your code and removed all the other @inbounds and @fastmath annotations, which makes the code run in 45s. Then, just by adding that one, the time goes down to 5s, and making which a Val does not change the timing much.)

Actually, I’m wondering whether my observation of 40+ seconds is related to that loop not being automatically SIMD-vectorized in my version of Julia (1.10-rc1) vs. whatever version @jlapeyre or you may be using, because that could explain the major difference I was seeing.