Trivial port from C++/Rust to Julia

I wonder if there is something non-deterministic in finding optimizations. With the Val changes (but not @simd, @fastmath, cumsum, etc.) I find about 6–7.5 s consistently after loading the code. (Also tried making a module, but this sometimes runs 20% slower… unless it doesn’t).

@foobar_lv2 that’s amazing. Val makes by far the biggest difference, dropping the time from 20 s → 8 s, and then disabling bounds checks takes it from 8 → 4.5 s.

@time optimize(0, 1000, 0.01, 200)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047784  OOS=-0.0031496723033457423  Bias=0.047775315593823586
  4.362402 seconds (103.07 k allocations: 6.964 MiB, 0.69% compilation time)

I read the performance docs on Val, but they are super cryptic:
https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-value-type

The thing is, the Rust version doesn’t need the template optimization; it produces optimal code with which being a plain int. So I don’t understand how the Julia compiler gets confused or what it sees wrong, especially given the JIT evaluates it and sees it’s a constant value.

Rust is doing the same at around 8 seconds; the compiler doesn’t support bounds-check removal out of the box, but the code could be optimized by replacing all array index accesses with iterators, which I definitely don’t plan on doing.

This was a very good insight. It would be useful if you could explain briefly how Val gets used by the Julia compiler, and what it does differently under the hood to optimize that code, since those branches are never hit anyway and the branch predictor would essentially eliminate them even at the hardware level…

That’s a good link!

I find the manual very clear and non-cryptic on that topic.

Alas, that doesn’t mean that the manual is clear for the intended audience, i.e. people like you who need to learn about that aspect of julia. It rather means that it is clear and non-cryptic only in retrospect, i.e. for people who already know all of that :frowning:

Can you help us figure out how we could have written that manual section such that it would have solved your issue from the get-go? What are the missing examples or words that would have made this click with you?

So there are two points:

especially given the JIT evaluates it and sees it’s a constant value.

We always call julia a Jit. But it really is a just-ahead-of-time compiler, without any of the amazing technologies that make modern jits so cool (e.g. javascript, JVM, .NET, luaJit): We don’t have tiered compilation with profile-guided optimization and guards that de-optimize on failed assumptions. The fact that which is always zero at runtime does nothing for our generated code.
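To make that concrete (a reduced sketch, with a made-up score function standing in for the real kernel): the compiled code is keyed on the type Int, not on the value 0, so every branch survives in the generated code:

using InteractiveUtils

# `which` is an ordinary Int; Julia specializes on Int, not on the
# runtime value 0, so all three branches are kept.
score(which::Int, r::Float64) = which == 0 ? r : which == 1 ? r^2 : sqrt(abs(r))

@code_llvm score(0, 1.0)  # the comparisons on `which` are still in the IR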

What julia has instead is an absolutely genius-level awesome model of specialization.

Specialization/monomorphization in general means: You have some function; you don’t just compile it once, but instead compile it into many different functions that you use at different places. The goal of this is that each of your many different “copies” of your function is applicable under different narrow circumstances; narrow circumstances mean that you have a lot of super useful additional assumptions you can use during optimization.

Julia’s killer model is: Each combination of concrete types for all parameters is one specialization. Hence, each specialization always knows all concrete types of all parameters during compilation. This plays, not by chance, extremely well with invariant, non-erased parametric types and dynamic multiple dispatch.
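A minimal sketch of that model (a made-up function, nothing from this thread):

double(x) = x + x

double(1)        # compiles a specialization for (::Int64,)
double(1.0)      # compiles a separate specialization for (::Float64,)
double(1 + 2im)  # ... and another one for (::Complex{Int64},)

Each of those compilations knows the concrete type of x throughout, so `+` lowers to a single machine instruction in the first two cases, with no dispatch left at runtime.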

I’d like to link graalvm here; it’s informative to see how much suckage julia gets to avoid because of this aspect of its design.

The thing is, the Rust version doesn’t need the template optimization; it produces optimal code with which being a plain int.

Yes, a compiler can do that. It’s called loop unswitching.
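Schematically (a made-up kernel, not this thread's code), unswitching rewrites

function f(which, xs)
    acc = 0.0
    for x in xs
        if which == 0      # loop-invariant condition, tested every iteration
            acc += x
        else
            acc += x^2
        end
    end
    acc
end

into the equivalent of

function f_unswitched(which, xs)
    acc = 0.0
    if which == 0          # tested once, outside the loop
        for x in xs
            acc += x
        end
    else
        for x in xs
            acc += x^2
        end
    end
    acc
end

i.e. the branch is hoisted out and the loop body is duplicated per case.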

Will the compiler unswitch the loop? Depends on heuristics. There’s a lot of code and nested loops to duplicate here!

Should you blindly rely on that? Hell no!

That makes your code unmaintainable and performance-wise unreadable: In order to guess whether your code is OK or a slow mess, the reader needs an accurate mental model of whether the condition has been loop unswitched! That is terrible mental overhead.

YOU know that this loop MUST be unswitched for performance. So you should use a template parameter / a Val{0} / lift the which integer to the type domain, to make it obvious to the reader as well.
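Concretely, the lifting looks something like this (a reduced sketch, not the actual opt_params):

# Value domain: one compiled body that keeps all the branches.
perf(which::Int, r::Float64) = which == 0 ? r : which == 1 ? r^2 : sqrt(abs(r))

# Type domain: Val{0}, Val{1}, Val{2} are three distinct types, so each call
# gets its own specialization in which `which` is a compile-time constant
# and the dead branches are removed.
perf(::Val{which}, r::Float64) where {which} =
    which == 0 ? r : which == 1 ? r^2 : sqrt(abs(r))

perf(0, 2.0)       # runtime branch on which
perf(Val(0), 2.0)  # Val(0) isa Val{0}: the branch folds away at compile time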

I guess that rust has different parameters for the various heuristics that go into the same loop unswitching pass in llvm. Or maybe this is all about the exact julia version? Perusing github, there was a lot of reshuffling on the loop unswitching pass in the last year.


What Julia version are you using? What’s your machine? I’m astonished by how different the reports here about the benchmarks are. For me, the Rust version takes 14 s vs. 45 s for the pre-foobar version, and adding inbounds-simd in one inner loop takes the Julia code down to 7 s, without any Val type.


Thanks! That’s amazing. Regarding your question on the Val docs:

It’s not clear at all from the docs that Val basically means “generate something like a template”. It would be good to explain that. Since you already mentioned that multiple dispatch is essentially doing what template specialization does under the hood, my mental model (correct me if wrong) is that Val basically treats each value of Int as a separate type, and would generate a new version of that function specialized for that type, removing all branches and paths related to the other values of int (types). I’m still not quite sure how that works in combination with what you said about Julia not being a traditional JIT: does it keep some kind of cache of all the values it saw, and when it sees a new one it triggers a recompile? That is probably something that should be mentioned in the docs as well.

So far so good. The counterexample, “However, making use of such techniques can be surprisingly subtle. For example, it would be of no help if you called array3 from a function like this:”, is where it gets messy. Why would this not work?
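To spell out the shape I’m asking about (my paraphrase of the docs’ counterexample, not their exact code):

function call_array3(fillval, n)
    # n is a runtime value here, so the compiler can't guess it, and
    # therefore can't know the concrete type of the Val(n) constructed
    # inside array3: the same problem all over again.
    A = array3(fillval, n)
end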

Also, it looks like this Val ordeal is a bit of an anti-pattern, as the Julia linter reports an unused variable for w here: function opt_params(w::Val{which}, ncases::Integer, x::Vector{Float64}) where which. So it seems like the linter doesn’t understand that pattern? Is that the case?

This was quite useful to grasp, though, as my code is full of this switching-on-some-enum code. Are enums treated the same way, needing to be wrapped in Vals, or are they special?

@lmiq
Julia 1.9.4
Rust 1.73
Ubuntu 22.04
11th Gen Intel(R) Core™ i5-1145G7 @ 2.60GHz
I ran it on an i9 MacBook too and the speed was the same, because the core boosts, so you get whatever the max boost is (4 GHz etc.), so it doesn’t make a big difference. If you’re on an ARM Mac it’s possible to be way worse, as those cores don’t boost and have lower clock speeds, and I think the perf world still lives in x64 mostly, so LLVM is probably doing a better job there.


You can remove the w there.
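That is, a bare ::Val{which} keeps the dispatch without binding an unused name; a minimal sketch (stub body, just to show the signature):

function opt_params(::Val{which}, ncases::Integer, x::Vector{Float64}) where which
    # `which` is still usable in the body as a compile-time constant;
    # there is simply no `w` binding left for the linter to flag.
    return which
end

opt_params(Val(1), 10, zeros(3))  # returns 1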

For the record, I get:

With the exact code posted by @foobar_lv2 (just added @time in front of main()):

Julia 1.9.4:
3.773100 seconds (230.86 k allocations: 20.268 MiB, 1.67% gc time, 5.55% compilation time)

Julia 1.10+rc1:
3.844843 seconds (231.14 k allocations: 20.364 MiB, 4.26% compilation time)

Removing the ::Val{which} stuff, I get:

Julia 1.9.4:
3.858778 seconds (193.41 k allocations: 17.789 MiB, 1.50% gc time, 3.37% compilation time)

Julia 1.10+rc1
3.785784 seconds (193.30 k allocations: 17.899 MiB, 4.59% compilation time)

So, basically no difference relative to the previous code, or between versions.

code run here, without the ::Val{which} stuff
using Random
using Printf
using Dates  # for now()/nanosecond used by the default MarsagliaRng() constructor

struct ParamsResult
    short_term::Int
    long_term::Int
    performance::Float64
end

mutable struct MarsagliaRng
    q::Vector{UInt32}
    carry::UInt32
    mwc256_initialized::Bool
    mwc256_seed::Int32
    i::UInt8

    function MarsagliaRng(seed::Vector{UInt8})
        q = zeros(UInt32, 256)
        carry = 362436
        mwc256_initialized = false
        mwc256_seed = reinterpret(Int32, seed)[1]
        i = 255
        new(q, carry, mwc256_initialized, mwc256_seed, i)
    end
end

# Default constructor using system time for seeding
function MarsagliaRng()
    ts_nano = UInt8(nanosecond(now()))
    seed = [ts_nano, ts_nano >>> 8, ts_nano >>> 16, ts_nano >>> 24]
    MarsagliaRng(seed)
end

# Random number generator functions
function next_u32(rng::MarsagliaRng)::UInt32
    a = UInt64(809430660)

    if !rng.mwc256_initialized
        j = UInt32(rng.mwc256_seed)
        c1 = UInt32(69069)
        c2 = UInt32(12345)
        rng.mwc256_initialized = true
        for k in 1:256
            j = (c1 * j + c2) % UInt32
            rng.q[k] = j
        end
    end

    rng.i = (rng.i + 1) % 256
    t = a * UInt64(rng.q[rng.i+1]) + UInt64(rng.carry)
    rng.carry = (t >> 32)
    rng.q[rng.i+1] = t % UInt32

    return rng.q[rng.i+1]
end

function next_u64(rng::MarsagliaRng)::UInt64
    UInt64(next_u32(rng))
end

# EDIT: extend Base.rand rather than shadowing it.
# Sample function for generating a Float64
function Base.rand(rng::MarsagliaRng)::Float64
    mult = 1.0 / typemax(UInt32)
    mult * next_u32(rng)
end


function opt_params(which, ncases::Integer, x::Vector{Float64})
    FloatT = eltype(x)
    xs = cumsum(x)
    best_perf = typemin(FloatT)
    ibestshort = 0
    ibestlong = 0

    small_float = 1e-60
    @inbounds for ilong in 2:199
        for ishort in 1:(ilong-1)
            total_return = zero(FloatT)
            win_sum = small_float
            lose_sum = small_float
            sum_squares = small_float
            short_sum = zero(FloatT)
            long_sum = zero(FloatT)


            let i = ilong - 1
                short_sum = xs[i+1] - xs[i-ishort + 2 - 1] 
                long_sum = xs[i+1]

                short_mean = short_sum / ishort
                long_mean = long_sum / ilong

                ret = (short_mean > long_mean) ? x[i+2] - x[i+1] :
                      (short_mean < long_mean) ? x[i+1] - x[i+2] : zero(FloatT)

                total_return += ret
                sum_squares += ret^2
                if ret > zero(FloatT)
                    win_sum += ret
                else
                    lose_sum -= ret
                end
            end

            # General case; i != ilong - 1
            @fastmath @simd for i in (ilong:ncases-2)
                x1 = x[i+1]
		#short_sum += x1 - x[i-ishort+1]
		#long_sum += x1 - x[i-ilong+1]

                short_sum = xs[i+1] - xs[i-ishort + 1] 
                long_sum = xs[i+1] - xs[i-ilong + 1]

                short_mean = short_sum / ishort
                long_mean = long_sum / ilong

                ret = (short_mean > long_mean) ? x[i+2] - x1 :
                      (short_mean < long_mean) ? x1 - x[i+2] : zero(FloatT)

                total_return += ret
                sum_squares += ret^2
                if ret > zero(FloatT)
                    win_sum += ret
                else
                    lose_sum -= ret
                end
            end

            if which == 0
                total_return /= (ncases - ilong)
                if total_return > best_perf
                    best_perf = total_return
                    ibestshort = ishort
                    ibestlong = ilong
                end
            elseif which == 1
                pf = win_sum / lose_sum
                if pf > best_perf
                    best_perf = pf
                    ibestshort = ishort
                    ibestlong = ilong
                end
            elseif which == 2
                total_return /= (ncases - ilong)
                sum_squares /= (ncases - ilong)
                sum_squares -= total_return^2
                sr = total_return / (sqrt(sum_squares) + 1e-8)
                if sr > best_perf
                    best_perf = sr
                    ibestshort = ishort
                    ibestlong = ilong
                end
            end
        end
    end 

    ParamsResult(ibestshort, ibestlong, best_perf)
end

function test_system(ncases::Integer, x::Vector{Float64}, short_term::Integer, long_term::Integer)
    FloatT = eltype(x)
    sum1 = zero(FloatT)

    for i in (long_term-1:ncases-2)
        short_mean = sum(@view x[i-short_term+2:i+1]) / short_term
        long_mean = sum(@view x[i-long_term+2:i+1]) / long_term

        if short_mean > long_mean
            sum1 += x[i+2] - x[i+1]
        elseif short_mean < long_mean
            sum1 -= x[i+2] - x[i+1]
        end

    end
    sum1 / (ncases - long_term)
end

function main()
    if length(ARGS) < 4
        println("Usage: julia script.jl <which> <ncases> <trend> <nreps>")
        return
    end

    which = parse(Int, ARGS[1])
    ncases = parse(Int, ARGS[2])
    save_trend = parse(Float64, ARGS[3])
    nreps = parse(Int, ARGS[4])

    optimize(which, ncases, save_trend, nreps)
end

function optimize(which::Integer, ncases::Integer, save_trend::Float64, nreps::Integer, rng=MarsagliaRng(UInt8.([33, 0, 0, 0])))
    @show typeof(rng)
    _optimize(rng, which, ncases, save_trend, nreps)
end

function _optimize(rng, which::Integer, ncases::Integer, save_trend::Float64, nreps::Integer)
    print_results = false

    FloatT = typeof(save_trend)
    x = zeros(FloatT, ncases)

    is_mean = zero(FloatT)
    oos_mean = zero(FloatT)
    futures = Task[]

    @inbounds for irep in 1:nreps
        # Generate in-sample
        trend = save_trend
        x[1] = zero(FloatT)
        for i in 2:ncases
            if (i - 1) % 50 == 0
                trend = -trend
            end
            x[i] = x[i-1] + trend + rand(rng) + rand(rng) - rand(rng) - rand(rng)
        end
        x_optim = copy(x)

        # Generate out-of-sample
        trend = save_trend
        x[1] = zero(FloatT)
        for i in 2:ncases
            if (i - 1) % 50 == 0
                trend = -trend
            end
            x[i] = x[i-1] + trend + rand(rng) + rand(rng) - rand(rng) - rand(rng)
        end
        x_oos = copy(x)
        future = Threads.@spawn begin
        params_result = opt_params(which, ncases, x_optim)

        oos_perf = test_system(ncases, x_oos, params_result.short_term, params_result.long_term)
        (params_result, oos_perf)
        end
        push!(futures, future)


    end
    for (irep, future) in enumerate(futures)
        params_result::ParamsResult, oos_perf::Float64 = fetch(future)
        is_mean += params_result.performance
        oos_mean += oos_perf
        if print_results
            @printf("%3d: %3d %3d  %8.4f %8.4f (%8.4f)\n",
                    irep, params_result.short_term, params_result.long_term, params_result.performance, oos_perf, params_result.performance - oos_perf)
        end
    end

    is_mean /= nreps
    oos_mean /= nreps

    println("Mean IS=$is_mean  OOS=$oos_mean  Bias=$(is_mean - oos_mean)")

end

@time main()

The code above, without any @inbounds, @fastmath, or @simd:

Julia 1.9.4:
11.083806 seconds (169.93 k allocations: 16.234 MiB, 1.22% compilation time)

Julia 1.10+rc1:
42.888539 seconds (168.42 k allocations: 16.271 MiB, 0.38% compilation time)

So here I notice a huge regression relative to 1.9.4.

code run here
using Random
using Printf
using Dates  # for now()/nanosecond used by the default MarsagliaRng() constructor

struct ParamsResult
    short_term::Int
    long_term::Int
    performance::Float64
end

mutable struct MarsagliaRng
    q::Vector{UInt32}
    carry::UInt32
    mwc256_initialized::Bool
    mwc256_seed::Int32
    i::UInt8

    function MarsagliaRng(seed::Vector{UInt8})
        q = zeros(UInt32, 256)
        carry = 362436
        mwc256_initialized = false
        mwc256_seed = reinterpret(Int32, seed)[1]
        i = 255
        new(q, carry, mwc256_initialized, mwc256_seed, i)
    end
end

# Default constructor using system time for seeding
function MarsagliaRng()
    ts_nano = UInt8(nanosecond(now()))
    seed = [ts_nano, ts_nano >>> 8, ts_nano >>> 16, ts_nano >>> 24]
    MarsagliaRng(seed)
end

# Random number generator functions
function next_u32(rng::MarsagliaRng)::UInt32
    a = UInt64(809430660)

    if !rng.mwc256_initialized
        j = UInt32(rng.mwc256_seed)
        c1 = UInt32(69069)
        c2 = UInt32(12345)
        rng.mwc256_initialized = true
        for k in 1:256
            j = (c1 * j + c2) % UInt32
            rng.q[k] = j
        end
    end

    rng.i = (rng.i + 1) % 256
    t = a * UInt64(rng.q[rng.i+1]) + UInt64(rng.carry)
    rng.carry = (t >> 32)
    rng.q[rng.i+1] = t % UInt32

    return rng.q[rng.i+1]
end

function next_u64(rng::MarsagliaRng)::UInt64
    UInt64(next_u32(rng))
end

# EDIT: extend Base.rand rather than shadowing it.
# Sample function for generating a Float64
function Base.rand(rng::MarsagliaRng)::Float64
    mult = 1.0 / typemax(UInt32)
    mult * next_u32(rng)
end


function opt_params(which, ncases::Integer, x::Vector{Float64})
    FloatT = eltype(x)
    xs = cumsum(x)
    best_perf = typemin(FloatT)
    ibestshort = 0
    ibestlong = 0

    small_float = 1e-60
    for ilong in 2:199
        for ishort in 1:(ilong-1)
            total_return = zero(FloatT)
            win_sum = small_float
            lose_sum = small_float
            sum_squares = small_float
            short_sum = zero(FloatT)
            long_sum = zero(FloatT)


            let i = ilong - 1
                short_sum = xs[i+1] - xs[i-ishort + 2 - 1] 
                long_sum = xs[i+1]

                short_mean = short_sum / ishort
                long_mean = long_sum / ilong

                ret = (short_mean > long_mean) ? x[i+2] - x[i+1] :
                      (short_mean < long_mean) ? x[i+1] - x[i+2] : zero(FloatT)

                total_return += ret
                sum_squares += ret^2
                if ret > zero(FloatT)
                    win_sum += ret
                else
                    lose_sum -= ret
                end
            end

            # General case; i != ilong - 1
            for i in (ilong:ncases-2)
                x1 = x[i+1]
		#short_sum += x1 - x[i-ishort+1]
		#long_sum += x1 - x[i-ilong+1]

                short_sum = xs[i+1] - xs[i-ishort + 1] 
                long_sum = xs[i+1] - xs[i-ilong + 1]

                short_mean = short_sum / ishort
                long_mean = long_sum / ilong

                ret = (short_mean > long_mean) ? x[i+2] - x1 :
                      (short_mean < long_mean) ? x1 - x[i+2] : zero(FloatT)

                total_return += ret
                sum_squares += ret^2
                if ret > zero(FloatT)
                    win_sum += ret
                else
                    lose_sum -= ret
                end
            end

            if which == 0
                total_return /= (ncases - ilong)
                if total_return > best_perf
                    best_perf = total_return
                    ibestshort = ishort
                    ibestlong = ilong
                end
            elseif which == 1
                pf = win_sum / lose_sum
                if pf > best_perf
                    best_perf = pf
                    ibestshort = ishort
                    ibestlong = ilong
                end
            elseif which == 2
                total_return /= (ncases - ilong)
                sum_squares /= (ncases - ilong)
                sum_squares -= total_return^2
                sr = total_return / (sqrt(sum_squares) + 1e-8)
                if sr > best_perf
                    best_perf = sr
                    ibestshort = ishort
                    ibestlong = ilong
                end
            end
        end
    end 

    ParamsResult(ibestshort, ibestlong, best_perf)
end

function test_system(ncases::Integer, x::Vector{Float64}, short_term::Integer, long_term::Integer)
    FloatT = eltype(x)
    sum1 = zero(FloatT)

    for i in (long_term-1:ncases-2)
        short_mean = sum(@view x[i-short_term+2:i+1]) / short_term
        long_mean = sum(@view x[i-long_term+2:i+1]) / long_term

        if short_mean > long_mean
            sum1 += x[i+2] - x[i+1]
        elseif short_mean < long_mean
            sum1 -= x[i+2] - x[i+1]
        end

    end
    sum1 / (ncases - long_term)
end

function main()
    if length(ARGS) < 4
        println("Usage: julia script.jl <which> <ncases> <trend> <nreps>")
        return
    end

    which = parse(Int, ARGS[1])
    ncases = parse(Int, ARGS[2])
    save_trend = parse(Float64, ARGS[3])
    nreps = parse(Int, ARGS[4])

    optimize(which, ncases, save_trend, nreps)
end

function optimize(which::Integer, ncases::Integer, save_trend::Float64, nreps::Integer, rng=MarsagliaRng(UInt8.([33, 0, 0, 0])))
    @show typeof(rng)
    _optimize(rng, which, ncases, save_trend, nreps)
end

function _optimize(rng, which::Integer, ncases::Integer, save_trend::Float64, nreps::Integer)
    print_results = false

    FloatT = typeof(save_trend)
    x = zeros(FloatT, ncases)

    is_mean = zero(FloatT)
    oos_mean = zero(FloatT)
    futures = Task[]

    for irep in 1:nreps
        # Generate in-sample
        trend = save_trend
        x[1] = zero(FloatT)
        for i in 2:ncases
            if (i - 1) % 50 == 0
                trend = -trend
            end
            x[i] = x[i-1] + trend + rand(rng) + rand(rng) - rand(rng) - rand(rng)
        end
        x_optim = copy(x)

        # Generate out-of-sample
        trend = save_trend
        x[1] = zero(FloatT)
        for i in 2:ncases
            if (i - 1) % 50 == 0
                trend = -trend
            end
            x[i] = x[i-1] + trend + rand(rng) + rand(rng) - rand(rng) - rand(rng)
        end
        x_oos = copy(x)
        future = Threads.@spawn begin
        params_result = opt_params(which, ncases, x_optim)

        oos_perf = test_system(ncases, x_oos, params_result.short_term, params_result.long_term)
        (params_result, oos_perf)
        end
        push!(futures, future)


    end
    for (irep, future) in enumerate(futures)
        params_result::ParamsResult, oos_perf::Float64 = fetch(future)
        is_mean += params_result.performance
        oos_mean += oos_perf
        if print_results
            @printf("%3d: %3d %3d  %8.4f %8.4f (%8.4f)\n",
                    irep, params_result.short_term, params_result.long_term, params_result.performance, oos_perf, params_result.performance - oos_perf)
        end
    end

    is_mean /= nreps
    oos_mean /= nreps

    println("Mean IS=$is_mean  OOS=$oos_mean  Bias=$(is_mean - oos_mean)")

end

@time main()

The code above, adding @simd to the inner loop:

            @simd for i in (ilong:ncases-2)
                x1 = x[i+1]

Julia 1.9.4:
3.897196 seconds (179.15 k allocations: 16.861 MiB, 3.85% compilation time)

Julia 1.10+rc1:
3.889199 seconds (177.21 k allocations: 16.855 MiB, 4.57% compilation time)

So, basically the performance of the original @foobar_lv2 code is recovered.

code run here
using Random
using Printf
using Dates  # for now()/nanosecond used by the default MarsagliaRng() constructor

struct ParamsResult
    short_term::Int
    long_term::Int
    performance::Float64
end

mutable struct MarsagliaRng
    q::Vector{UInt32}
    carry::UInt32
    mwc256_initialized::Bool
    mwc256_seed::Int32
    i::UInt8

    function MarsagliaRng(seed::Vector{UInt8})
        q = zeros(UInt32, 256)
        carry = 362436
        mwc256_initialized = false
        mwc256_seed = reinterpret(Int32, seed)[1]
        i = 255
        new(q, carry, mwc256_initialized, mwc256_seed, i)
    end
end

# Default constructor using system time for seeding
function MarsagliaRng()
    ts_nano = UInt8(nanosecond(now()))
    seed = [ts_nano, ts_nano >>> 8, ts_nano >>> 16, ts_nano >>> 24]
    MarsagliaRng(seed)
end

# Random number generator functions
function next_u32(rng::MarsagliaRng)::UInt32
    a = UInt64(809430660)

    if !rng.mwc256_initialized
        j = UInt32(rng.mwc256_seed)
        c1 = UInt32(69069)
        c2 = UInt32(12345)
        rng.mwc256_initialized = true
        for k in 1:256
            j = (c1 * j + c2) % UInt32
            rng.q[k] = j
        end
    end

    rng.i = (rng.i + 1) % 256
    t = a * UInt64(rng.q[rng.i+1]) + UInt64(rng.carry)
    rng.carry = (t >> 32)
    rng.q[rng.i+1] = t % UInt32

    return rng.q[rng.i+1]
end

function next_u64(rng::MarsagliaRng)::UInt64
    UInt64(next_u32(rng))
end

# EDIT: extend Base.rand rather than shadowing it.
# Sample function for generating a Float64
function Base.rand(rng::MarsagliaRng)::Float64
    mult = 1.0 / typemax(UInt32)
    mult * next_u32(rng)
end


function opt_params(which, ncases::Integer, x::Vector{Float64})
    FloatT = eltype(x)
    xs = cumsum(x)
    best_perf = typemin(FloatT)
    ibestshort = 0
    ibestlong = 0

    small_float = 1e-60
    for ilong in 2:199
        for ishort in 1:(ilong-1)
            total_return = zero(FloatT)
            win_sum = small_float
            lose_sum = small_float
            sum_squares = small_float
            short_sum = zero(FloatT)
            long_sum = zero(FloatT)


            let i = ilong - 1
                short_sum = xs[i+1] - xs[i-ishort + 2 - 1] 
                long_sum = xs[i+1]

                short_mean = short_sum / ishort
                long_mean = long_sum / ilong

                ret = (short_mean > long_mean) ? x[i+2] - x[i+1] :
                      (short_mean < long_mean) ? x[i+1] - x[i+2] : zero(FloatT)

                total_return += ret
                sum_squares += ret^2
                if ret > zero(FloatT)
                    win_sum += ret
                else
                    lose_sum -= ret
                end
            end

            # General case; i != ilong - 1
            @simd for i in (ilong:ncases-2)
                x1 = x[i+1]
		#short_sum += x1 - x[i-ishort+1]
		#long_sum += x1 - x[i-ilong+1]

                short_sum = xs[i+1] - xs[i-ishort + 1] 
                long_sum = xs[i+1] - xs[i-ilong + 1]

                short_mean = short_sum / ishort
                long_mean = long_sum / ilong

                ret = (short_mean > long_mean) ? x[i+2] - x1 :
                      (short_mean < long_mean) ? x1 - x[i+2] : zero(FloatT)

                total_return += ret
                sum_squares += ret^2
                if ret > zero(FloatT)
                    win_sum += ret
                else
                    lose_sum -= ret
                end
            end

            if which == 0
                total_return /= (ncases - ilong)
                if total_return > best_perf
                    best_perf = total_return
                    ibestshort = ishort
                    ibestlong = ilong
                end
            elseif which == 1
                pf = win_sum / lose_sum
                if pf > best_perf
                    best_perf = pf
                    ibestshort = ishort
                    ibestlong = ilong
                end
            elseif which == 2
                total_return /= (ncases - ilong)
                sum_squares /= (ncases - ilong)
                sum_squares -= total_return^2
                sr = total_return / (sqrt(sum_squares) + 1e-8)
                if sr > best_perf
                    best_perf = sr
                    ibestshort = ishort
                    ibestlong = ilong
                end
            end
        end
    end 

    ParamsResult(ibestshort, ibestlong, best_perf)
end

function test_system(ncases::Integer, x::Vector{Float64}, short_term::Integer, long_term::Integer)
    FloatT = eltype(x)
    sum1 = zero(FloatT)

    for i in (long_term-1:ncases-2)
        short_mean = sum(@view x[i-short_term+2:i+1]) / short_term
        long_mean = sum(@view x[i-long_term+2:i+1]) / long_term

        if short_mean > long_mean
            sum1 += x[i+2] - x[i+1]
        elseif short_mean < long_mean
            sum1 -= x[i+2] - x[i+1]
        end

    end
    sum1 / (ncases - long_term)
end

function main()
    if length(ARGS) < 4
        println("Usage: julia script.jl <which> <ncases> <trend> <nreps>")
        return
    end

    which = parse(Int, ARGS[1])
    ncases = parse(Int, ARGS[2])
    save_trend = parse(Float64, ARGS[3])
    nreps = parse(Int, ARGS[4])

    optimize(which, ncases, save_trend, nreps)
end

function optimize(which::Integer, ncases::Integer, save_trend::Float64, nreps::Integer, rng=MarsagliaRng(UInt8.([33, 0, 0, 0])))
    @show typeof(rng)
    _optimize(rng, which, ncases, save_trend, nreps)
end

function _optimize(rng, which::Integer, ncases::Integer, save_trend::Float64, nreps::Integer)
    print_results = false

    FloatT = typeof(save_trend)
    x = zeros(FloatT, ncases)

    is_mean = zero(FloatT)
    oos_mean = zero(FloatT)
    futures = Task[]

    for irep in 1:nreps
        # Generate in-sample
        trend = save_trend
        x[1] = zero(FloatT)
        for i in 2:ncases
            if (i - 1) % 50 == 0
                trend = -trend
            end
            x[i] = x[i-1] + trend + rand(rng) + rand(rng) - rand(rng) - rand(rng)
        end
        x_optim = copy(x)

        # Generate out-of-sample
        trend = save_trend
        x[1] = zero(FloatT)
        for i in 2:ncases
            if (i - 1) % 50 == 0
                trend = -trend
            end
            x[i] = x[i-1] + trend + rand(rng) + rand(rng) - rand(rng) - rand(rng)
        end
        x_oos = copy(x)
        future = Threads.@spawn begin
        params_result = opt_params(which, ncases, x_optim)

        oos_perf = test_system(ncases, x_oos, params_result.short_term, params_result.long_term)
        (params_result, oos_perf)
        end
        push!(futures, future)


    end
    for (irep, future) in enumerate(futures)
        params_result::ParamsResult, oos_perf::Float64 = fetch(future)
        is_mean += params_result.performance
        oos_mean += oos_perf
        if print_results
            @printf("%3d: %3d %3d  %8.4f %8.4f (%8.4f)\n",
                    irep, params_result.short_term, params_result.long_term, params_result.performance, oos_perf, params_result.performance - oos_perf)
        end
    end

    is_mean /= nreps
    oos_mean /= nreps

    println("Mean IS=$is_mean  OOS=$oos_mean  Bias=$(is_mean - oos_mean)")

end

@time main()

Conclusions

So, from these runs:

  1. The @simd in that inner loop is the most crucial performance change.
  2. The ::Val{which} stuff doesn’t make any difference for me.
  3. There is a huge performance regression associated with how the inner loop is lowered in 1.10+rc1 relative to 1.9.4.

I’ve reported an issue about this regression: simd performance regression in 1.10+rc1 relative to 1.9.4 · Issue #52307 · JuliaLang/julia · GitHub

(edit: @inbounds in that inner loop makes no difference; @simd alone is enough. Does it imply @inbounds?)


It’s not really the most crucial performance change, but rather the last brick in the wall. The most substantial change was to the moving-window-sum construction:

The original code used the “rolling sum” model, where you update window_sum += x[i] - x[i-offset]. My changed code precomputes xs = cumsum(x), outside of all the loops, and then does window_sum = xs[i] - xs[i - offset].

This has the advantageous effect that loop iteration i+1 is mostly independent of loop iteration i.

You can see how this plays well with SIMD. It also plays well with superscalar / ILP.
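In sketch form (hypothetical names, not the real kernel):

x = rand(1000)
offset = 20

# Rolling-sum model: window_sum carries from one iteration to the next,
# a loop-carried dependency that serializes the loop.
function rolling(x, offset)
    window_sum = sum(@view x[1:offset])
    out = 0.0
    for i in offset+1:length(x)
        window_sum += x[i] - x[i-offset]   # window now ends at i
        out += window_sum
    end
    out
end

# Prefix-sum model: each iteration reads only the precomputed xs,
# so iterations are independent and can SIMD / overlap on the CPU.
function prefixed(xs, offset)
    out = 0.0
    @simd for i in offset+1:length(xs)
        out += xs[i] - xs[i-offset]        # same window sum, no carry
    end
    out
end

xs = cumsum(x)  # one pass, outside all loops
rolling(x, offset) ≈ prefixed(xs, offset)  # true, up to rounding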

I expect that rust / c++ would profit from the same change.


Thanks. The issue I had was that in 1.10 the loop is not being simd’d, so the changes were not effective for me.

@foobar_lv2, out of curiosity, and maybe something worth adding to the performance docs: where do trivial allocations generally creep in, in super-optimized Julia code like your final version:

julia> @time optimize(0, 1000, 0.01, 200)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047785  OOS=-0.0031496723033457423  Bias=0.04777531559382359
  3.235483 seconds (639 allocations: 1.570 MiB)

Are all the trivial var = zero(FloatT) counted as allocs? I assume those get stack-allocated. Other than the views and the initial x allocation, I can’t see a very obvious path to more allocs. E.g. here, I would assume (hope) that when a view is used and never escapes it’s stack-allocated, since views do get used a lot in hot paths and loops, right?

short_mean = sum(@view x[i-short_term+2:i+1]) / short_term
long_mean = sum(@view x[i-long_term+2:i+1]) / long_term
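One quick way to sanity-check that (a small sketch, not from the thread):

const v = rand(100)
g(v) = sum(@view v[2:end-1])
g(v)             # warm up, so compilation is not measured
@allocated g(v)  # expected 0: the non-escaping view stays off the heap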

And I guess cumsum can be replaced with cumsum! plus a preallocated buffer, so we save nreps allocations of size x. Even then I still get 440 allocs:

function opt_params!(w::Val{which}, ncases::Integer, x::Vector{Float64}, x_cs::Vector{Float64}) where which
    FloatT = eltype(x)
    xs = cumsum!(x_cs, x)
    # (rest of the function as before)

julia> @time optimize(0, 1000, 0.01, 200)
typeof(rng) = MarsagliaRng
Mean IS=0.04462564329047785  OOS=-0.0031496723033457423  Bias=0.04777531559382359
  3.226711 seconds (440 allocations: 28.594 KiB)

Lastly, is there a tool that can give you a breakdown of allocs per function in some kind of nice format? I know you can run Julia with the memory profiler and get a line-by-line allocation trace; I’m more curious for the purposes of CI: something that generates a table of functions with alloc count / size per function, so those can be diffed for regression purposes, or whether the benchmark packages can do something like that for perf CI testing.

The code is not super optimized at all, and especially not for allocations. I had some mental model how the code should work, and then did lazy minimal changes with occasional benchmarks to arrive at that point.

Regarding allocations, xs = cumsum(x), x_optim = copy(x), and x_oos = copy(x) are the major ones, which amount to 200 * 3 allocs of 8 kB each, hence 4.8 MB out of the measured 4.841 MiB.

These are necessary in order to allow multi-threading.

Generally, one should keep allocs out of the inner loops and try to reuse buffers where it makes sense. But as long as the allocation density is low, allocations are not a performance killer.


This might interest you, JuliaLang/AllocCheck.jl: AllocCheck (github.com).


@feribg - so, what’s the verdict vs C++/Rust after all these optimizations :nerd_face: