The same async program run on (Linux+AMD) vs (Win+Intel), over 6x performance difference

WalterMadelim · October 19, 2025, 6:09am

Long story short:
I wrote a async programming code which uses multithreads. The same code is run on two servers. The Linux server is 6.5 times faster than the other (both attaining the same correct results). I’m not hurry since one of the result is already satisfactory for me. But I’m still curious about the underlying reason. Hope to hear any ideas.

Details:
The physical problem is a coordination problem about mathematical optimization.
It happens that I have both a linux and a win10 server at present. So I did a comparative test.
The win10 server has 52 physical cores, while the linux server has 2*64 physical cores.
Since the linux’s hardware is stronger, the (same) source code was tailored for the win10’s for fairness. And in addition, the Threads.nthreads() is both equal to 52 when testing.

Results:

# This is linux's result
masterˈs_algorithm (Dual Opt) time = 325.7915050983429
Primal Opt time = 2.2860770225524902

# This is win10's result
masterˈs_algorithm (Dual Opt) time = 2114.335000038147
Primal Opt time = 4.42900013923645

The Dual Opt is where async programming occurs, the linux is 6.5 times faster.
The Primal Opt is merely a Gurobi (black-box, centralized) solve, so the the performance difference is only 2x.

Platform:

julia> versioninfo()
Julia Version 1.12.1
Commit ba1e628ee49 (2025-10-17 13:02 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 256 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver3)
  GC: Built with stock GC
Threads: 52 default, 1 interactive, 52 GC (on 256 virtual cores)

julia> versioninfo()
Julia Version 1.12.1
Commit ba1e628ee4 (2025-10-17 13:02 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 52 × Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, cascadelake)
  GC: Built with stock GC
Threads: 52 default, 1 interactive, 52 GC (on 104 virtual cores)

Src Code Tested

import JuMP, Gurobi
import JuMP.value as ı
import LinearAlgebra.⋅ as ⋅
import Statistics, Random
const GRB_ENV = Gurobi.Env();
const my_seed = 16
const K = 20;
const LOGIS = 52;

macro get_int_decision(model, expr) return esc(quote
    let e = JuMP.@expression($model, $expr), a
        a = map(_ -> JuMP.@variable($model, integer = true), e)
        JuMP.@constraint($model, a .== e)
        a
    end
end) end;
function get_simple_model()
    m = JuMP.direct_model(Gurobi.Optimizer(GRB_ENV))
    JuMP.set_silent(m)
    JuMP.set_attribute(m, "Threads", 1)
    m
end;
function get_Bool_value(x)
    f = z -> round(Bool, ı(z))
    ndims(x) == 0 && return f(x)
    map(f, x)
end;
function get_C_and_O()
    C = [
        [10, 9, 9, 9, 10, 12, 15, 18, 20, 18, 16, 15, 14, 15, 16, 18, 22, 28, 32, 30, 26, 20, 15, 12],
        [12, 11, 11, 12, 14, 18, 24, 26, 22, 18, 15, 14, 13, 14, 18, 24, 30, 34, 32, 28, 22, 18, 15, 13],
        [16, 15, 14, 14, 15, 18, 24, 30, 28, 22, 18, 12, 10, 12, 16, 22, 28, 34, 36, 32, 28, 24, 20, 18],
        [8, 8, 8, 9, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 36, 34, 30, 26, 20, 14],
        [5, 5, 5, 6, 8, 12, 18, 24, 28, 32, 36, 38, 36, 34, 30, 28, 26, 24, 22, 20, 18, 14, 10, 8]
    ]
    O = [
        [28,28,27,27,28,29,31,33,35,36,37,38,38,38,37,36,35,34,32,31,30,29,29,28],
        [29,28,28,28,29,31,33,35,37,39,40,41,42,42,41,40,38,36,34,33,32,31,30,29],
        [27,27,27,27,28,29,30,32,33,34,35,35,35,35,34,33,32,31,30,29,28,28,28,27],
        [32,32,31,31,32,34,36,38,40,42,43,44,44,44,43,42,41,40,38,37,36,35,34,33],
        [27,27,27,27,28,29,30,33,35,37,38,39,39,39,38,37,36,34,32,31,30,29,28,27]
    ]
    return C[rand(1:5)], O[rand(1:5)]
end;
function get_pair_and_self_Rng(J)
    d = 2J÷4
    1:d, d+1:J # Rng1, Rng2
end;
function prev(t, d, T) return (n = t-d; n<1 ? T+n : n) end;
function pc_P_AC(O, OH, CND, Q_I, COP) return ceil(Int, ((maximum(O) - OH)CND + maximum(Q_I)) / COP) end;
function get_E_ES(Rng)::NamedTuple
    M = rand(Rng) # Max E
    i = rand(0:M) # initial SOC _is_ this value
    e = rand(0:min(M, 21)) # ending SOC should ≥ this value
    (; i, M, e)
end;
function add_ES_module!(model, P_ES, E_ES)
    pES = ( # eES is dependent variable
        c = JuMP.@variable(model, [1:T], lower_bound = 0),
        d = JuMP.@variable(model, [1:T], lower_bound = 0)
    ); bES = JuMP.@variable(model, [1:T], Bin)
        JuMP.@constraint(model, pES.c .≤ (bES)P_ES.c)
        JuMP.@constraint(model, pES.d .≤ (1 .- bES)P_ES.d)
    eES = JuMP.@variable(model, [t=1:T], lower_bound = t<T ? 0 : E_ES.e, upper_bound = E_ES.M)
    JuMP.@constraint(model, [t=1:T], pES.c[t]*.95 - pES.d[t]/.95 == eES[t]-(t>1 ? eES[t-1] : E_ES.i))
    return pES, bES, eES
end;
function gen_ac_data()::Tuple
    CND   = .5rand(1:7) 
    INR   = rand(6:20)  
    COP   = rand(2:.5:4)
    Q_I   = rand(3:9, T)
    Q_BUS = rand(25:35) 
    OH    = rand(24:29) 
    OΔ    = rand(4:9)   
    P_AC  = pc_P_AC(O, OH, CND, Q_I, COP)
    return CND, INR, COP, Q_I, Q_BUS, OH, OΔ, P_AC
end;
function add_AC_module!(model, O, CND, INR, COP, Q_I, Q_BUS, OH, OΔ, P_AC)
    pAC = JuMP.@variable(model, [1:T], lower_bound = 0, upper_bound = P_AC)
    o = JuMP.@variable(model, [1:T], lower_bound = OH-OΔ, upper_bound = OH)
    q = JuMP.@variable(model, [1:T], lower_bound = 0, upper_bound = Q_BUS)
    JuMP.@constraint(model, [t=1:T], (O[t]-o[t])CND + Q_I[t] -q[t] -pAC[t]COP == (o[t<T ? t+1 : 1]-o[t])INR)
    return o, q, pAC
end;
function add_U_module!(model, U)
    bU::Matrix{JuMP.VariableRef} = JuMP.@variable(model, [t = 1:T, i = eachindex(U)], Bin)
    JuMP.@constraint(model, sum(bU; dims = 1) .≥ true)
    pU = JuMP.@expression(model, [t=1:T], sum(sum(bU[prev(t,φ-1,T), i]P for (φ,P) = enumerate(v)) for (i,v) = enumerate(U))) # Vector{JuMP.AffExpr}
    return bU, pU
end;
function add_self_EV_module!(model, P_EV, E_EV) # self means a household in a block with cardinality 1
    bEV, pEV = JuMP.@variable(model, [1:T], Bin), JuMP.@variable(model, [1:T])
    JuMP.@constraint(model, (P_EV.m)bEV .≤ pEV)
    JuMP.@constraint(model, pEV .≤ (P_EV.M)bEV)
    JuMP.@constraint(model, sum(pEV) ≥ E_EV)
    return bEV, pEV # bEV can be _inferred from_ pEV
end;
function pc_self_P_BUS(D, U, P_EV, E_EV, O, CND, INR, COP, Q_I, OH, OΔ, P_AC)::Int
    model = get_simple_model()
    bU, pU = add_U_module!(model, U)
    bEV, pEV = add_self_EV_module!(model, P_EV, E_EV)
    o, q, pAC = add_AC_module!(model, O, CND, INR, COP, Q_I, 0, OH, OΔ, P_AC) # Q_BUS = 0
    pBus = JuMP.@variable(model)
    JuMP.@constraint(model, pBus .≥ D + pU + pEV + pAC) # No G | ES
    JuMP.@objective(model, Min, pBus)
    @lock insset_lock push!(insset, model)
    JuMP.optimize!(model)
    @lock insset_lock delete!(insset, model)
    ps, ts = JuMP.primal_status(model), JuMP.termination_status(model)
    if ps != JuMP.FEASIBLE_POINT || ts ∉ [JuMP.OPTIMAL, JuMP.INTERRUPTED]
        error(string(ps, ts))
    end
    val = JuMP.objective_value(model)
    val > 0 || error("The self household has P_BUS = $val")
    ceil(Int, val) # P_BUS
end;
function add_EV_1_module!(model, P_EV_1, E_EV_1)
    bLent, pLent = JuMP.@variable(model, [1:T], Bin), JuMP.@variable(model, [1:T])
    bEV_1, pEV_1 = JuMP.@variable(model, [1:T], Bin), JuMP.@variable(model, [1:T])
    JuMP.@constraint(model, bEV_1 .≤ bLent)
    JuMP.@constraint(model, (1 .- bLent)P_EV_1.m .≤ pLent)
    JuMP.@constraint(model, pLent .≤ (1 .- bLent)P_EV_1.M)
    JuMP.@constraint(model, (P_EV_1.m)bEV_1 .≤ pEV_1)
    JuMP.@constraint(model, pEV_1 .≤ (P_EV_1.M)bEV_1)
    JuMP.@constraint(model, sum(pEV_1) ≥ E_EV_1)
    return bEV_1, pEV_1, bLent, pLent
end;
function add_EV_2_module!(model, P_EV_2, E_EV_2, bLent, pLent)
    bEV_2, pEV_2 = JuMP.@variable(model, [1:T], Bin), JuMP.@variable(model, [1:T])
    JuMP.@constraint(model, bEV_2 .≤ bLent)
    JuMP.@constraint(model, (P_EV_2.m)bEV_2 .≤ pEV_2)
    JuMP.@constraint(model, pEV_2 .≤ (P_EV_2.M)bEV_2)
    JuMP.@constraint(model, sum(pEV_2 + pLent) ≥ E_EV_2)
    return bEV_2, pEV_2
end;
function add_self_circuit_breaker_module!(model, P_BUS, D, pU, pEV, pAC)
    pBus = JuMP.@variable(model, [1:T], lower_bound = 0, upper_bound = P_BUS)
    JuMP.@constraint(model, pBus .≥ D + pU + pEV + pAC) # No G | ES
    return pBus
end;
function add_circuit_breaker_pair_module!(model, P_BUS_1, P_BUS_2, p_ES, G, pLent, pEV_1, pU_1, pAC_1, D_1, pEV_2, pU_2, pAC_2, D_2)
    pBus_1 = JuMP.@variable(model, [1:T], lower_bound = -P_BUS_1, upper_bound = P_BUS_1)
    pBus_2 = JuMP.@variable(model, [1:T], lower_bound = 0, upper_bound = P_BUS_2)
    JuMP.@constraint(model, pBus_1 .== p_ES.c -p_ES.d -G + pLent + pEV_1 + pU_1 + pAC_1 + D_1)
    JuMP.@constraint(model, pBus_2 .≥ pEV_2 + pU_2 + pAC_2 + D_2)
    return pBus_1, pBus_2
end;
function get_a_paired_block(O)::NamedTuple
    model = get_simple_model() # for a block who has a lender and a borrower house
    # 6 lines
    G = rand(0:17, T)
    D_1 = rand(0:5, T)
    P_ES, E_ES = (c = rand(1:6), d = rand(1:6)), get_E_ES(19:55)
    U_1 = [rand(1:4, rand(2:5)) for _ = 1:rand(1:4)] # each entry is a cycle vector of an uninterruptible load 
    P_EV_1, E_EV_1 = (m = rand((1., 1.5)), M = rand(3:7)), rand(10:39)
    CND_1, INR_1, COP_1, Q_I_1, Q_BUS_1, OH_1, OΔ_1, P_AC_1 = gen_ac_data()
    # lender house
    pES, bES, eES = add_ES_module!(model, P_ES, E_ES)
    bU_1, pU_1 = add_U_module!(model, U_1)
    bEV_1, pEV_1, bLent, pLent = add_EV_1_module!(model, P_EV_1, E_EV_1)
    o_1, q_1, pAC_1 = add_AC_module!(model, O, CND_1, INR_1, COP_1, Q_I_1, 0, OH_1, OΔ_1, P_AC_1) # Q_BUS = 0
    # 4 lines
    D_2 = rand(0:5, T) # borrower house
    U_2 = [rand(1:4, rand(2:5)) for _ = 1:rand(1:4)] # each entry is a cycle vector of an uninterruptible load 
    P_EV_2, E_EV_2 = (m = rand((1., 1.5)), M = rand(3:7)), rand(10:39)
    CND_2, INR_2, COP_2, Q_I_2, Q_BUS_2, OH_2, OΔ_2, P_AC_2 = gen_ac_data()
    # borrower house
    bU_2, pU_2 = add_U_module!(model, U_2)
    bEV_2, pEV_2 = add_EV_2_module!(model, P_EV_2, E_EV_2, bLent, pLent)
    o_2, q_2, pAC_2 = add_AC_module!(model, O, CND_2, INR_2, COP_2, Q_I_2, 0, OH_2, OΔ_2, P_AC_2) # Q_BUS = 0
    # determine the circuit breaker limit
    pBus_2 = JuMP.@variable(model, [1:T], lower_bound = 0)
    temp_x = JuMP.@variable(model)
    temp_c = JuMP.@constraint(model, pBus_2 .== temp_x)
    JuMP.@constraint(model, pBus_2 .≥ pEV_2 + pU_2 + pAC_2 + D_2)
    JuMP.@objective(model, Min, temp_x)
    @lock insset_lock push!(insset, model)
    JuMP.optimize!(model)
    @lock insset_lock delete!(insset, model)
    ps, ts = JuMP.primal_status(model), JuMP.termination_status(model)
    if ps != JuMP.FEASIBLE_POINT || ts ∉ [JuMP.OPTIMAL, JuMP.INTERRUPTED]
        error(string(ps, ts))
    end
    temp_float64 = ı(temp_x)
    temp_float64 > 0 || error("common pBus_2 has value $temp_float64")
    P_BUS_2 = ceil(Int, temp_float64)
    JuMP.delete(model, temp_c)
    JuMP.delete(model, temp_x)
    JuMP.set_upper_bound.(pBus_2, P_BUS_2)
    temp_x = JuMP.@variable(model) # reuse the local name
    JuMP.@constraint(model, -temp_x .≤ pES.c -pES.d -G + pLent + pEV_1 + pU_1 + pAC_1 + D_1)
    JuMP.@constraint(model,  temp_x .≥ pES.c -pES.d -G + pLent + pEV_1 + pU_1 + pAC_1 + D_1)
    JuMP.@objective(model, Min, temp_x)
    @lock insset_lock push!(insset, model)
    JuMP.optimize!(model)
    @lock insset_lock delete!(insset, model)
    ps, ts = JuMP.primal_status(model), JuMP.termination_status(model)
    if ps != JuMP.FEASIBLE_POINT || ts ∉ [JuMP.OPTIMAL, JuMP.INTERRUPTED]
        error(string(ps, ts))
    end
    temp_float64 = ı(temp_x)
    temp_float64 > -1e-5 || error("pBus_1 has value $temp_float64")
    P_BUS_1 = max(1, ceil(Int, temp_float64))
    (;P_BUS_1, P_BUS_2, G, P_ES, E_ES, D_1, D_2, U_1, U_2, P_EV_1, P_EV_2,
    E_EV_1, E_EV_2, CND_1, CND_2, INR_1, INR_2, COP_1, COP_2, Q_I_1, Q_I_2,
    Q_BUS_1, Q_BUS_2, OH_1, OH_2, OΔ_1, OΔ_2, P_AC_1, P_AC_2)
end; 
function get_a_self_block(O)::NamedTuple
    D = rand(0:5, T) # base demand
    U = [rand(1:4, rand(2:5)) for _ = 1:rand(1:4)] # each entry is a cycle vector of an uninterruptible load 
    P_EV, E_EV = (m = rand((1., 1.5)), M = rand(3:7)), rand(10:39)
    CND, INR, COP, Q_I, Q_BUS, OH, OΔ, P_AC = gen_ac_data()
    P_BUS = pc_self_P_BUS(D, U, P_EV, E_EV, O, CND, INR, COP, Q_I, OH, OΔ, P_AC)
    (;P_BUS, D, U, P_EV, E_EV, CND, INR, COP, Q_I, Q_BUS, OH, OΔ, P_AC)
end;
function add_a_paired_block!(model, d::NamedTuple)::NamedTuple
    pES, bES, eES = add_ES_module!(model, d.P_ES, d.E_ES)
    bU_1, pU_1 = add_U_module!(model, d.U_1)
    bEV_1, pEV_1, bLent, pLent = add_EV_1_module!(model, d.P_EV_1, d.E_EV_1)
    o_1, q_1, pAC_1 = add_AC_module!(model, O, d.CND_1, d.INR_1, d.COP_1, d.Q_I_1, 0, d.OH_1, d.OΔ_1, d.P_AC_1) # Q_BUS = 0
    bU_2, pU_2 = add_U_module!(model, d.U_2)
    bEV_2, pEV_2 = add_EV_2_module!(model, d.P_EV_2, d.E_EV_2, bLent, pLent)
    o_2, q_2, pAC_2 = add_AC_module!(model, O, d.CND_2, d.INR_2, d.COP_2, d.Q_I_2, 0, d.OH_2, d.OΔ_2, d.P_AC_2) # Q_BUS = 0
    pBus_1, pBus_2 = add_circuit_breaker_pair_module!(model, d.P_BUS_1, d.P_BUS_2,
        pES, d.G, pLent, pEV_1, pU_1, pAC_1, d.D_1,
        pEV_2, pU_2, pAC_2, d.D_2)
    (;pBus_1, pBus_2, bLent, bES, bEV_1, bEV_2, bU_1, bU_2, q_1, q_2)
end;
function add_a_self_block!(model, d::NamedTuple)::NamedTuple
    bU, pU = add_U_module!(model, d.U)
    bEV, pEV = add_self_EV_module!(model, d.P_EV, d.E_EV)
    o, q, pAC = add_AC_module!(model, O, d.CND, d.INR, d.COP, d.Q_I, 0, d.OH, d.OΔ, d.P_AC) # Q_BUS = 0
    pBus = add_self_circuit_breaker_module!(model, d.P_BUS, d.D, pU, pEV, pAC)
    (;pBus, bEV, bU, q)
end;
function set_qˈs_ub!(D, X)
    for j = Rng1
        JuMP.set_upper_bound.(X[j].q_1, D[j].Q_BUS_1)
        JuMP.set_upper_bound.(X[j].q_2, D[j].Q_BUS_2)
    end
    for j = Rng2
        JuMP.set_upper_bound.(X[j].q, D[j].Q_BUS)
    end
end;
function get_Q_A(model)::Vector{Int} return map(x -> ceil(Int, ı(x)rand(0.25:0.05:0.95)), model[:qA]) end;
function get_E_Q(model)::Int return round(Int, rand(0.25:0.05:0.85)sum(ı.(model[:qA]))) end;
function initialize_out(an_UB)
    model = get_simple_model()
    JuMP.@variable(model, β[1:T] ≥ 0)
    JuMP.@constraint(model, sum(β) == 1) # ⚠️ special
    JuMP.@variable(model, θ[1:J])
    JuMP.@expression(model, out_obj_tbMax, sum(θ))
    JuMP.@objective(model, Max, out_obj_tbMax)
    JuMP.@constraint(model, out_obj_tbMax ≤ an_UB)
    return model, θ, β
end;
function bilin_expr(j, iˈı::Function, β) # pivot
    JuMP.@expression(model, sum(iˈı(p)b for (b, p) = zip(β,
        j ∈ Rng1 ? X[j].pBus_1 + X[j].pBus_2 : X[j].pBus
    )))
end;
function subproblemˈs_duty(j, ref, inbox)
    s = getfield(ref, :x)
    mj = inn[j]
    JuMP.@objective(mj, Min, bilin_expr(j, identity, s.β))
    setindex!(is_inn_solving_vec, true, j); JuMP.optimize!(mj); setindex!(is_inn_solving_vec, false, j)
    @lock tnt_lock push!(getfield(tnt, ifelse(j in Rng1, :pair, :self)), JuMP.solve_time(mj))
    ps, ts = JuMP.primal_status(mj), JuMP.termination_status(mj)
    lens = if ps != JuMP.FEASIBLE_POINT || ts ∉ [JuMP.OPTIMAL, JuMP.INTERRUPTED]
        function(s); j, (ps, ts), missing, missing end # to be thrown
    else
        con = JuMP.@build_constraint(θ[j] ≤ bilin_expr(j, ı, β)) # always a VI
        can_cut_off = function(s)
            vio_degree = s.θ[j] - bilin_expr(j, ı, s.β)
            vio_degree > COT, vio_degree
        end
        ver = let x::NamedTuple = X[j]
            j in Rng1 ? (
                bES = get_Bool_value(x.bES),
                bLent = get_Bool_value(x.bLent),
                bEV_1 = get_Bool_value(x.bEV_1),
                bEV_2 = get_Bool_value(x.bEV_2),
                bU_1 = get_Bool_value(x.bU_1),
                bU_2 = get_Bool_value(x.bU_2),
                pBus = mapreduce(v -> map(ı, v), +, (x.pBus_1, x.pBus_2))
            ) : (
                bEV = get_Bool_value(x.bEV),
                bU = get_Bool_value(x.bU),
                pBus = map(ı, x.pBus)
            )
        end
        function(s); j, can_cut_off(s), con, ver end
    end
    @lock inbox_lock push!(inbox, lens)
end;
function wait_until_all_started(taskvec)
    in_pool = Set(eachindex(taskvec))
    while true
        isempty(in_pool) && return
        progress = false
        for j = in_pool
            if istaskstarted(taskvec[j])
                pop!(in_pool, j)
                progress = true
                break
            end
        end
        progress && continue
        yield()
    end
end;
function shot!(timestamp)
    JuMP.optimize!(model)
    JuMP.termination_status(model) == JuMP.OPTIMAL || error("$(JuMP.termination_status(model))")
    @lock tnt_lock push!(tnt.master, JuMP.solve_time(model))
    snap = (t = timestamp += 1, θ = ı.(θ), β = ı.(β), ub = JuMP.objective_bound(model))
    timestamp, snap
end;
function warm_up()
    inbox, js_remains = Function[], Set(1:J)
    _, snap = shot!(0)
    ref = Ref(snap)
    tasks = map(j -> Threads.@spawn(subproblemˈs_duty(j, ref, inbox)), 1:J)
    wait_until_all_started(tasks)
    t0 = time()
    while true
        if isempty(inbox)
            if time() - t0 > 5
                jStuck = first(js_remains)
                if is_inn_solving_vec[jStuck]
                    Gurobi.GRBterminate(JuMP.backend(inn[jStuck]))
                    printstyled("\nWarning: manually interrupting inn[$jStuck]..."; color = :yellow)
                    while is_inn_solving_vec[jStuck] yield() end
                    printstyled("\rWarning: manually interrupt inn[$jStuck]: Success\n"; color = :yellow)
                    t0 = time()
                end
            end
            yield()
            continue
        end
        while !isempty(inbox)
            lens = @lock inbox_lock pop!(inbox)
            j, (p, t), con, ver = lens(snap)
            ismissing(con) && error("Subproblem $j terminates with $t and PrimalStatus $p")
            JuMP.add_constraint(model, con), push!(VCG[j], ver)
            pop!(js_remains, j)
            print("\r rest = $(length(js_remains)), j = $j")
            isempty(js_remains) && return foreach(wait, tasks)
        end
        t0 = time()
    end
end;
function masterˈs_loop(ref, timestamp, inbox)
    v, i = fill(0, J), 0
    snap = getfield(ref, :x)
    foreach(empty!, tnt)
    t0_0 = t0 = time()
    while true
        if isempty(inbox) # no event happens
            dt = time() - t0
            dt > 2700 && error("masterˈs_loop have run $dt seconds")
            if time() - t0_0 > 5
                jStuck = findfirst(j -> j ∉ view(v, 1:i), 1:J)
                if is_inn_solving_vec[jStuck]
                    Gurobi.GRBterminate(JuMP.backend(inn[jStuck]))
                    printstyled("\nWarning: manually interrupting inn[$jStuck]..."; color = :yellow)
                    while is_inn_solving_vec[jStuck] yield() end
                    printstyled("\rWarning: manually interrupt inn[$jStuck]: Success\n"; color = :yellow)
                    t0_0 = time()
                end
            end
            yield()
            continue
        end # ∀ j, we have j ∈ buffer ∪ inbox ∪ js_at_large
        up = false
        while true
            lens = @lock inbox_lock pop!(inbox)
            j, (cut_off_by_j, ø), con, ver = lens(snap)
            ismissing(con) && error("Subproblem $j terminates with $ø and PrimalStatus $cut_off_by_j")
            if cut_off_by_j
                _, _, up = JuMP.add_constraint(model, con), push!(VCG[j], ver), true
                print("\r▶ ub = $(snap.ub) | t = $(snap.t), j = $j | vio = $ø | $(ceil(Int, time() - t0)) sec")
            end
            v[i+=1] = j # write the buffer
            isempty(inbox) && break
            (up && i == LOGIS) && break # early break to update master model, i.e. snap
        end
        if up
            timestamp, snap = shot!(timestamp)
            setfield!(ref, :x, snap)
            for j = view(v, 1:i) Threads.@spawn subproblemˈs_duty(j, ref, inbox) end
            i, t0_0 = 0, time()
        elseif i == J
            printstyled("▶▶▶ No more progress can be made, quit!\n"; color = :cyan)
            return
        end
    end
end;
function masterˈs_algorithm()
    t0 = time()
    timestamp, inbox = -1, Function[]
    timestamp, snap = shot!(timestamp)
    ref = Ref(snap)
    map(j -> Threads.@spawn(subproblemˈs_duty(j, ref, inbox)), 1:J) |> wait_until_all_started
    Threads.@spawn(masterˈs_loop(ref, timestamp, inbox)) |> wait
    printstyled("masterˈs_algorithm (Dual Opt) time = $(time()-t0)\n"; color = :cyan)
end;
function fill_model_D_X!(v::Vector, X)
    z = Threads.Atomic{Int}(J)
    f = function(j)
        p = j in Rng1
        X[j] = ifelse(p, add_a_paired_block!, add_a_self_block!)(
            v[j],
            ifelse(p, get_a_paired_block, get_a_self_block)(O)
        )
        Threads.atomic_sub!(z, 1)
        print("\rrest = $(z.value), j = $j")
    end
    tasks, js_remains = map(j -> Threads.@spawn(f(j)), 1:J), Set(1:J)
    t0 = time()
    while !isempty(js_remains)
        progress_j = 0
        for j = js_remains
            istaskdone(tasks[j]) && (progress_j = j; break)
        end
        if progress_j > 0
            pop!(js_remains, progress_j)
            t0 = time()
        elseif time() - t0 > 15
            foreach(Gurobi.GRBterminate ∘ JuMP.backend, insset)
            printstyled("\nWarning: $(time()) terminate all solves\n"; color = :yellow)
            @lock insset_lock empty!(insset)
            t0 = time()
        else
            yield()
        end
    end
    return foreach(wait, tasks)
end;
function get_prob_decision(model, v::Vector{NamedTuple})
    x = map(_ -> JuMP.@variable(model, lower_bound = 0), v)
    JuMP.@constraint(model, sum(x) == 1)
    x
end;
function build_prm!(model)
    JuMP.unset_silent(model)
    l = map(v -> get_prob_decision(model, v), VCG)
    Y = [j in Rng1 ? (
        bES = @get_int_decision(model, sum((t.bES)l for (l, t) = zip(l[j], VCG[j]))),
        bU_1 = @get_int_decision(model, sum((t.bU_1)l for (l, t) = zip(l[j], VCG[j]))),
        bU_2 = @get_int_decision(model, sum((t.bU_2)l for (l, t) = zip(l[j], VCG[j]))),
        bEV_1 = @get_int_decision(model, sum((t.bEV_1)l for (l, t) = zip(l[j], VCG[j]))),
        bEV_2 = @get_int_decision(model, sum((t.bEV_2)l for (l, t) = zip(l[j], VCG[j]))),
        bLent = @get_int_decision(model, sum((t.bLent)l for (l, t) = zip(l[j], VCG[j]))),
        pBus = JuMP.@expression(model, sum((t.pBus)l for (l, t) = zip(l[j], VCG[j])))
    ) : (
        bU = @get_int_decision(model, sum((t.bU)l for (l, t) = zip(l[j], VCG[j]))),
        bEV = @get_int_decision(model, sum((t.bEV)l for (l, t) = zip(l[j], VCG[j]))),
        pBus = JuMP.@expression(model, sum((t.pBus)l for (l, t) = zip(l[j], VCG[j])))
    ) for j = 1:J]
    JuMP.@variable(model, e)
    JuMP.@constraint(model, e .≥ sum(t.pBus for t = Y))
    JuMP.@objective(model, Min, e)
    return Y
end;


Random.seed!(my_seed);

const J = (K)LOGIS;
const T = 24;
const (Rng1, Rng2) = get_pair_and_self_Rng(J);
const (C, O) = get_C_and_O(); # price and Celsius vector
const X = Vector{NamedTuple}(undef, J);
const COT = 0.5/J;
const inbox_lock = Base.ReentrantLock();
const is_inn_solving_vec = falses(J);
const inn = [get_simple_model() for j = 1:J];
const VCG = [NamedTuple[] for _ = 1:J]; # collect the Vertices found in the CG algorithm
const model, θ, β = initialize_out(30J); # ⚠️⚠️⚠️
const insset = Set{JuMP.Model}()
const insset_lock = Base.ReentrantLock();
fill_model_D_X!(inn, X)

const tnt_lock = Base.ReentrantLock();
const tnt = (master = Float64[], pair = Float64[], self = Float64[]) # wordload distribution analysis
warm_up()

masterˈs_algorithm()
_, snap = shot!(0) 

map(maximum, tnt)
map(Statistics.mean, tnt)
map(Statistics.median, tnt)
map(sum, tnt)

const prm = get_simple_model();
JuMP.set_attribute(prm, "Threads", 8)
const Y = build_prm!(prm);
JuMP.optimize!(prm)
printstyled("Primal Opt time = $(JuMP.solve_time(prm))"; color = :cyan)
JuMP.termination_status(prm) == JuMP.OPTIMAL || error("fails");
JuMP.objective_value(prm)

karei · October 19, 2025, 6:35am

Did you try MKL rather than openBLAS on Intel CPU?

WalterMadelim · October 19, 2025, 8:29am

I have no custom settings.

I’m inexperienced about BLAS. But as far as I know they are packages related to linearalgebra computations. In my program the main (I guess almost all) numeric computation workload are undertaken by the Gurobi solver. It seems that the Gurobi solver has its own implementation about these low-level math operations? If we compare the doc of Ipopt.jl and Gurobi.jl, we can see that the MKL is mentioned by the former but not the latter.

Some additional info:

I found that during the computation the intel CPUs are relatively lazy so they are not actively engaged during the async algorithm.
Memory resources are sufficient for both platform.
The tests were done without other irrelevant tasks’ interference.

sgaure · October 19, 2025, 8:58pm

There was a recent thread about something similar. I have not used windows, so don’t know anything about whether it’s related.

WalterMadelim · October 19, 2025, 11:37pm

I do launch them both from VScode terminals. But I didn’t run other tasks, so only the tested code are run.

By running this program

import Base.Threads.@spawn
function fib(n::Int)
    if n < 2
        return n
    end
    t = @spawn fib(n - 2)
    r = fib(n - 1) + fetch(t)
    print("\r   $t, $r    ")
    return r
end
fib(257)

All cores can be engaged 100% completely, on both platforms. (if not sufficient, start more than one terminals and run them in parallel, until observing 100% cpu occupation).

WalterMadelim · October 23, 2025, 9:38am

I think I’ve to some extent found the cause:

The Linux server is totally normal: the cores can be engaged in calculations.

The windows machine can perform ordinary program in parallel, but when it comes to Gurobi MILP solve, the multicore function was not activated out of some unknown reason.

I think it would be better for user to do some normality check before performing parallel computations.

Here are some me-invented simple but useful normality tests:

For ordinary computing:

function busy(v)
    for i = 1:typemax(Int)
        circshift!(v, 7)
        reverse!(v)
    end
end;

foreach(1:78) do _
    Threads.@spawn busy(collect(1:1000))
end

suppose we have 104 logical processors, then the occupation should be at 75%.

For Gurobi’s MILP computation:

import JuMP, Gurobi, SparseArrays
N = 100;
Q = 1. * SparseArrays.dropzeros(
    SparseArrays.spdiagm(
        [i => rand(-9:9, N - i) for i in 0:div(N, 2)]...
    )
);
const GRB_ENV = Gurobi.Env();

L = 1 * 256 ÷ 4
v = map(1:L) do _
    m = JuMP.direct_model(Gurobi.Optimizer(GRB_ENV));
    JuMP.set_silent(m)
    JuMP.set_attribute(m, "threads", 1)
    JuMP.@variable(m, rand(-7:-3) <= y[1:N] <= rand(3:7));
    JuMP.@variable(m, rand(-9:-5) <= x[1:N] <= rand(5:9));
    JuMP.@objective(m, Min, (x'Q)y);
    m
end;
t = map(j -> Threads.@spawn(JuMP.optimize!(v[j])), 1:L)

suppose we have 256 logical processors, then the occupation should be at 25%.

For my linux machine, these two tests passed. So it is completely normal.

But for my win machine, the Gurobi MILP tests completely fails (at a very low occupation rate, if you set L to be full).

Topic		Replies	Views
Huge performance fluctuations in parallel benchmark: insights? Performance parallel , multithreading , benchmarktools	52	2769	December 1, 2021
Help wanted: benchmarking multi-threaded CPU performance Offtopic hardware	20	1046	May 13, 2024
Multithreaded code on beefy computer runs just as fast as serial code on M1 Mac Performance performance , parallel , differentialequation	41	3891	February 2, 2022
Decrease in performance using Threads.@threads in Linux Julia at Scale	16	2058	July 23, 2019
Disappointing benchmark results with AMD Threadripper PRO 3975WX 32 Cores Performance performance , multithreading	10	921	June 6, 2023

The same async program run on (Linux+AMD) vs (Win+Intel), over 6x performance difference

Related topics