Optimizing performance

I try to optimize the performance of the following function:

function iterateOPs!(::IterateOPs_basic, wf::WindFarm, sim::Sim, floris::Floris, floridyn::FloriDyn)
    # Save turbine OPs
    tmpOPStates = copy(wf.States_OP[wf.StartI, :])
    tmpTStates  = copy(wf.States_T[wf.StartI, :])
    tmpWFSTates = copy(wf.States_WF[wf.StartI, :])

    # Shift states
    # Downwind step
    step_dw = sim.time_step .* wf.States_WF[:, 1] .* sim.dyn.advection
    wf.States_OP[:, 4] .+= step_dw

    # Crosswind step
    deflection = centerline(wf.States_OP, wf.States_T, wf.States_WF, floris, wf.D[1])
    step_cw = deflection .- wf.States_OP[:, 5:6]
    wf.States_OP[:, 5:6] .= deflection

    # World coordinate system adjustment
    phiW = angSOWFA2world.(wf.States_WF[:, 2])
    wf.States_OP[:, 1] .+= cos.(phiW) .* step_dw .- sin.(phiW) .* step_cw[:, 1]
    wf.States_OP[:, 2] .+= sin.(phiW) .* step_dw .+ cos.(phiW) .* step_cw[:, 1]
    wf.States_OP[:, 3] .+= step_cw[:, 2]

    # Circshift & init first OPs
    # OPs
    wf.States_OP = circshift(wf.States_OP, (1, 0))
    wf.States_OP[wf.StartI, :] = tmpOPStates

    # Turbines
    wf.States_T = circshift(wf.States_T, (1, 0))
    wf.States_T[wf.StartI, :] = tmpTStates

    # Wind Farm
    wf.States_WF = circshift(wf.States_WF, (1, 0))
    wf.States_WF[wf.StartI, :] = tmpWFSTates

    # Check if OPs are in order
    for iT in 1:wf.nT
        inds = wf.StartI[iT]:(wf.StartI[iT] + wf.nOP - 1)

        indOP = sortperm(wf.States_OP[inds, 4])
        if indOP != sort(indOP)  # check if already sorted
           wf.States_OP[inds, :] = wf.States_OP[inds[indOP], :]
           wf.States_T[inds, :]  = wf.States_T[inds[indOP], :]
           wf.States_WF[inds, :] = wf.States_WF[inds[indOP], :]
        end

    end
    return nothing
end

Any suggestions?

If you want to benchmark it, I suggest the following steps:

mkdir tmp
cd tmp
git clone https://github.com/ufechner7/FLORIDyn.jl
cd FLORIDyn.jl
julia --project

And then in Julia:

using Pkg
Pkg.instantiate()
include("test/bench_iterateops.jl")

On a Ryzen 7950X I get 149 µs.

While this is not bad, it is already 2.5 times faster than Matlab, I want more speed. This is an example of a simulation of a wind farm with 9 turbines, but in reality I need to simulate 100 turbines, and the execution time scales with the square of the number of turbines. And in the end I need to run optimizations, and for one optimization I probably need to run 1000 simulations or more.

In the end I want to optimize the performance of the function runFLORIDyn(set, wf, wind, sim, con, floridyn, floris). The function above is the sub-function with the highest execution time.

runFLORIDyn currently needs about 0.16 s to execute. But it also has way too high memory allocations:

include("examples/main.jl")
 (1.40 M allocations: 617.218 MiB, 20.38% gc time)
1 Like
wf.States_OP[wf.StartI, :]
wf.States_OP[inds, :] = 

etc.

Julia (and MATLAB) is column major, you want to slice on the first dimension. See Performance Tips · The Julia Language.

3 Likes

wf.States_OP = circshift(wf.States_OP, (1, 0)) maybe that can be a circshift! call to avoid some allocations?

also a lot of these places you probably want more views, right? remember stuff like

    wf.States_OP[:, 1] .+= cos.(phiW) .* step_dw .- sin.(phiW) .* step_cw[:, 1]
    wf.States_OP[:, 2] .+= sin.(phiW) .* step_dw .+ cos.(phiW) .* step_cw[:, 1]
    wf.States_OP[:, 3] .+= step_cw[:, 2]

will allocate on the getindex calls without an @views or view(x, ...)

2 Likes

Also, I suggest putting up a simple script that can be copy pasted and run without any dependencies or git commands etc. It will increase your likelihood to get help with approximately 732%.

4 Likes

Well, using views brought down the number of allocations from 180 to 162:

function iterateOPs!(::IterateOPs_basic, wf::WindFarm, sim::Sim, floris::Floris, floridyn::FloriDyn)
    # Save turbine OPs
    tmpOPStates = copy(wf.States_OP[wf.StartI, :])
    tmpTStates  = copy(wf.States_T[wf.StartI, :])
    tmpWFSTates = copy(wf.States_WF[wf.StartI, :])

    # Shift states
    # Downwind step
    step_dw = sim.time_step .* sim.dyn.advection .* @view wf.States_WF[:, 1] 
    wf.States_OP[:, 4] .+= step_dw

    # Crosswind step
    deflection = centerline(wf.States_OP, wf.States_T, wf.States_WF, floris, wf.D[1])
    step_cw = deflection .- @view wf.States_OP[:, 5:6]
    wf.States_OP[:, 5:6] .= deflection

    # World coordinate system adjustment
    phiW = angSOWFA2world.(@view wf.States_WF[:, 2])
    wf.States_OP[:, 1] .+= cos.(phiW) .* step_dw .- sin.(phiW) .* @view step_cw[:, 1]
    wf.States_OP[:, 2] .+= sin.(phiW) .* step_dw .+ cos.(phiW) .* @view step_cw[:, 1]
    wf.States_OP[:, 3] .+= @view step_cw[:, 2]

    # Circshift & init first OPs
    # OPs
    wf.States_OP = circshift(wf.States_OP, (1, 0))
    wf.States_OP[wf.StartI, :] = tmpOPStates

    # Turbines
    wf.States_T = circshift(wf.States_T, (1, 0))
    wf.States_T[wf.StartI, :] = tmpTStates

    # Wind Farm
    wf.States_WF = circshift(wf.States_WF, (1, 0))
    wf.States_WF[wf.StartI, :] = tmpWFSTates

    # Check if OPs are in order
    for iT in 1:wf.nT
        inds = wf.StartI[iT]:(wf.StartI[iT] + wf.nOP - 1)

        indOP = sortperm(wf.States_OP[inds, 4])
        if indOP != sort(indOP)  # check if already sorted
           wf.States_OP[inds, :] = wf.States_OP[inds[indOP], :]
           wf.States_T[inds, :]  = wf.States_T[inds[indOP], :]
           wf.States_WF[inds, :] = wf.States_WF[inds[indOP], :]
        end

    end
    return nothing
end

Not really doable, because of the setup needed for the input etc.

But it is not clear to me how to modify the code such that I can slice on the first dimension. This function is part of a large code base, and if I change the definition of these arrays I probably have to change 50 functions, which is not really realistic.

Assuming that the loops are much more expensive than everything else, paying a one-time cost of transposing the arrays is probably negligible in comparison to the speed gained by having faster loops. But only measurements will tell you that.

Now down to 144 allocations by using views in the function centerline():

function centerline(states_op, states_t, states_wf, floris, d_rotor)
    # Parameters
    k_a   = floris.k_a
    k_b   = floris.k_b
    alpha = floris.alpha
    beta  = floris.beta

    # States
    a = @view states_t[:,1]
    b = @view states_t[:,2]
    C_T   = calcCt(a, b)
    yaw   = .-deg2rad.(b)
    c = @view states_t[:,3]
    d = @view states_wf[:,3]
    I     = sqrt.(c.^2 .+ d.^2)
    OPdw  = @view states_op[:,4]

    # Calc x_0 (Core length)
    x_0 = (cos.(yaw) .* (1 .+ sqrt.(1 .- C_T)) ./ 
         (sqrt(2) .* (alpha .* I .+ beta .* (1 .- sqrt.(1 .- C_T))))) .* d_rotor

    # Calc k_z and k_y based on I
    k_y = k_a .* I .+ k_b
    k_z = k_y

    # Get field width y
    zs = zeros(size(OPdw))
    sig_y = max.(OPdw .- x_0, zs) .* k_y .+
        min.(OPdw ./ x_0, zs .+ 1) .* cos.(yaw) .* d_rotor ./ sqrt(8)

    # Get field width z
    sig_z = max.(OPdw .- x_0, zs) .* k_z .+
        min.(OPdw ./ x_0, zs .+ 1) .* d_rotor ./ sqrt(8)

    # Calc Theta
    Theta = 0.3 .* yaw ./ cos.(yaw) .* (1 .- sqrt.(1 .- C_T .* cos.(yaw)))

    # Calc Delta/Deflection
    delta_nfw = Theta .* min.(OPdw, x_0)

    delta_fw_1 = Theta ./ 14.7 .* sqrt.(cos.(yaw) ./ (k_y .* k_z .* C_T)) .* (2.9 .+ 1.3 .* sqrt.(1 .- C_T) .- C_T)
    delta_fw_2 = log.(Complex.(
        (1.6 .+ sqrt.(C_T)) .* 
        (1.6 .* sqrt.((8 .* sig_y .* sig_z) ./ (d_rotor^2 .* cos.(yaw))) .- sqrt.(C_T)) ./
        ((1.6 .- sqrt.(C_T)) .*
        (1.6 .* sqrt.((8 .* sig_y .* sig_z) ./ (d_rotor^2 .* cos.(yaw))) .+ sqrt.(C_T)))
    ))
    # Use signbit and broadcasting for the sign/(2) + 0.5 logic
    factor = (sign.(OPdw .- x_0) ./ 2 .+ 0.5)

    deltaY = delta_nfw .+ factor .* delta_fw_1 .* delta_fw_2 .* d_rotor
    
    # Deflection in y and z direction
    delta = hcat(deltaY, zeros(size(deltaY)))
    
    return delta
end
1 Like

I think there is is function issorted that doessn’t allocate.

5 Likes

We could do a lot of things but after quick tests, most of the time is probably taken by the circ shifting, you should try to implement your own that shift all 3 arrays at the same time with as less allocs as possible. here is the test I did, it’s a little different from yours obviously but it will allow people to test here

f(x) = x^2

function iterateOPs!(A,B,C,ts,da,sI,nop,nT)
    # Save turbine OPs
    tmpOPStates =  A[sI, :]
    tmpTStates  =  B[sI, :]
    tmpWFSTates =  C[sI, :]
    # Shift states
    # Downwind step
    step_dw = ts .* @view(C[:, 1]) .* da
    @views A[:, 4] .+= step_dw
    # Crosswind step
    deflection = 1.0
    step_cw = deflection .- @view(A[:, 5:6])
    @views A[:, 5:6] .= deflection

    # World coordinate system adjustment
    for i in axes(A, 1)
        phiWi = f(C[i, 2])
        cphiWi = cos(phiWi)
        step_dwi = step_dw[i]
        A[i, 1] += cphiWi * step_dwi - sin(phiWi) * step_cw[i, 2]
        A[i, 2] += cphiWi  *step_cw[i, 1] + sin(phiWi) * step_dwi
        A[i, 3] += step_cw[i,2]
    end
    # Circshift & init first OPs
    # OPs
    A = circshift(A, (1, 0))
    copyto!(@view(A[sI,:]),tmpOPStates)
    # Turbines
    B = circshift(B, (1, 0))
    copyto!(@view(B[sI,:]),tmpTStates)
    # Wind Farm
    C = circshift(C, (1, 0))
    copyto!(@view(C[sI,:]),tmpWFSTates)
    indOPBUF = zeros(Int,size(A,1))
    # Check if OPs are in order
    for iT in 1:nT
        inds = sI[iT]:(sI[iT] + nop - 1)
        indOP = @view(indOPBUF[inds])
        sortperm!(indOP,@view A[inds, 4])
        if !issorted(indOP)  # check if already sorted
            ID = inds[indOP]
            for j in axes(A,2)
                for i in inds
                    A[i,j] = A[ID, j]
                    B[i,j] = B[ID, j]
                    C[i,j] = C[ID, j] 
                end
            end
        end

    end
    return nothing
end

A = zeros(100,100)
B = zeros(100,100)
C = zeros(100,100)
ts = 0.1
da = 0.01
sI = rand(1:100, 10)  # Random indices for turbines
nop = 1
nT = length(sI)
using BenchmarkTools
@btime iterateOPs!($A, $B, $C, $ts, $da, $sI, $nop, $nT)

I did most perf boost I could though like fusing your 3 broadcast on A( wf.States_OP) , add views everywhere needed, use issorted, buffer the sortperm, use O(1) copyto! ect ect

this should be done with 3 allocs in a single function :

    A = circshift(A, (1, 0))
    copyto!(@view(A[sI,:]),tmpOPStates)
    # Turbines
    B = circshift(B, (1, 0))
    copyto!(@view(B[sI,:]),tmpTStates)
    # Wind Farm
    C = circshift(C, (1, 0))
    copyto!(@view(C[sI,:]),tmpWFSTates)

chatgpt take :

f(x) = x^2

@views function fused_shift_copy!(A, B, C,tmpOPStates, tmpTStates, tmpWFSTates,sI)
    n, _ = size(A)
    @inbounds begin
        for i = n:-1:2
            copyto!(A[i, :], A[i-1, :])
            copyto!(B[i, :], B[i-1, :])
            copyto!(C[i, :], C[i-1, :])
        end
        copyto!(A[1, :], A[n, :])
        copyto!(B[1, :], B[n, :])
        copyto!(C[1, :], C[n, :])
        for (k, idx) in enumerate(sI)
            copyto!(A[idx, :], tmpOPStates[k, :])
            copyto!(B[idx, :], tmpTStates[k, :])
            copyto!(C[idx, :], tmpWFSTates[k, :])
        end
    end
    return nothing
end


function iterateOPs!(A,B,C,ts,da,sI,nop,nT)
    # Save turbine OPs
    tmpOPStates =  A[sI, :]
    tmpTStates  =  B[sI, :]
    tmpWFSTates =  C[sI, :]
    # Shift states
    # Downwind step
    step_dw = ts .* @view(C[:, 1]) .* da
    @views A[:, 4] .+= step_dw
    # Crosswind step
    deflection = 1.0
    step_cw = deflection .- @view(A[:, 5:6])
    @views A[:, 5:6] .= deflection

    # World coordinate system adjustment
    @inbounds for i in axes(A, 1)
        phiWi = f(C[i, 2])
        cphiWi = cos(phiWi)
        step_dwi = step_dw[i]
        A[i, 1] += cphiWi * step_dwi - sin(phiWi) * step_cw[i, 2]
        A[i, 2] += cphiWi  *step_cw[i, 1] + sin(phiWi) * step_dwi
        A[i, 3] += step_cw[i,2]
    end
    # Circshift & init first OPs
    # OPs
    fused_shift_copy!(A, B, C,tmpOPStates, tmpTStates, tmpWFSTates,sI)
    indOPBUF = zeros(Int,size(A,1))
    # Check if OPs are in order
    for iT in 1:nT
        inds = sI[iT]:(sI[iT] + nop - 1)
        indOP = @view(indOPBUF[inds])
        sortperm!(indOP,@view A[inds, 4])
        if !issorted(indOP)  # check if already sorted
            ID = @view inds[indOP]
            for j in axes(A,2)
                @inbounds for i in inds
                    A[i,j] = A[ID, j]
                    B[i,j] = B[ID, j]
                    C[i,j] = C[ID, j] 
                end
            end
        end

    end
    return nothing
end

A = zeros(100,100)
B = zeros(100,100)
C = zeros(100,100)
ts = 0.1
da = 0.01
sI = rand(1:100, 10)  # Random indices for turbines
nop = 1
nT = length(sI)
using BenchmarkTools
@btime iterateOPs!($A, $B, $C, $ts, $da, $sI, $nop, $nT)

now down to 14 allocs (buffer ones) and quite fast

1 Like

This helped to bring down the allocations to 128 (from 180 in the beginning).

1 Like

You need to use views far more widely, you are still missing most places.

Just as an example, the following line needs @views

I don’t know what type is in here, exactly, but slices often make copies, so this probably makes a copy of a copy:

In fact, probably just wrap the whole function in a @views macro.

Can you try with this code :

@views function iterateOPs!(::IterateOPs_basic, wf::WindFarm, sim::Sim, floris::Floris, floridyn::FloriDyn)
    # Unpack start indices and counts
    sI, nT, nop = wf.StartI, wf.nT, wf.nOP
    # Save turbine OPs
    tmpOP = copy(wf.States_OP[sI, :])
    tmpT  = copy(wf.States_T[sI, :])
    tmpWF = copy(wf.States_WF[sI, :])

    # Pre-calc downwind shift
    step_dw = sim.time_step .* wf.States_WF[:, 1] .* sim.dyn.advection
    wf.States_OP[:, 4] .+= step_dw

    # Crosswind shift
    deflection = centerline(wf.States_OP, wf.States_T, wf.States_WF, floris, wf.D[1])
    step_cw = deflection .- wf.States_OP[:, 5:6]
    wf.States_OP[:, 5:6] .= deflection

    # World-coord adjustment
    n = size(wf.States_OP, 1)
    @inbounds for i in 1:n
        phi = angSOWFA2world(wf.States_WF[i, 2])
        cphi, sphi = cos(phi), sin(phi)
        dw = step_dw[i]
        cw1, cw2 = step_cw[i, 1], step_cw[i, 2]
        wf.States_OP[i, 1] += cphi * dw - sphi * cw1
        wf.States_OP[i, 2] += sphi * dw + cphi * cw1
        wf.States_OP[i, 3] += cw2
    end

    # Fused circshift of all state arrays
    A, B, C = wf.States_OP, wf.States_T, wf.States_WF
    @inbounds for i in n:-1:2
        for j in axes(A, 2)
            A[i, j] = A[i-1, j]
            B[i, j] = B[i-1, j]
            C[i, j] = C[i-1, j]
        end
    end
    for j in axes(A, 2)
        A[1, j] = A[n, j]
        B[1, j] = B[n, j]
        C[1, j] = C[n, j]
    end

    # Re-initialize head-of-line entries
    for (k, idx) in enumerate(sI)
        for j in axes(A, 2)
            A[idx, j] = tmpOP[k, j]
            B[idx, j] = tmpT[k, j]
            C[idx, j] = tmpWF[k, j]
        end
    end

    # Check and enforce ordering per turbine
    indbuf = similar(wf.States_OP[:, 4], Int)
    for t in 1:nT
        inds = sI[t]:(sI[t] + nop - 1)
        indOP = indbuf[inds]
        sortperm!(indOP, wf.States_OP[inds, 4])
        if !issorted(indOP)
            IDs = inds[indOP]
            for (k, i) in enumerate(inds)
                @inbounds for j in axes(A, 2)
                    A[i, j] = A[IDs[k], j]
                    B[i, j] = B[IDs[k], j]
                    C[i, j] = C[IDs[k], j]
                end
            end
        end
    end

    return nothing
end

I’m sorry if some things are incorect. I supposed wf.States_OP, wf.States_T and wf.States_WF had the same shape, if they don’t things may change a bit

edit : Can also change this

    tmpOP = copy(wf.States_OP[sI, :])
    tmpT  = copy(wf.States_T[sI, :])
    tmpWF = copy(wf.States_WF[sI, :])

to this :

    tmp = Array{Float64,3}(undef,length(sI),size(wf.States_OP,2),3)
    tmpOP = tmp[:,:,1]
    tmpT = tmp[:,:,2]
    tmpWF = tmp[:,:,3]
    tmpOP .= wf.States_OP[sI, :]
    tmpT  .= wf.States_T[sI, :]
    tmpWF .= wf.States_WF[sI, :]

with @views activated obviously, its cheating a bit since we take down allocs but actually its the same size in memory just better alligned all this lead to only 5 allocs at the end without counting the centerline allocs which you could write :

function centerline(states_op, states_t, states_wf, floris, d_rotor)
    # Unpack parameters
    k_a, k_b = floris.k_a, floris.k_b
    α, β     = floris.alpha,  floris.beta
    n        = size(states_op, 1)
    inv_sqrt2 = 1 / sqrt(2)

    # Allocate output
    delta = Matrix{Float64}(undef, n, 2)

    @inbounds for i in 1:n
        # Load state values
        a_i    = states_t[i, 1]
        b_i    = states_t[i, 2]
        c_i    = states_t[i, 3]
        d_i    = states_wf[i, 3]
        op_i   = states_op[i, 4]

        # Compute coefficients
        C_T      = calcCt(a_i, b_i)
        yaw      = -deg2rad(b_i)
        cos_yaw  = cos(yaw)
        sin_yaw  = sin(yaw)
        I_val    = sqrt(c_i*c_i + d_i*d_i)
        sqrt_T   = sqrt(1 - C_T)

        # Core length
        denom    = α*I_val + β*(1 - sqrt_T)
        x0       = cos_yaw*(1 + sqrt_T)*inv_sqrt2/denom * d_rotor

        # Wake widths
        dy       = op_i - x0
        mask1    = dy > 0 ? dy : 0.0
        mask2    = op_i/x0 < 1 ? op_i/x0 : 1.0
        ky       = k_a*I_val + k_b
        kz       = ky
        base     = d_rotor*inv_sqrt2
        sig_y    = mask1*ky + mask2*cos_yaw*base
        sig_z    = mask1*kz + mask2*base

        # Deflection angles
        Theta    = 0.3*yaw/cos_yaw*(1 - sqrt(1 - C_T*cos_yaw))
        delta_nf = Theta * (op_i < x0 ? op_i : x0)

        # Far-wake components
        denom_fw = ky*kz*C_T
        fw1      = Theta/14.7*sqrt(cos_yaw/denom_fw)*(2.9 + 1.3*sqrt_T - C_T)
        tmp      = 8*sig_y*sig_z/(d_rotor*d_rotor*cos_yaw)
        sq       = sqrt(tmp)
        num      = (1.6 + sqrt(C_T))*(1.6*sq - sqrt(C_T))
        den      = (1.6 - sqrt(C_T))*(1.6*sq + sqrt(C_T))
        fw2      = log(num/den)

        # Blend and finalize
        factor   = (dy < 0 ? 0.0 : 1.0)
        δY       = delta_nf + factor*fw1*fw2*d_rotor

        # Store result
        delta[i, 1] = δY
        delta[i, 2] = 0.0
    end

    return delta
end

1 Like

Why not just put @views in front of the function definition? It always surprises me to see people putting individual @view calls on a dozen lines.

1 Like

for those you actually want copy by slicing :

tmpOP = wf.States_OP[sI, :]
    tmpT  = wf.States_T[sI, :]
    tmpWF = wf.States_WF[sI, :]

You can opt-out of @views if you want copies on a few lines by putting an explicit copy call around the slice, or by using getindex(a, ...) instead of a[...]. It still seems clearer to me than having zillions of @view statements.

Down to 90 allocations now (from 180). But I did not implement all suggestions yet.

1 Like

Yeah I changed it youre right

Yes, I actually did suggest that, too😁

1 Like