Huge performance variance with `if` options in loops

hz-xiaxz · July 4, 2025, 3:47am

I’m trying to optimize a non-allocating matrix update. Here is the simple code.

function update_W_original!(
    W::AbstractMatrix,
    l::Int,
    K::Int,
    col_cache::AbstractVector,
    row_cache::AbstractVector,
)
    inv_WKL = inv(W[K, l])

    copyto!(row_cache, view(W, K, :))
    row_cache[l] -= 1.0
    copyto!(col_cache, view(W, :, l))

    rows, cols = axes(W)

    @inbounds for j in cols
        scalar_Krow_j = row_cache[j]
        if scalar_Krow_j != 0
            @simd for i in rows
                factor_i = col_cache[i] * inv_WKL
                W[i, j] -= factor_i * scalar_Krow_j
            end
        end
    end
    return nothing
end

This is some code advised by Gemini, and reached best speed on my machine so far. But I’m confused with the if, since it’s expensive in loops, and my row_cache is unlikely to be sparse.
I want to remove the if and see whether it helps with performance.

# Modified update_W! function without the if condition
function update_W_no_if!(
    W::AbstractMatrix,
    l::Int,
    K::Int,
    col_cache::AbstractVector,
    row_cache::AbstractVector,
)
    inv_WKL = inv(W[K, l])

    copyto!(row_cache, view(W, K, :))
    row_cache[l] -= 1.0
    copyto!(col_cache, view(W, :, l))

    rows, cols = axes(W)

    @inbounds for j in cols
        scalar_Krow_j = row_cache[j]
        @simd for i in rows
            factor_i = col_cache[i] * inv_WKL
            W[i, j] -= factor_i * scalar_Krow_j
        end
    end
    return nothing
end

# Setup for benchmarking
function setup_benchmark(matrix_size::Int)
    W = rand(ComplexF64, matrix_size, matrix_size)
    l = rand(1:matrix_size)
    K = rand(1:matrix_size)
    col_cache = rand(ComplexF64, matrix_size)
    row_cache = rand(ComplexF64, matrix_size)
    return W, l, K, col_cache, row_cache
end

matrix_size = 1000 # Example size, adjust as needed
W, l, K, col_cache, row_cache = setup_benchmark(matrix_size)

println("\nOriginal update_W!:")
@btime update_W_original!($W, $l, $K, $col_cache, $row_cache) setup = (W_copy = deepcopy(W))

println("\nupdate_W! without if condition:")
@btime update_W_no_if!($W, $l, $K, $col_cache, $row_cache) setup = (W_copy = deepcopy(W))

To my suprise, I saw a HUGE performance variation:

    Original update_W!:
  3.716 μs (2 allocations: 32 bytes)

update_W! without if condition:
  1.591 ms (2 allocations: 32 bytes)

I tried to investigate and I infiltrate into the loop, where I saw the random row_cache unexpectedly turns into nearly zero!

infil> row_cache
1000-element Vector{ComplexF64}:
 2.7755575615628914e-17 + 0.0im
  5.551115123125783e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 - 1.3877787807814457e-17im
                    0.0 - 5.551115123125783e-17im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 - 5.551115123125783e-17im
 2.7755575615628914e-17 + 0.0im
                    0.0 - 2.7755575615628914e-17im
 2.0816681711721685e-17 + 0.0im
                    0.0 - 5.551115123125783e-17im
                    0.0 - 5.551115123125783e-17im
                    0.0 - 5.377642775528102e-17im
 2.7755575615628914e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 - 2.7755575615628914e-17im
 2.7755575615628914e-17 + 0.0im
                    0.0 - 5.551115123125783e-17im
                        ⋮
  5.551115123125783e-17 + 0.0im
  5.551115123125783e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
  5.551115123125783e-17 + 0.0im
 2.7755575615628914e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 - 2.42861286636753e-17im
 2.7755575615628914e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 - 4.7596475372113645e-17im
                    0.0 + 0.0im
  5.551115123125783e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 - 5.551115123125783e-17im

That’s quite strange, I couldn’t figure out why this row_cache changes. And I feel really unsafe on this optimization, though some naive test says it reproduces the same data…

Can anyone help on this? I’m really confused…

karei · July 4, 2025, 4:01am

My suggestion is to try Base.ifelse—Function

ifelse(condition::Bool, x, y)

Return x if condition is true, otherwise return y. This differs from ? or if in that it is an ordinary function, so all the arguments are evaluated first. In some cases, using ifelse instead of an if statement can eliminate the branch in generated code and provide higher performance in tight loops.

Benny · July 4, 2025, 4:52am

Your benchmark reuses W instead of inputing the W_copy you make in the setup, so it’s changing with each call. If we roughly substitute terms, then the change is W[i, l] -= W[i, l] * inv(W[K, l]) * (W[K, l] - 1.0) and otherwise W[i, j] -= W[i, l] * inv(W[K, l]) * W[K, j]. Notably:

when i == K && j == l, then W[K, l] -= W[K, l] * inv(W[K, l]) * (W[K, l] - 1.0), which simplifies to W[K, l] = 1.0 within floating point errors. Indeed, we can check after 1 update_W_original! call that W[K,l] is approximately 1.0 + 0.0im.
when i == K && j != l, then W[K, j] -= W[K, l] * inv(W[K, l]) * W[K, j], which simplifies to W[K, j] = 0.0 within floating point errors. Indeed, if we check W[K, :] after 1 call, it’s all approximately zero except for W[K,l].

During the 2nd call, you write the mostly zeroed out W[K, :] to row_cache then subtract the only element with a one by one. The if check skipping any changes to W explains the weirdly fast benchmark. When you omit the if check in the 2nd benchmark, you still perpetuate the effects on W[K,:]. If you fix the setup, it’s probably safer to do evals=1 as well because the setup runs before each sample of repeated calls, not per call.

hz-xiaxz · July 4, 2025, 6:50am

Thanks for your prompt and detailed reply! That’s exactly what I encountered.

I compared this approach with the BLAS 1-rank update function geru!, and found this manual loop approach beat BLAS! Now I know how ridiculous it was!

hz-xiaxz · July 4, 2025, 6:51am

Yes actually I think in my case the if is not necessary, as when if statement fails, all I need to do is adding lots of zeros.

I didn’t take care of the gemini output, now I see how strange it was.

Topic		Replies	Views
To factor or not to factor? Functions with small differences within loops Performance	4	699	May 16, 2020
Hot loop updating array elements with product over other array Performance	15	844	August 17, 2018
Performance optimization：Frequently use permutedims function Performance question	22	367	May 1, 2025
Performance difference in permuting Arrays Performance sort , sortperm	4	535	October 11, 2021
Most performant way to perform matrix and vector calculations in a function Performance question , performance , speed-optimization , matrix	5	515	October 9, 2023

Huge performance variance with `if` options in loops

Related topics