Huge performance variance with `if` options in loops

I’m trying to optimize a non-allocating matrix update. Here is the simple code.

function update_W_original!(
    W::AbstractMatrix,
    l::Int,
    K::Int,
    col_cache::AbstractVector,
    row_cache::AbstractVector,
)
    inv_WKL = inv(W[K, l])

    copyto!(row_cache, view(W, K, :))
    row_cache[l] -= 1.0
    copyto!(col_cache, view(W, :, l))

    rows, cols = axes(W)

    @inbounds for j in cols
        scalar_Krow_j = row_cache[j]
        if scalar_Krow_j != 0
            @simd for i in rows
                factor_i = col_cache[i] * inv_WKL
                W[i, j] -= factor_i * scalar_Krow_j
            end
        end
    end
    return nothing
end

This is some code advised by Gemini, and reached best speed on my machine so far. But I’m confused with the if, since it’s expensive in loops, and my row_cache is unlikely to be sparse.
I want to remove the if and see whether it helps with performance.

# Modified update_W! function without the if condition
function update_W_no_if!(
    W::AbstractMatrix,
    l::Int,
    K::Int,
    col_cache::AbstractVector,
    row_cache::AbstractVector,
)
    inv_WKL = inv(W[K, l])

    copyto!(row_cache, view(W, K, :))
    row_cache[l] -= 1.0
    copyto!(col_cache, view(W, :, l))

    rows, cols = axes(W)

    @inbounds for j in cols
        scalar_Krow_j = row_cache[j]
        @simd for i in rows
            factor_i = col_cache[i] * inv_WKL
            W[i, j] -= factor_i * scalar_Krow_j
        end
    end
    return nothing
end

# Setup for benchmarking
function setup_benchmark(matrix_size::Int)
    W = rand(ComplexF64, matrix_size, matrix_size)
    l = rand(1:matrix_size)
    K = rand(1:matrix_size)
    col_cache = rand(ComplexF64, matrix_size)
    row_cache = rand(ComplexF64, matrix_size)
    return W, l, K, col_cache, row_cache
end

matrix_size = 1000 # Example size, adjust as needed
W, l, K, col_cache, row_cache = setup_benchmark(matrix_size)

println("\nOriginal update_W!:")
@btime update_W_original!($W, $l, $K, $col_cache, $row_cache) setup = (W_copy = deepcopy(W))

println("\nupdate_W! without if condition:")
@btime update_W_no_if!($W, $l, $K, $col_cache, $row_cache) setup = (W_copy = deepcopy(W))

To my suprise, I saw a HUGE performance variation:

    Original update_W!:
  3.716 μs (2 allocations: 32 bytes)

update_W! without if condition:
  1.591 ms (2 allocations: 32 bytes)

I tried to investigate and I infiltrate into the loop, where I saw the random row_cache unexpectedly turns into nearly zero!

infil> row_cache
1000-element Vector{ComplexF64}:
 2.7755575615628914e-17 + 0.0im
  5.551115123125783e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 - 1.3877787807814457e-17im
                    0.0 - 5.551115123125783e-17im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 - 5.551115123125783e-17im
 2.7755575615628914e-17 + 0.0im
                    0.0 - 2.7755575615628914e-17im
 2.0816681711721685e-17 + 0.0im
                    0.0 - 5.551115123125783e-17im
                    0.0 - 5.551115123125783e-17im
                    0.0 - 5.377642775528102e-17im
 2.7755575615628914e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 - 2.7755575615628914e-17im
 2.7755575615628914e-17 + 0.0im
                    0.0 - 5.551115123125783e-17im
                        ⋮
  5.551115123125783e-17 + 0.0im
  5.551115123125783e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
  5.551115123125783e-17 + 0.0im
 2.7755575615628914e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 - 2.42861286636753e-17im
 2.7755575615628914e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 + 0.0im
                    0.0 - 4.7596475372113645e-17im
                    0.0 + 0.0im
  5.551115123125783e-17 + 0.0im
                    0.0 + 0.0im
                    0.0 - 5.551115123125783e-17im

That’s quite strange, I couldn’t figure out why this row_cache changes. And I feel really unsafe on this optimization, though some naive test says it reproduces the same data…

Can anyone help on this? I’m really confused…

My suggestion is to try Base.ifelse—Function

ifelse(condition::Bool, x, y)

Return x if condition is true, otherwise return y. This differs from ? or if in that it is an ordinary function, so all the arguments are evaluated first. In some cases, using ifelse instead of an if statement can eliminate the branch in generated code and provide higher performance in tight loops.

4 Likes

Your benchmark reuses W instead of inputing the W_copy you make in the setup, so it’s changing with each call. If we roughly substitute terms, then the change is W[i, l] -= W[i, l] * inv(W[K, l]) * (W[K, l] - 1.0) and otherwise W[i, j] -= W[i, l] * inv(W[K, l]) * W[K, j]. Notably:

  1. when i == K && j == l, then W[K, l] -= W[K, l] * inv(W[K, l]) * (W[K, l] - 1.0), which simplifies to W[K, l] = 1.0 within floating point errors. Indeed, we can check after 1 update_W_original! call that W[K,l] is approximately 1.0 + 0.0im.

  2. when i == K && j != l, then W[K, j] -= W[K, l] * inv(W[K, l]) * W[K, j], which simplifies to W[K, j] = 0.0 within floating point errors. Indeed, if we check W[K, :] after 1 call, it’s all approximately zero except for W[K,l].

During the 2nd call, you write the mostly zeroed out W[K, :] to row_cache then subtract the only element with a one by one. The if check skipping any changes to W explains the weirdly fast benchmark. When you omit the if check in the 2nd benchmark, you still perpetuate the effects on W[K,:]. If you fix the setup, it’s probably safer to do evals=1 as well because the setup runs before each sample of repeated calls, not per call.

1 Like

Thanks for your prompt and detailed reply! That’s exactly what I encountered.

I compared this approach with the BLAS 1-rank update function geru!, and found this manual loop approach beat BLAS! :squinting_face_with_tongue: Now I know how ridiculous it was!

Yes actually I think in my case the if is not necessary, as when if statement fails, all I need to do is adding lots of zeros.

I didn’t take care of the gemini output, now I see how strange it was.