Slow code to compute b=A*x

Hi all,

I wrote a very simple code to compute b=A*x and found that it is slow, please see the following codes to compare sum1, sum2 and sum3:

using TimerOutputs
const timer = TimerOutput()

function test(N=20000)
    A = rand(N, N)
    b = zeros(N)
    x = rand(N)

    @timeit timer "sum1" begin
        for i in 1:N
            f = 0.0
            for j in 1:N
                f += A[i, j] * x[j]
            end
            b[i] = f
        end
    end

    @timeit timer "sum2" begin
        for i in 1:N
            f = 0.0
            for j in 1:N
                f = A[i, j] * x[j]
            end
            b[i] = f
        end
    end

    @timeit timer "sum3" begin
        for i in 1:N
            f = 0.0
            for j in 1:N
                f += A[i, j] * x[j]
            end
            #b[i] = f
        end
    end

end

test()
println(timer)

The output is

 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations
                   ───────────────────────   ────────────────────────
 Tot / % measured:      3.96s /  79.5%           3.00GiB /   0.0%

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 sum1           1    2.90s   92.3%   2.90s     0.00B     - %    0.00B
 sum3           1    122ms    3.9%   122ms     0.00B     - %    0.00B
 sum2           1    121ms    3.8%   121ms     0.00B     - %    0.00B
 ────────────────────────────────────────────────────────────────────

I just wonder why sum1 is much slower than sum2 or sum3.

The versioninfo() in the test computer gives

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 Γ— 13th Gen Intel(R) Core(TM) i7-13700F
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 24 virtual cores)
Environment:
  JULIA_PKG_SERVER = https://mirrors.ustc.edu.cn/julia
  JULIA_SSL_CA_ROOTS_PATH =

JIT compilation?

it seems not,

I added the following code and test again,

reset_timer!(timer)
test()
println(timer)

but the output is similar.

In sum2 you only use the last f from the inner loop, so the compiler probably optimizes it away and replaces it with a single assignment. In sum3 you are not using the result of the inner loop, so it’s probably optimized out.

sum1 is probably slow because you do it the wrong way. The first index should be looped over in the inner loop.

2 Likes

I know first index should be placed in the inner loop for a column-major matrix, but why sum1 is much slower than sum2 or sum3, because in principle, they do not fit the memory layout either.

sum2 and sum3 effectively do not have an inner loop. The compiler will optimize it away since the results of the inner loop are not used.

Thanks, I guess this is the reason.

You can compare it to LinearAlgebra.mul!(b, A, x). This calls into the BLAS, and is probably very hard to beat.