Slow code to compute b=A*x

ww1g11 · July 3, 2024, 5:33am

Hi all,

I wrote a very simple code to compute b=A*x and found that it is slow, please see the following codes to compare sum1, sum2 and sum3:

using TimerOutputs
const timer = TimerOutput()

function test(N=20000)
    A = rand(N, N)
    b = zeros(N)
    x = rand(N)

    @timeit timer "sum1" begin
        for i in 1:N
            f = 0.0
            for j in 1:N
                f += A[i, j] * x[j]
            end
            b[i] = f
        end
    end

    @timeit timer "sum2" begin
        for i in 1:N
            f = 0.0
            for j in 1:N
                f = A[i, j] * x[j]
            end
            b[i] = f
        end
    end

    @timeit timer "sum3" begin
        for i in 1:N
            f = 0.0
            for j in 1:N
                f += A[i, j] * x[j]
            end
            #b[i] = f
        end
    end

end

test()
println(timer)

The output is

 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations
                   ───────────────────────   ────────────────────────
 Tot / % measured:      3.96s /  79.5%           3.00GiB /   0.0%

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 sum1           1    2.90s   92.3%   2.90s     0.00B     - %    0.00B
 sum3           1    122ms    3.9%   122ms     0.00B     - %    0.00B
 sum2           1    121ms    3.8%   121ms     0.00B     - %    0.00B
 ────────────────────────────────────────────────────────────────────

I just wonder why sum1 is much slower than sum2 or sum3.

The versioninfo() in the test computer gives

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × 13th Gen Intel(R) Core(TM) i7-13700F
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 24 virtual cores)
Environment:
  JULIA_PKG_SERVER = https://mirrors.ustc.edu.cn/julia
  JULIA_SSL_CA_ROOTS_PATH =

jling · July 3, 2024, 5:34am

JIT compilation?

ww1g11 · July 3, 2024, 5:39am

it seems not,

I added the following code and test again,

reset_timer!(timer)
test()
println(timer)

but the output is similar.

sgaure · July 3, 2024, 5:49am

In sum2 you only use the last f from the inner loop, so the compiler probably optimizes it away and replaces it with a single assignment. In sum3 you are not using the result of the inner loop, so it’s probably optimized out.

sum1 is probably slow because you do it the wrong way. The first index should be looped over in the inner loop.

ww1g11 · July 3, 2024, 6:13am

I know first index should be placed in the inner loop for a column-major matrix, but why sum1 is much slower than sum2 or sum3, because in principle, they do not fit the memory layout either.

sgaure · July 3, 2024, 6:18am

sum2 and sum3 effectively do not have an inner loop. The compiler will optimize it away since the results of the inner loop are not used.

ww1g11 · July 3, 2024, 6:39am

Thanks, I guess this is the reason.

sgaure · July 3, 2024, 7:55am

You can compare it to LinearAlgebra.mul!(b, A, x). This calls into the BLAS, and is probably very hard to beat.

Topic		Replies	Views
Why is sum slower than multiplying by vector of ones? Performance	4	478	September 28, 2023
Faster sum loop when looping through vector multiple times than once Performance	5	291	September 29, 2023
Summing matrix elements is >1000X slower than summing vector elements General Usage performance	8	1330	April 17, 2017
Unclear allocation behaviour with built-in sum() Performance question , memory-allocation	4	204	February 2, 2025
Reduce memory allocated in array view and in place sum Performance question	12	694	November 10, 2023

Slow code to compute b=A*x

Related topics