How to improve the performance while do matrix operation

I am a new Julia user. When doing matrix operations, there is some confusion about the performance of the code.

Format 1

C=0
A=[2 4 1; 4 1 4; 4 2 3]

@time for i=1:10000000
    global A=A/i
    A=A+A'
    global C=C+sum(A)
end
C

3.792766 seconds (60.00 M allocations: 3.576 GiB, 3.94% gc time)

It takes 3.79s.

Format 2

By referring to documents, the global variables is avoided:

A=[2.0 4 1; 4 1 4; 4 2 3]

function loop_over_global(A::Array{Float64,2})
    c=0.0
    for i=1:10000000
        A=A/i
        A=A.+A'
        c+=sum(A)
    end
    return c
end

@time loop_over_global(A)

1.594670 seconds (20.04 M allocations: 2.982 GiB, 10.59% gc time)

It takes 1.59s.

Format 3

By replacing the default matrix operations into loop operations:

A=[2.0 4 1; 4 1 4; 4 2 3]

function loop_over_global(A::Array{Float64,2})
    c=0.0
    for i=1:10^7
        for k=1:3
            for j=1:3
                A[j,k]=A[j,k]/i
            end
        end
        
        for k=1:3
            for j=k:3
                A[j,k]=A[j,k]+A'[j,k]
            end
        end
        
        for k=1:3
            for j=1:k-1
                A[j,k]=A[k,j]
            end
        end
        
        c+=sum(A)
    end
    return c
end

@time loop_over_global(A)

0.442143 seconds (82.71 k allocations: 4.293 MiB)

It takes 0.44s.

Fortran code

    program main

    integer(kind=8):: ip,i,j
    real(kind=8):: A(3,3),c,t1,t2


    c=0.0
    A(1,1) = 2; A(1,2) = 4; A(1,3) = 1;
    A(2,1) = 4; A(2,2) = 1; A(2,3) = 4;
    A(3,1) = 4; A(3,2) = 2; A(3,3) = 3;

    call cpu_time(t1)
    do ip=1,10000000
        A=A/ip

        A=A+transpose(A)

        do i=1,3
            do j=1,3
                c=c+A(j,i)
            enddo
        enddo
    enddo
    call cpu_time(t2)

    write(*,*) c,t2-t1

    end

Uing IVF2011. It takes 0.25s.

I am very confused about the performance difference between format 2 and format 3. And can Julia code be further improved?

Adding @inbounds made it about 2x faster which should match Fortran on your computer. I also put together the double loops: this is just a style thing. The code is this:

A=[2.0 4 1; 4 1 4; 4 2 3]
function loop_over_global(A)
    c=0.0
    @inbounds for i=1:10^7
        for k=1:3, j=1:3
          A[j,k]=A[j,k]/i
        end

        for k=1:3, j=k:3
          A[j,k]=A[j,k]+A'[j,k]
        end

        for k=1:3, j=1:k-1
          A[j,k]=A[k,j]
        end

        c+=sum(A)
    end
    return c
end

@time loop_over_global(A)
@time loop_over_global(A)
2 Likes

StaticArrays might help here. It is about 8 times faster for me.
Do you need to use a global variable or can you use a local variable too?

using StaticArrays
using BenchmarkTools

A=[2.0 4 1; 4 1 4; 4 2 3]

function loop_over_global(A::Array{Float64,2})
    c=0.0
    for i=1:10000000
        A=A/i
        A=A.+A'
        c+=sum(A)
    end
    return c
end


function loop_over_local()
    A=[2.0 4 1; 4 1 4; 4 2 3]
    c=0.0
    for i=1:10000000
        A=A/i
        A=A.+A'
        c+=sum(A)
    end
    return c
end

function loop_over_local_SMatrix()
    B=[2.0 4 1; 4 1 4; 4 2 3]
    A=SMatrix{3,3}(B)
    c=0.0
    for i=1:10000000
        A=A/i
        A=A.+A'
        c+=sum(A)
    end
    return c
end

@btime loop_over_global(A) #1.017s
@btime loop_over_local() #1.033s
@btime loop_over_local_SMatrix() #159 ms
1 Like

This should be as fast as your Fortran version:

using StaticArrays
A = @SMatrix [2.0 4 1; 4 1 4; 4 2 3]

function loop_over_global(A)
    c = 0.0
    B = copy(A)
    for i = 1:10000000
        A  = B ./ i
        B  = A + A'
        c += sum(B)
    end
    return c
end

# Warmup, then time 
@time loop_over_global(A)
@time loop_over_global(A)

Time : 0.106067 seconds (5 allocations: 176 bytes)

1 Like

A few of notes about what is going on here in addition to the replies already posted -

  1. The docs are right; don’t use globals :smile:
  2. If you were trying to use broadcasting in Format 2, you missed a few dots out. Try
    A .= A ./ i
    A .= A .+ A'
    
    The .= means that the result is put back into the original matrix rather than creating a new matrix and making A “point to” the new matrix. In this case, the transpose is the thing that hurts the performance since an entire new matrix must be created before the broadcasted .+ can work. This is why the for loop is much faster - you don’t need to create a new matrix. (As a side note, I’d hoped that the new Transpose matrix wrapper would fix this but apparently not; though it might get fixed in the future.)
  3. The behaviour of Format 2 and Format 3 are different; Format 2 preserves the argument A (i.e., leaves it unchanged outside the function) whereas Format 3 mutates the argument A (i.e., after the function is called the value of A has changed outside as well). This may or may not matter to you.
  4. When you are timing your code you are including the compilation time. Often this is irrelevant if you are calling the function repeatedly. It is better to use BenchmarkTools.jl and the @btime macro.

Ultimately the StaticArrays version is going to be the best and is closest to what the Fortran version is actually doing. The key thing is that StaticArrays carry all their size information within the type and so the compiler can do lots of optimisations, whereas the generic Matrix uses runtime size information which means the compiler can’t do so much.

5 Likes

This will not give a correct output, c will not be changed. The first dot in A .= A .+ A' must be removed.

I’m not sure why you say that. Comparing the results from your version and the version with the extra dots gives equality.

function loop_over_global(A)
    c = 0.0
    B = copy(A)
    for i = 1:10000000
        A  = B ./ i
        B  = A + A'
        c += sum(B)
    end
    return c
end

function loop_over_global_dotted(A)
    c=0.0
    for i=1:10000000
        A .= A./i
        A .= A.+A'
        c+=sum(A)
    end
    return c
end

julia> loop_over_global(@SMatrix [2.0 4 1; 4 1 4; 4 2 3]) == loop_over_global_dotted([2.0 4 1; 4 1 4; 4 2 3])
true

version with dot removed:

julia> function loop_over_global(A::Array{Float64,2})
           c = 0.0
           for i = 1:10000000
               A .= A ./ i
               A = A .+ A'
               c += sum(A)
           end
           return c
       end
loop_over_global (generic function with 1 method)

julia> A=[2.0 4 1; 4 1 4; 4 2 3]
3×3 Array{Float64,2}:
 2.0  4.0  1.0
 4.0  1.0  4.0
 4.0  2.0  3.0

julia> @btime loop_over_global($A)
  550.793 ms (10000000 allocations: 1.49 GiB)
159.72640247326618

Version with one more dot:

julia> function loop_over_global(A::Array{Float64,2})
           c = 0.0
           for i = 1:10000000
               A .= A ./ i
               A .= A .+ A' # <---
               c += sum(A)
           end
           return c
       end
loop_over_global (generic function with 1 method)

julia> A=[2.0 4 1; 4 1 4; 4 2 3]
3×3 Array{Float64,2}:
 2.0  4.0  1.0
 4.0  1.0  4.0
 4.0  2.0  3.0

julia> @btime loop_over_global($A)
  570.866 ms (10000000 allocations: 1.49 GiB)
0.0

Very strange:

julia> @btime loop_over_global_dotted($A)
  573.440 ms (10000000 allocations: 1.49 GiB)
0.0

julia> @btime loop_over_global_dotted([2.0 4 1; 4 1 4; 4 2 3])
  574.914 ms (10000012 allocations: 1.49 GiB)
159.72640247326618

Interpolation $A makes the problem!

Yes. What you mentioned about leaving out the last dot means that it only mutates the input on the first iteration (it stops mutating after the assignment). It happens to be that on the first iteration the division is by 1 and so the input isn’t changed.

Putting in all the dots means that the input gets mutated all the way through, leading to a matrix of zeros by the end. This then gets used as the starting value for the next benchmark timing and so the final result of the final call is zero.

Yes, I see, do you see it normal or it may be considered a bug in BenchmarkTools?

The problem is you’re overwriting A.

julia> function loop_over_global(A::Array{Float64,2})
           c = 0.0
           for i = 1:10000000
               A .= A ./ i
               A .= A .+ A' # <---
               c += sum(A)
           end
           return c
       end
loop_over_global (generic function with 1 method)

julia> A=[2.0 4 1; 4 1 4; 4 2 3]
3×3 Array{Float64,2}:
 2.0  4.0  1.0
 4.0  1.0  4.0
 4.0  2.0  3.0

julia> A
3×3 Array{Float64,2}:
 2.0  4.0  1.0
 4.0  1.0  4.0
 4.0  2.0  3.0

julia> loop_over_global(A)
159.72640247326618

julia> A
3×3 Array{Float64,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> loop_over_global(A)
0.0

When instead of interpolating, you create a new matrix to call the function on with loop_over_global_dotted([2.0 4 1; 4 1 4; 4 2 3]), it will work.

EDIT: Removing the second of the .= works, because on the first iteration you just divide A ./= 1. Then you create a copy and reassign, so the original A never gets overwritten with anything different.

1 Like

Yes, I understand that at the end of the first run, A is equal to zeros and all subsequent runs are using that value as input because A is mutated. What I expected is that BenchmarkTools should have protected A and used a local copy at the beginning of each loop and not just at the beginning of the first one. The question is: should this be considered normal or a bug? I may be wrong, but this strange behavior appears in many places, try for example @btime $A .+ $B vs. @btime $A + $B, which do you think should be faster?

Use @benchmarkable foo(A) setup=(setupfunctions...) to work around this. See the manual for details.

julia> x = rand(100000);

julia> b = @benchmarkable sort!(y) setup=(y = copy($x))
Benchmark(evals=1, seconds=5.0, samples=10000)

julia> run(b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.077 ms (0.00% GC)
  median time:      5.398 ms (0.00% GC)
  mean time:        5.763 ms (0.00% GC)
  maximum time:     8.652 ms (0.00% GC)
  --------------
  samples:          850
  evals/sample:     1

julia> x' # not sorted
1×100000 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.457684  0.920921  0.594477  0.99042  0.516169  0.240637  0.163594  …  0.228337  0.893383  0.332289  0.0195335  0.944543  0.394173
1 Like