Trying to speed up a Laplace equation solver

I am trying to speed up my Laplace equation solver. Based on profiling, most of the time is spent in the function given below. I have optimised the function as much as possible, but the performance still isn’t close to that of an available Fortran code (which runs about 33% faster). Can anyone tell me how I can improve my code beyond this?

    function calcNext!(Anew, A, m, n)
        error = 0.0
        @inbounds for j = 2:(m-1), i = 2:(n-1)
            @fastmath Anew[i,j] = 0.25 * ( A[i+1, j ]
                                         + A[i-1, j ]
                                         + A[ i ,j-1]
                                         + A[ i ,j+1] )
            error = max(error, abs(Anew[i,j] - A[i,j]))
        end
        return error
    end

The following is the time taken by each part of the code, as per TimerOutputs.

 ──────────────────────────────────────────────────────────────────────
                               Time                   Allocations
                       ──────────────────────   ───────────────────────
   Tot / % measured:        9.90s / 100%            12.4KiB / 86.2%

 Section       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────
 Loop               1    9.90s   100%   9.90s   10.7KiB  100%   10.7KiB
   calcNext!    1.00k    6.92s  69.9%  6.92ms     0.00B  0.00%    0.00B
   Copy         1.00k    2.96s  29.9%  2.96ms     0.00B  0.00%    0.00B
 Print          1.00k   13.9ms  0.14%  13.9μs   8.19KiB  76.7%    8.38B
 ──────────────────────────────────────────────────────────────────────
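For completeness, a table like this is produced by wrapping each phase of the loop in TimerOutputs.jl `@timeit` sections. A minimal self-contained sketch (the broadcast below is just a stand-in, not my actual kernel):

```julia
using TimerOutputs

# Minimal sketch of the instrumentation behind the table above: each phase
# of the iteration loop gets a named @timeit section; `show(to)` prints the
# accumulated timing table. The broadcast is a stand-in for the real kernel.
const to = TimerOutput()
A = rand(512, 512); Anew = similar(A)
@timeit to "Loop" for iter in 1:100
    @timeit to "calc" (Anew .= 0.25 .* A)
    @timeit to "Copy" copyto!(A, Anew)
end
show(to)
```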

It looks like your solver is Jacobi iteration. You might do better with a different solver rather than tuning this one.

Or is your loop a smoother for multigrid?
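For reference, here is a hypothetical sketch of a successive over-relaxation (SOR) sweep for the same 5-point stencil (the function name and the choice of ω are mine, not from any code in this thread). It updates in place, and with a well-chosen ω it converges in far fewer iterations than Jacobi:

```julia
# Hypothetical sketch: one in-place SOR sweep for the 5-point Laplace stencil.
# ω = 1 reduces to Gauss–Seidel; 1 < ω < 2 over-relaxes and converges faster.
function sor_sweep!(A, ω = 1.5)
    m, n = size(A)
    err = zero(eltype(A))
    @inbounds for j = 2:(n-1), i = 2:(m-1)
        old = A[i, j]
        gs  = 0.25 * (A[i+1, j] + A[i-1, j] + A[i, j-1] + A[i, j+1])
        A[i, j] = old + ω * (gs - old)
        err = max(err, abs(A[i, j] - old))   # same convergence measure as calcNext!
    end
    return err
end
```

The price is a loop-carried dependence that blocks SIMD vectorization of the inner loop; red-black ordering is the usual remedy if you want both.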

I just want to see how close I can get Julia to Fortran. I used a Fortran code provided by NVIDIA for this. The function I pasted in this question is based on the Fortran function given below.

    function calcNext(A, Anew, m, n)
      integer, parameter          :: fp_kind=kind(1.0d0)
      real(fp_kind)               :: calcNext  ! result type; implicit typing would make it single precision
      real(fp_kind),intent(inout) :: A(0:n-1,0:m-1)
      real(fp_kind),intent(inout) :: Anew(0:n-1,0:m-1)
      integer,intent(in)          :: m, n
      integer                     :: i, j
      real(fp_kind)               :: error

      error=0.0_fp_kind

      !$acc parallel loop default(present) &
      !$acc   reduction(max:error)
      do j=1,m-2
        do i=1,n-2
          Anew(i,j) = 0.25_fp_kind * ( A(i+1,j  ) + A(i-1,j  ) + &
                                       A(i  ,j-1) + A(i  ,j+1) )
          error = max( error, abs(Anew(i,j)-A(i,j)) )
        end do
      end do
      calcNext = error
    end function calcNext

Julia takes 34 seconds and Fortran takes 28 seconds (about 20% faster). I ran the Fortran code with a single thread, so the !$acc lines can be ignored.

You can attain a significant speedup by moving the @fastmath annotation to the outer loop so that it applies to the whole loop body (including the max/abs reduction), and it’s faster still with LoopVectorization.jl’s @avx macro:


function calcNext0!(Anew, A)
    m, n = size(A)
    error = 0.0
    @inbounds for j = 2:(m-1)
        for i = 2:(n-1)
            @fastmath Anew[i,j] = 0.25*( A[i+1, j ]
                            + A[i-1, j ]
                            + A[ i ,j-1]
                            + A[ i ,j+1]    )
            error = max(error, abs( Anew[i,j] - A[i,j] ))
        end
    end
    return error
end

function calcNext1!(Anew, A)
    m, n = size(A)
    error = 0.0
    @inbounds @fastmath for j = 2:(m-1)
        for i = 2:(n-1)
            Anew[i,j] = 0.25*( A[i+1, j ]
                            + A[i-1, j ]
                            + A[ i ,j-1]
                            + A[ i ,j+1]    )
            error = max(error, abs( Anew[i,j] - A[i,j] ))
        end
    end
    return error
end

function calcNext2!(Anew, A)
    error = zero(eltype(A))
    m, n = size(A)
    @avx for j = 2:(m-1)
        for i = 2:(n-1)
            Anew[i,j] = 0.25*( A[i+1, j ]
                             + A[i-1, j ]
                             + A[ i ,j-1]
                             + A[ i ,j+1])
            error = max(error, abs(Anew[i,j] - A[i,j]))
        end
    end
    return error
end

julia> A = zeros(512, 512)

julia> Anew = copy(A)

julia> @btime calcNext0!(Anew, A)
  920.201 μs (1 allocation: 16 bytes)
0.5868908386484257

julia> @btime calcNext1!(Anew, A)
  430.900 μs (1 allocation: 16 bytes)
0.5868908386484257

julia> @btime calcNext2!(Anew, A)
  356.600 μs (1 allocation: 16 bytes)
0.5868908386484257
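As an aside, the `1 allocation: 16 bytes` above is an artifact of benchmarking with non-constant globals; interpolating the arguments with `$` removes it. A minimal self-contained illustration, assuming BenchmarkTools.jl is installed:

```julia
using BenchmarkTools

x = rand(1000)   # non-constant global

# Without `$`, `x` is an untyped global, so the call is type-unstable and
# @btime reports a spurious heap allocation for the boxed result.
@btime sum(x)

# With `$`, the array is interpolated into the benchmark as a constant,
# so the timing reflects only the function itself (0 allocations).
@btime sum($x)
```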

Check out the Julia code in this benchmark: Interesting NASA "Language-Comparison" repository


Thanks a lot! I got the speedup by moving the @fastmath annotation to the outer loop. However, I got an error when I tried to use the @avx macro. Also (probably unrelated), the LoopVectorization.jl package installation showed an error for vsub.

julia> @btime calcNext0!(A,Anew)
  755.200 μs (1 allocation: 16 bytes)

julia> @btime calcNext1!(A,Anew)
  300.799 μs (1 allocation: 16 bytes)

julia> calcNext2!(A,Anew)
ERROR: BoundsError: attempt to access (LoopVectorization.LoopValue(), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000015ed080, (512,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000015ed080, (512,)), LoopVectorization.LoopValue(), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000015ed080, (512,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000015ed080, (512,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000013d2080, (512,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000015ed080, (512,)), 0.0, abs)
  at index [11]
Stacktrace:
 [1] getindex(::Tuple, ::Int64) at .\tuple.jl:24
 [2] macro expansion at C:\Users\Ashok Kumar Mishra\.julia\packages\LoopVectorization\sJUjl\src\reconstruct_loopset.jl:268 [inlined]
 [3] _avx_! at C:\Users\Ashok Kumar Mishra\.julia\packages\LoopVectorization\sJUjl\src\reconstruct_loopset.jl:268 [inlined]
 [4] calcNext2!(::Array{Float64,2}, ::Array{Float64,2}) at C:\Users\Ashok Kumar Mishra\Documents\My Google Drive\RESEARCH\GIT\ShagunWork\Trials.jl:34
 [5] top-level scope at none:0

What version of Julia are you on? What version of LoopVectorization?
Your error shows:

_avx_! at C:\Users\Ashok Kumar Mishra\.julia\packages\LoopVectorization\sJUjl\src\reconstruct_loopset.jl:268

But _avx_! hasn’t been at src\reconstruct_loopset.jl:268 for a very long time (the line number should be much higher).

LoopVectorization won’t work on Julia 1.0, but should work on 1.1 and newer.
Julia 1.4 or newer are recommended (for much faster compile times).

I am on Julia 1.4.2.

And what version of LoopVectorization?

julia> using Pkg

julia> Pkg.status("LoopVectorization")
Status `~/.julia/environments/v1.6/Project.toml`
  [bdcacae8] LoopVectorization v0.8.20 `~/.julia/dev/LoopVectorization`

(It shouldn’t be in dev for you, but update your packages if it isn’t 0.8.20.)

I just updated my packages and the version changed from 0.6.21 to 0.6.30.
Should I try to get it from the repository and install it that way?

OK, a funny thing happened. I installed the package in my WSL-2 Julia installation and its version is 0.8.13. In my Windows installation, I was only able to install up to 0.6.30. I guess this issue cannot be reproduced.

Maybe you have a package on your Windows installation holding it back?
What if you try ] add LoopVectorization@0.8?

FWIW, I get:

julia> A = rand(512, 512); Anew = similar(A);

julia> @btime calcNext0!($Anew, $A)
  752.933 μs (0 allocations: 0 bytes)
0.9497299087695734

julia> @btime calcNext1!($Anew, $A)
  274.262 μs (0 allocations: 0 bytes)
0.9497299087695734

julia> @btime calcNext2!($Anew, $A)
  205.201 μs (0 allocations: 0 bytes)
0.9497299087695734

I got a superb speedup in WSL-2 on Windows 10 (quite surprised that the package didn’t update to its newest version in the Windows 10 system installation). @stillyslalom, thanks a lot!!!

julia> @btime calcNext0!(A,Anew)
  758.000 μs (1 allocation: 16 bytes)

julia> @btime calcNext1!(A,Anew)
  299.400 μs (1 allocation: 16 bytes)

julia> @btime calcNext2!(A,Anew)
  179.300 μs (1 allocation: 16 bytes)

I did this just now and then updated packages. This is what happened.

(@v1.4) pkg> add LoopVectorization@0.8
   Updating registry at `C:\Users\Ashok Kumar Mishra\.julia\registries\General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
  Installed RecursiveFactorization ─ v0.1.4
  Installed VectorizationBase ────── v0.12.29
  Installed SIMDPirates ──────────── v0.8.21
  Installed SLEEFPirates ─────────── v0.5.5
  Installed NNlib ────────────────── v0.7.3
  Installed Arpack ───────────────── v0.3.2
  Installed FinEtools ────────────── v4.0.5
  Installed Zygote ───────────────── v0.5.3
  Installed LoopVectorization ────── v0.8.20
   Updating `C:\Users\Ashok Kumar Mishra\.julia\environments\v1.4\Project.toml`
  [91bb5406] ↓ FinEtools v4.3.0 ⇒ v4.0.5
  [bdcacae8] ↑ LoopVectorization v0.6.30 ⇒ v0.8.20
  [872c559c] ↑ NNlib v0.7.1 ⇒ v0.7.3
   Updating `C:\Users\Ashok Kumar Mishra\.julia\environments\v1.4\Manifest.toml`
  [7d9fca2a] ↓ Arpack v0.4.0 ⇒ v0.3.2
  [68821587] - Arpack_jll v3.5.0+3
  [91bb5406] ↓ FinEtools v4.3.0 ⇒ v4.0.5
  [bdcacae8] ↑ LoopVectorization v0.6.30 ⇒ v0.8.20
  [872c559c] ↑ NNlib v0.7.1 ⇒ v0.7.3
  [f2c3362d] ↑ RecursiveFactorization v0.1.0 ⇒ v0.1.4
  [21efa798] ↑ SIMDPirates v0.7.17 ⇒ v0.8.21
  [476501e8] ↑ SLEEFPirates v0.4.7 ⇒ v0.5.5
  [3a884ed6] ↑ UnPack v0.1.0 ⇒ v1.0.1
  [3d5dd08c] ↑ VectorizationBase v0.10.5 ⇒ v0.12.29
  [e88e6eb3] ↑ Zygote v0.5.2 ⇒ v0.5.3
   Building SLEEFPirates → `C:\Users\Ashok Kumar Mishra\.julia\packages\SLEEFPirates\jGsib\deps\build.log`
   Building Arpack ──────→ `C:\Users\Ashok Kumar Mishra\.julia\packages\Arpack\zCmTA\deps\build.log`

(@v1.4) pkg> update
   Updating registry at `C:\Users\Ashok Kumar Mishra\.julia\registries\General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Updating `C:\Users\Ashok Kumar Mishra\.julia\environments\v1.4\Project.toml`
  [91bb5406] ↑ FinEtools v4.0.5 ⇒ v4.3.0
  [bdcacae8] ↓ LoopVectorization v0.8.20 ⇒ v0.6.30
  [872c559c] ↓ NNlib v0.7.3 ⇒ v0.7.1
   Updating `C:\Users\Ashok Kumar Mishra\.julia\environments\v1.4\Manifest.toml`
  [7d9fca2a] ↑ Arpack v0.3.2 ⇒ v0.4.0
  [68821587] + Arpack_jll v3.5.0+3
  [91bb5406] ↑ FinEtools v4.0.5 ⇒ v4.3.0
  [bdcacae8] ↓ LoopVectorization v0.8.20 ⇒ v0.6.30
  [872c559c] ↓ NNlib v0.7.3 ⇒ v0.7.1
  [f2c3362d] ↓ RecursiveFactorization v0.1.4 ⇒ v0.1.0
  [21efa798] ↓ SIMDPirates v0.8.21 ⇒ v0.7.17
  [476501e8] ↓ SLEEFPirates v0.5.5 ⇒ v0.4.7
  [3a884ed6] ↓ UnPack v1.0.1 ⇒ v0.1.0
  [3d5dd08c] ↓ VectorizationBase v0.12.29 ⇒ v0.10.5
  [e88e6eb3] ↓ Zygote v0.5.3 ⇒ v0.5.2

Maybe I should open another thread for this; it’s unrelated to the topic. @Elrod?

Aside from the other improvements suggested in this thread, I would say that a factor of 1.33 is pretty “near”. This is well within the variance you would see from typical micro-optimizations and compiler tweaks; it’s like the difference you might see from switching from one Fortran compiler to another (as opposed to the orders-of-magnitude change you would see by going to non-vectorized Python or Matlab code).

And it will be completely swamped by what you would get from algorithmic improvements (e.g. switching to a sparse-direct solver), which are much easier in a higher-level language.
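To illustrate that last point, a sparse-direct solve of the same 5-point Laplacian is only a few lines in Julia. This is a hypothetical sketch using only the standard libraries; the function name and signature are mine, not from the thread:

```julia
using SparseArrays, LinearAlgebra

# Hypothetical sketch: solve the discrete Laplace problem on the n×n interior
# directly with a sparse factorization, instead of iterating to convergence.
function laplace_direct(n, rhs)
    # 1-D second-difference matrix and a sparse identity
    T  = spdiagm(-1 => fill(1.0, n-1), 0 => fill(-2.0, n), 1 => fill(1.0, n-1))
    Id = sparse(1.0I, n, n)
    L  = kron(Id, T) + kron(T, Id)       # 2-D 5-point Laplacian (Kronecker sum)
    return reshape(L \ vec(rhs), n, n)   # `\` dispatches to a sparse-direct solver
end
```

Here `rhs` would carry the boundary contributions; one factorization replaces thousands of Jacobi sweeps, at the cost of fill-in memory.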


That should be solved via this PR.

EDIT:
Also of interest, using LinuxPerf.jl (I removed a run to compile foreachf!):

julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf! (generic function with 1 method) 

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" (@time foreachf!(calcNext2!, 10_000, Anew, A))
  2.104791 seconds (10.00 k allocations: 156.312 KiB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
β•Ά cpu-cycles               8.60e+09   30.1%  #  4.1 cycles per ns
β”Œ instructions             3.54e+09   30.0%  #  0.4 insns per cycle
β”‚ branch-instructions      4.54e+07   30.0%  #  1.3% of instructions
β”” branch-misses            2.11e+05   30.0%  #  0.5% of branch instructions
β”Œ task-clock               2.10e+09  100.0%
β”‚ context-switches         4.00e+00  100.0%
β”‚ cpu-migrations           0.00e+00  100.0%
β”” page-faults              0.00e+00  100.0%
β”Œ L1-dcache-load-misses    9.07e+08   10.0%  # 78.8% of loads
β”‚ L1-dcache-loads          1.15e+09   10.0%
β”” L1-icache-load-misses    4.90e+06   10.0%
β”Œ dTLB-load-misses         8.52e+02   10.0%
β”” dTLB-loads               1.15e+09   10.0%
β”Œ iTLB-load-misses         3.83e+04   20.0%
β”” iTLB-loads               1.32e+03   20.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @btime calcNext2!($Anew, $A)
  194.180 μs (0 allocations: 0 bytes)
0.9259296647327868

0.4 insns per clock cycle is really bad, as is 78.8% of loads being L1-dcache misses.
For comparison, here is the @pstats for calcNext1!:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" (@time foreachf!(calcNext1!, 10_000, Anew, A))
  3.097806 seconds (10.01 k allocations: 156.469 KiB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
β•Ά cpu-cycles               1.42e+10   30.0%  #  4.6 cycles per ns
β”Œ instructions             3.13e+10   10.0%  #  2.2 insns per cycle
β”‚ branch-instructions      2.61e+09   10.0%  #  8.3% of instructions
β”” branch-misses            5.24e+06   10.0%  #  0.2% of branch instructions
β”Œ task-clock               3.10e+09  100.0%
β”‚ context-switches         4.00e+00  100.0%
β”‚ cpu-migrations           0.00e+00  100.0%
β”” page-faults              0.00e+00  100.0%
β”Œ L1-dcache-load-misses    6.56e+08   10.0%  #  5.0% of loads
β”‚ L1-dcache-loads          1.30e+10   10.0%
β”” L1-icache-load-misses    3.70e+06   10.0%
β”Œ dTLB-load-misses         7.42e+02   10.0%
β”” dTLB-loads               1.30e+10   10.0%
β”Œ iTLB-load-misses         4.36e+04   19.9%
β”” iTLB-loads               4.21e+03   19.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Only 5% of loads are misses, 2.2 instructions/cycle, and 4.6 cycles/nanosecond (instead of 4.1). But it also required nearly 9x the instructions.


If you’re having issues upgrading a package, it usually means that one of the packages in your environment has a dependency that limits the version number of the package that you want to upgrade. Often you can solve this by creating a new environment and only installing the bare minimum packages that you need for the project in that environment. See the Pkg docs here and here for more info.
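For example, in the Pkg REPL (the environment name here is made up; any fresh directory works):

```
(@v1.4) pkg> activate laplace-bench

(laplace-bench) pkg> add LoopVectorization@0.8
```

Because the new environment contains only what you add to it, an unrelated package’s compat bound no longer constrains the resolver.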


Thanks for elaborating. That package is FinEtools. You can spot it in the list of packages as downgrading when LoopVectorization upgrades and upgrading when LoopVectorization downgrades.

If the newer version is compatible with the package placing the upper bound, you could also create a PR to relax the bounds.


You’re right: FinEtools caused the problem, and I was able to run the code after removing FinEtools and updating packages. Also, thanks for creating the PR. I am a noob at helping developers that way, and will learn how to do it.

@CameronBieganek I guess I never saw the need to learn to use project environments, but you are absolutely right that I have to use one in this case. Thanks for the pointer!
