Trying to speed up a Laplace equation solver

I am trying to speed up my Laplace equation solver. Based on profiling, most of the time is spent in the function given below. I have optimised the function as much as possible, but the performance still isn’t close to that of an available Fortran code (which runs about 33% faster). Can anyone tell me how I can improve my code beyond this?

    function calcNext!(Anew, A, m, n)
        error = 0.0
        @inbounds for j = 2:(m-1), i = 2:(n-1)
            @fastmath Anew[i,j] = 0.25 * ( A[i+1, j ]
                                         + A[i-1, j ]
                                         + A[ i ,j-1]
                                         + A[ i ,j+1] )
            error = max(error, abs(Anew[i,j] - A[i,j]))
        end
        return error
    end

The following is the time taken by each part of the code, as per TimerOutputs.

 ──────────────────────────────────────────────────────────────────────
                               Time                   Allocations
                       ──────────────────────   ───────────────────────
   Tot / % measured:        9.90s / 100%            12.4KiB / 86.2%

 Section       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────
 Loop               1    9.90s   100%   9.90s   10.7KiB  100%   10.7KiB
   calcNext!    1.00k    6.92s  69.9%  6.92ms     0.00B  0.00%    0.00B
   Copy         1.00k    2.96s  29.9%  2.96ms     0.00B  0.00%    0.00B
 Print          1.00k   13.9ms  0.14%  13.9μs   8.19KiB  76.7%    8.38B
 ──────────────────────────────────────────────────────────────────────
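For completeness, a table like this is produced by wrapping each phase of the loop in TimerOutputs.jl `@timeit` sections. A minimal self-contained sketch (the broadcast below is just a stand-in, not my actual kernel):

```julia
using TimerOutputs

# Minimal sketch of the instrumentation behind the table above: each phase
# of the iteration loop gets a named @timeit section; `show(to)` prints the
# accumulated timing table. The broadcast is a stand-in for the real kernel.
const to = TimerOutput()
A = rand(512, 512); Anew = similar(A)
@timeit to "Loop" for iter in 1:100
    @timeit to "calc" (Anew .= 0.25 .* A)
    @timeit to "Copy" copyto!(A, Anew)
end
show(to)
```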

It looks like your solver is Jacobi iteration. You might do better with a different solver rather than tuning this one.

Or is your loop a smoother for multigrid?
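For reference, here is a hypothetical sketch of a successive over-relaxation (SOR) sweep for the same 5-point stencil (the function name and the choice of ω are mine, not from any code in this thread). It updates in place, and with a well-chosen ω it converges in far fewer iterations than Jacobi:

```julia
# Hypothetical sketch: one in-place SOR sweep for the 5-point Laplace stencil.
# ω = 1 reduces to Gauss–Seidel; 1 < ω < 2 over-relaxes and converges faster.
function sor_sweep!(A, ω = 1.5)
    m, n = size(A)
    err = zero(eltype(A))
    @inbounds for j = 2:(n-1), i = 2:(m-1)
        old = A[i, j]
        gs  = 0.25 * (A[i+1, j] + A[i-1, j] + A[i, j-1] + A[i, j+1])
        A[i, j] = old + ω * (gs - old)
        err = max(err, abs(A[i, j] - old))   # same convergence measure as calcNext!
    end
    return err
end
```

The price is a loop-carried dependence that blocks SIMD vectorization of the inner loop; red-black ordering is the usual remedy if you want both.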

I just want to see how close I can get Julia to Fortran. I used a Fortran code provided by NVIDIA for this. The function I pasted in this question is based on the Fortran function given below.

    function calcNext(A, Anew, m, n)
      integer, parameter          :: fp_kind=kind(1.0d0)
      real(fp_kind)               :: calcNext  ! result type; implicit typing would make it single precision
      real(fp_kind),intent(inout) :: A(0:n-1,0:m-1)
      real(fp_kind),intent(inout) :: Anew(0:n-1,0:m-1)
      integer,intent(in)          :: m, n
      integer                     :: i, j
      real(fp_kind)               :: error

      error=0.0_fp_kind

      !$acc parallel loop default(present) &
      !$acc   reduction(max:error)
      do j=1,m-2
        do i=1,n-2
          Anew(i,j) = 0.25_fp_kind * ( A(i+1,j  ) + A(i-1,j  ) + &
                                       A(i  ,j-1) + A(i  ,j+1) )
          error = max( error, abs(Anew(i,j)-A(i,j)) )
        end do
      end do
      calcNext = error
    end function calcNext

Julia takes 34 seconds and Fortran takes 28 seconds (about 20% faster). I ran the Fortran code with a single thread, so the !$acc lines can be ignored.

You can attain a significant speedup by moving the @fastmath annotation to the outer loop so that it applies to the whole loop body (including the max/abs reduction), and it’s faster still with LoopVectorization.jl’s @avx macro:


function calcNext0!(Anew, A)
    m, n = size(A)
    error = 0.0
    @inbounds for j = 2:(m-1)
        for i = 2:(n-1)
            @fastmath Anew[i,j] = 0.25*( A[i+1, j ]
                            + A[i-1, j ]
                            + A[ i ,j-1]
                            + A[ i ,j+1]    )
            error = max(error, abs( Anew[i,j] - A[i,j] ))
        end
    end
    return error
end

function calcNext1!(Anew, A)
    m, n = size(A)
    error = 0.0
    @inbounds @fastmath for j = 2:(m-1)
        for i = 2:(n-1)
            Anew[i,j] = 0.25*( A[i+1, j ]
                            + A[i-1, j ]
                            + A[ i ,j-1]
                            + A[ i ,j+1]    )
            error = max(error, abs( Anew[i,j] - A[i,j] ))
        end
    end
    return error
end

function calcNext2!(Anew, A)
    error = zero(eltype(A))
    m, n = size(A)
    @avx for j = 2:(m-1)
        for i = 2:(n-1)
            Anew[i,j] = 0.25*( A[i+1, j ]
                             + A[i-1, j ]
                             + A[ i ,j-1]
                             + A[ i ,j+1])
            error = max(error, abs(Anew[i,j] - A[i,j]))
        end
    end
    return error
end

julia> A = zeros(512, 512)

julia> Anew = copy(A)

julia> @btime calcNext0!(Anew, A)
  920.201 μs (1 allocation: 16 bytes)
0.5868908386484257

julia> @btime calcNext1!(Anew, A)
  430.900 μs (1 allocation: 16 bytes)
0.5868908386484257

julia> @btime calcNext2!(Anew, A)
  356.600 μs (1 allocation: 16 bytes)
0.5868908386484257
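As an aside, the `1 allocation: 16 bytes` above is an artifact of benchmarking with non-constant globals; interpolating the arguments with `$` removes it. A minimal self-contained illustration, assuming BenchmarkTools.jl is installed:

```julia
using BenchmarkTools

x = rand(1000)   # non-constant global

# Without `$`, `x` is an untyped global, so the call is type-unstable and
# @btime reports a spurious heap allocation for the boxed result.
@btime sum(x)

# With `$`, the array is interpolated into the benchmark as a constant,
# so the timing reflects only the function itself (0 allocations).
@btime sum($x)
```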

Check out the Julia code in this benchmark: Interesting NASA "Language-Comparison" repository


Thanks a lot! I got the speedup by moving the @fastmath annotation to the outer loop. However, I got an error when I tried to use the @avx macro. Also (probably unrelated), the LoopVectorization.jl package installation showed an error for vsub.

julia> @btime calcNext0!(A,Anew)
  755.200 μs (1 allocation: 16 bytes)

julia> @btime calcNext1!(A,Anew)
  300.799 μs (1 allocation: 16 bytes)

julia> calcNext2!(A,Anew)
ERROR: BoundsError: attempt to access (LoopVectorization.LoopValue(), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000015ed080, (512,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000015ed080, (512,)), LoopVectorization.LoopValue(), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000015ed080, (512,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000015ed080, (512,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000013d2080, (512,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00000000015ed080, (512,)), 0.0, abs)
  at index [11]
Stacktrace:
 [1] getindex(::Tuple, ::Int64) at .\tuple.jl:24
 [2] macro expansion at C:\Users\Ashok Kumar Mishra\.julia\packages\LoopVectorization\sJUjl\src\reconstruct_loopset.jl:268 [inlined]
 [3] _avx_! at C:\Users\Ashok Kumar Mishra\.julia\packages\LoopVectorization\sJUjl\src\reconstruct_loopset.jl:268 [inlined]
 [4] calcNext2!(::Array{Float64,2}, ::Array{Float64,2}) at C:\Users\Ashok Kumar Mishra\Documents\My Google Drive\RESEARCH\GIT\ShagunWork\Trials.jl:34
 [5] top-level scope at none:0

What version of Julia are you on? What version of LoopVectorization?
Your error shows:

_avx_! at C:\Users\Ashok Kumar Mishra\.julia\packages\LoopVectorization\sJUjl\src\reconstruct_loopset.jl:268

But _avx_! hasn’t been at src\reconstruct_loopset.jl:268 for a very long time (the line number should be much higher).

LoopVectorization won’t work on Julia 1.0, but should work on 1.1 and newer.
Julia 1.4 or newer are recommended (for much faster compile times).

I am on Julia 1.4.2.

And what version of LoopVectorization?

julia> using Pkg

julia> Pkg.status("LoopVectorization")
Status `~/.julia/environments/v1.6/Project.toml`
  [bdcacae8] LoopVectorization v0.8.20 `~/.julia/dev/LoopVectorization`

(It shouldn’t be in dev for you, but update your packages if it isn’t 0.8.20.)

I just updated my packages and the version changed from 0.6.21 to 0.6.30.
Should I try to get it from the repository and install it that way?

OK, a funny thing happened. I installed the package in my WSL-2 Julia installation and its version is 0.8.13. In my Windows installation, I was only able to install up to 0.6.30. I guess this issue cannot be reproduced.

Maybe you have a package on your Windows installation holding it back?
What if you try ] add LoopVectorization@0.8?

FWIW, I get:

julia> A = rand(512, 512); Anew = similar(A);

julia> @btime calcNext0!($Anew, $A)
  752.933 μs (0 allocations: 0 bytes)
0.9497299087695734

julia> @btime calcNext1!($Anew, $A)
  274.262 μs (0 allocations: 0 bytes)
0.9497299087695734

julia> @btime calcNext2!($Anew, $A)
  205.201 μs (0 allocations: 0 bytes)
0.9497299087695734

I got a superb speedup in WSL-2 on Windows 10 (quite surprised that the package didn’t update to its newest version in the Windows 10 system installation). @stillyslalom, thanks a lot!!!

julia> @btime calcNext0!(A,Anew)
  758.000 μs (1 allocation: 16 bytes)

julia> @btime calcNext1!(A,Anew)
  299.400 μs (1 allocation: 16 bytes)

julia> @btime calcNext2!(A,Anew)
  179.300 μs (1 allocation: 16 bytes)

I did this just now and then updated packages. This is what happened.

(@v1.4) pkg> add LoopVectorization@0.8
   Updating registry at `C:\Users\Ashok Kumar Mishra\.julia\registries\General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
  Installed RecursiveFactorization ─ v0.1.4
  Installed VectorizationBase ────── v0.12.29
  Installed SIMDPirates ──────────── v0.8.21
  Installed SLEEFPirates ─────────── v0.5.5
  Installed NNlib ────────────────── v0.7.3
  Installed Arpack ───────────────── v0.3.2
  Installed FinEtools ────────────── v4.0.5
  Installed Zygote ───────────────── v0.5.3
  Installed LoopVectorization ────── v0.8.20
   Updating `C:\Users\Ashok Kumar Mishra\.julia\environments\v1.4\Project.toml`
  [91bb5406] ↓ FinEtools v4.3.0 ⇒ v4.0.5
  [bdcacae8] ↑ LoopVectorization v0.6.30 ⇒ v0.8.20
  [872c559c] ↑ NNlib v0.7.1 ⇒ v0.7.3
   Updating `C:\Users\Ashok Kumar Mishra\.julia\environments\v1.4\Manifest.toml`
  [7d9fca2a] ↓ Arpack v0.4.0 ⇒ v0.3.2
  [68821587] - Arpack_jll v3.5.0+3
  [91bb5406] ↓ FinEtools v4.3.0 ⇒ v4.0.5
  [bdcacae8] ↑ LoopVectorization v0.6.30 ⇒ v0.8.20
  [872c559c] ↑ NNlib v0.7.1 ⇒ v0.7.3
  [f2c3362d] ↑ RecursiveFactorization v0.1.0 ⇒ v0.1.4
  [21efa798] ↑ SIMDPirates v0.7.17 ⇒ v0.8.21
  [476501e8] ↑ SLEEFPirates v0.4.7 ⇒ v0.5.5
  [3a884ed6] ↑ UnPack v0.1.0 ⇒ v1.0.1
  [3d5dd08c] ↑ VectorizationBase v0.10.5 ⇒ v0.12.29
  [e88e6eb3] ↑ Zygote v0.5.2 ⇒ v0.5.3
   Building SLEEFPirates → `C:\Users\Ashok Kumar Mishra\.julia\packages\SLEEFPirates\jGsib\deps\build.log`
   Building Arpack ──────→ `C:\Users\Ashok Kumar Mishra\.julia\packages\Arpack\zCmTA\deps\build.log`

(@v1.4) pkg> update
   Updating registry at `C:\Users\Ashok Kumar Mishra\.julia\registries\General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Updating `C:\Users\Ashok Kumar Mishra\.julia\environments\v1.4\Project.toml`
  [91bb5406] ↑ FinEtools v4.0.5 ⇒ v4.3.0
  [bdcacae8] ↓ LoopVectorization v0.8.20 ⇒ v0.6.30
  [872c559c] ↓ NNlib v0.7.3 ⇒ v0.7.1
   Updating `C:\Users\Ashok Kumar Mishra\.julia\environments\v1.4\Manifest.toml`
  [7d9fca2a] ↑ Arpack v0.3.2 ⇒ v0.4.0
  [68821587] + Arpack_jll v3.5.0+3
  [91bb5406] ↑ FinEtools v4.0.5 ⇒ v4.3.0
  [bdcacae8] ↓ LoopVectorization v0.8.20 ⇒ v0.6.30
  [872c559c] ↓ NNlib v0.7.3 ⇒ v0.7.1
  [f2c3362d] ↓ RecursiveFactorization v0.1.4 ⇒ v0.1.0
  [21efa798] ↓ SIMDPirates v0.8.21 ⇒ v0.7.17
  [476501e8] ↓ SLEEFPirates v0.5.5 ⇒ v0.4.7
  [3a884ed6] ↓ UnPack v1.0.1 ⇒ v0.1.0
  [3d5dd08c] ↓ VectorizationBase v0.12.29 ⇒ v0.10.5
  [e88e6eb3] ↓ Zygote v0.5.3 ⇒ v0.5.2

Maybe I should open another thread for this; it’s unrelated to the topic. @Elrod?

Aside from the other improvements suggested in this thread, I would say that a factor of 1.33 is pretty “near”. This is well within the variance you would see from typical micro-optimizations and compiler tweaks; it’s like the difference you might see from switching from one Fortran compiler to another (as opposed to the orders-of-magnitude change you would see by going to non-vectorized Python or Matlab code).

And it will be completely swamped by what you would get from algorithmic improvements (e.g. switching to a sparse-direct solver), which are much easier in a higher-level language.
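To illustrate that last point, a sparse-direct solve of the same 5-point Laplacian is only a few lines in Julia. This is a hypothetical sketch using only the standard libraries; the function name and signature are mine, not from the thread:

```julia
using SparseArrays, LinearAlgebra

# Hypothetical sketch: solve the discrete Laplace problem on the n×n interior
# directly with a sparse factorization, instead of iterating to convergence.
function laplace_direct(n, rhs)
    # 1-D second-difference matrix and a sparse identity
    T  = spdiagm(-1 => fill(1.0, n-1), 0 => fill(-2.0, n), 1 => fill(1.0, n-1))
    Id = sparse(1.0I, n, n)
    L  = kron(Id, T) + kron(T, Id)       # 2-D 5-point Laplacian (Kronecker sum)
    return reshape(L \ vec(rhs), n, n)   # `\` dispatches to a sparse-direct solver
end
```

Here `rhs` would carry the boundary contributions; one factorization replaces thousands of Jacobi sweeps, at the cost of fill-in memory.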


That should be solved via this PR.

EDIT:
Also of interest, using LinuxPerf.jl (I removed a run to compile foreachf!):

julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf! (generic function with 1 method) 

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" (@time foreachf!(calcNext2!, 10_000, Anew, A))
  2.104791 seconds (10.00 k allocations: 156.312 KiB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
β•Ά cpu-cycles               8.60e+09   30.1%  #  4.1 cycles per ns
β”Œ instructions             3.54e+09   30.0%  #  0.4 insns per cycle
β”‚ branch-instructions      4.54e+07   30.0%  #  1.3% of instructions
β”” branch-misses            2.11e+05   30.0%  #  0.5% of branch instructions
β”Œ task-clock               2.10e+09  100.0%
β”‚ context-switches         4.00e+00  100.0%
β”‚ cpu-migrations           0.00e+00  100.0%
β”” page-faults              0.00e+00  100.0%
β”Œ L1-dcache-load-misses    9.07e+08   10.0%  # 78.8% of loads
β”‚ L1-dcache-loads          1.15e+09   10.0%
β”” L1-icache-load-misses    4.90e+06   10.0%
β”Œ dTLB-load-misses         8.52e+02   10.0%
β”” dTLB-loads               1.15e+09   10.0%
β”Œ iTLB-load-misses         3.83e+04   20.0%
β”” iTLB-loads               1.32e+03   20.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @btime calcNext2!($Anew, $A)
  194.180 μs (0 allocations: 0 bytes)
0.9259296647327868

0.4 insns per clock cycle is really bad, as is 78.8% of loads being L1-dcache misses.
For comparison, here is the @pstats for calcNext1!:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" (@time foreachf!(calcNext1!, 10_000, Anew, A))
  3.097806 seconds (10.01 k allocations: 156.469 KiB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
β•Ά cpu-cycles               1.42e+10   30.0%  #  4.6 cycles per ns
β”Œ instructions             3.13e+10   10.0%  #  2.2 insns per cycle
β”‚ branch-instructions      2.61e+09   10.0%  #  8.3% of instructions
β”” branch-misses            5.24e+06   10.0%  #  0.2% of branch instructions
β”Œ task-clock               3.10e+09  100.0%
β”‚ context-switches         4.00e+00  100.0%
β”‚ cpu-migrations           0.00e+00  100.0%
β”” page-faults              0.00e+00  100.0%
β”Œ L1-dcache-load-misses    6.56e+08   10.0%  #  5.0% of loads
β”‚ L1-dcache-loads          1.30e+10   10.0%
β”” L1-icache-load-misses    3.70e+06   10.0%
β”Œ dTLB-load-misses         7.42e+02   10.0%
β”” dTLB-loads               1.30e+10   10.0%
β”Œ iTLB-load-misses         4.36e+04   19.9%
β”” iTLB-loads               4.21e+03   19.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Only 5% of loads are misses, 2.2 instructions/cycle, and 4.6 cycles/nanosecond (instead of 4.1). But it also required nearly 9x the instructions.


If you’re having issues upgrading a package, it usually means that one of the packages in your environment has a dependency that limits the version number of the package that you want to upgrade. Often you can solve this by creating a new environment and only installing the bare minimum packages that you need for the project in that environment. See the Pkg docs here and here for more info.
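For example, in the Pkg REPL (the environment name here is made up; any fresh directory works):

```
(@v1.4) pkg> activate laplace-bench

(laplace-bench) pkg> add LoopVectorization@0.8
```

Because the new environment contains only what you add to it, an unrelated package’s compat bound no longer constrains the resolver.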


Thanks for elaborating. That package is FinEtools. You can spot it in the list of packages as downgrading when LoopVectorization upgrades and upgrading when LoopVectorization downgrades.

If the newer version is compatible with the package placing the upper bound, you could also create a PR to relax the bounds.


You’re right: FinEtools caused the problem, and I was able to run the code after removing FinEtools and updating packages. Also, thanks for creating the PR. I am a noob at helping developers that way, and will learn how to do it.

@CameronBieganek I guess I never saw the need to learn to use project environments, but you are absolutely right that I have to use one in this case. Thanks for the pointer!
