Trying to speed up Laplace Equation Solver

@stevengj I agree it's fast enough for practical purposes. I just wanted to be sure that I have tried out all possible options and am achieving the maximum speedup I can get. It's just for my practice with the language.

@PetrKryslUCSD This is a great link you directed me to! In fact, I was trying this with NVIDIA codes for C and Fortran.

Just curious: Why is the registered LoopVectorization stuck at 0.6.30?

I tried these improvements in three places: a Windows 10 system, Ubuntu 20.04 under WSL-2 on the same Windows 10 system, and an Ubuntu 16.04 server. Specifically, I compared using @inbounds @fastmath against using @avx.

I was able to get improvements on Windows 10 as expected: my code ran 25% faster after adding the @avx macro. However, surprisingly, I could not find any difference whatsoever on the WSL-2 Ubuntu or the Ubuntu server. How is this possible? (I may be wrong, but I have the feeling that Windows 10 is able to extract these benefits from the Intel CPU while Ubuntu is not.)

Because FinEtools places an upper bound on LoopVectorization at "0.6". Without this upper bound, the package manager can select 0.8.20.
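For reference, a rough sketch of how to see this from the package manager (assuming FinEtools is in the active environment; the exact error text varies by Pkg version):

using Pkg
# Asking for a newer LoopVectorization than the compat bounds allow makes the
# resolver fail and list the packages whose [compat] entries are blocking it.
Pkg.add(PackageSpec(name = "LoopVectorization", version = "0.8.20"))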

Could you provide versioninfo() for all three?
As this was the same system as the Windows benchmarks, what were the relative timings between them?

Also, you could try

function calcNext2!(Anew, A)
    error = zero(eltype(A))
    m, n = size(A)
    @avx unroll=(1,1) for j = 2:(m-1)
        for i = 2:(n-1)
            Anew[i,j] = 0.25*( A[i+1, j ]
                             + A[i-1, j ]
                             + A[ i ,j-1]
                             + A[ i ,j+1])
            error = max(error, abs(Anew[i,j] - A[i,j]))
        end
    end
    return error
end

to prevent unrolling the loops. This seems to help performance slightly.
It increases the number of instructions needed to evaluate the loop by about 30%, but it decreases the number of L1d misses, which in turn increases the instructions/clock.
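To quantify the difference yourself, here is a quick sketch with BenchmarkTools (assuming both your original four-argument calcNext! and the calcNext2! above are defined):

using BenchmarkTools

A = rand(4096, 4096); Anew = similar(A);
@btime calcNext!($Anew, $A, 4096, 4096)   # plain @avx version
@btime calcNext2!($Anew, $A)              # @avx unroll=(1,1) version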

What I said earlier about the @pstats macro (which you can try via LinuxPerf on the Ubuntu server, assuming /proc/sys/kernel/perf_event_paranoid is <= 1, or that you have permission to sudo sh -c "echo 1 > /proc/sys/kernel/perf_event_paranoid") is that when you use @avx, the CPU has to do only a small fraction of the work it does without it, but the CPU then goes from being very busy to spending most of its time idle, doing nothing while waiting for memory to arrive.
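A rough sketch of such a measurement with LinuxPerf (Linux only; the event string below is only an example and may need adjusting for your CPU):

using LinuxPerf  # ] add LinuxPerf

A = rand(4096, 4096); Anew = similar(A);
calcNext2!(Anew, A)  # compile first, so the counters measure only the kernel

# Count instructions, cycles, and L1 data-cache loads/misses over many sweeps.
@pstats "(instructions,cpu-cycles),(L1-dcache-loads,L1-dcache-load-misses)" begin
    for _ in 1:100
        calcNext2!(Anew, A)
    end
end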

I suspect that the hardware prefetcher doesn't know what to do with the memory access pattern when we don't unroll, and either gives up or (worse) prefetches data we don't need. But I'm no expert when it comes to this.


Of course, I forgot that when I do "update" it will not tell me that a package update is possible.

This is why I originally considered the idea of forcing all releases to have upper bounds controversial (since there is no good way to be notified about held-back updates). A possible aid is to subscribe to the releases of dependencies on GitHub.

WSL-2 Ubuntu 20.04 in Windows 10:

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PKG_SERVER = pkg.juliacomputing.com
  JULIA_DEPOT_PATH = /home/sambit98/.juliapro/JuliaPro_v1.4.2-1:/mnt/d/SAMBIT/JULIA/JuliaInstallation/Jn/JuliaPro-1.4.2-1/Julia/local/share/julia:/mnt/d/SAMBIT/JULIA/JuliaInstallation/JuliaPro-1.4.2-1/Ju/shlia/share/julia

Ubuntu 16.04 Server:

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Windows 10:

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)      
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Users\Ashok Kumar Mishra\AppData\Local\atom\app-1.49.0\atom.exe"  -a
  JULIA_NUM_THREADS = 4

The following are the times taken. I ran the code 2-3 times and took the time of the last execution in each case. I ran Julia in a terminal on Ubuntu and in Juno on Windows 10, and tried to keep all other factors as consistent as possible.

Using @inbounds @fastmath
WSL-2 Ubuntu 20.04 : 38.616142 seconds (674.78 k allocations: 31.788 MiB, 0.07% gc time)
Server Ubuntu 16.04 : 34.896664 seconds (674.78 k allocations: 31.786 MiB, 0.05% gc time)
Windows 10 : 39.097357 seconds (672.78 k allocations: 31.679 MiB)

Using @avx
WSL-2 Ubuntu 20.04 : 42.330296 seconds (14.95 M allocations: 742.403 MiB, 0.48% gc time)
Server Ubuntu 16.04 : 35.750165 seconds (13.52 M allocations: 660.662 MiB, 0.34% gc time)
Windows 10 : 38.948415 seconds (13.47 M allocations: 658.412 MiB, 0.71% gc time)

I realised that I had incorrectly concluded that @avx makes a significant difference on Windows 10 alone: I was using Revise to change the function contents and re-execute, which is what had given me the 25% speedup. So, based on these executions, @avx is actually slightly decreasing the speed.

Using @avx unroll=(1,1)
WSL-2 Ubuntu 20.04: 39.749676 seconds (14.23 M allocations: 706.557 MiB, 0.47% gc time)
Server Ubuntu 16.04: 30.470257 seconds (12.75 M allocations: 622.198 MiB, 0.40% gc time)
Windows 10: 36.784031 seconds (12.71 M allocations: 619.726 MiB, 0.73% gc time)

So the last suggestion is the best, although it's still slower than the Fortran code, which takes 28 seconds on the Ubuntu server.

Based on https://www.youtube.com/watch?v=xPhnJCAkI4k I think these issues will be gone by v1.5. I may be wrong, since as a noob I didn't understand much of the video.

That's a lot of allocations. Are your matrices non-const globals? Also, if you run to convergence, you may see a runtime difference due to a different number of iterations between implementations.
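(For context, a minimal illustration of why I ask, with hypothetical names:)

B = rand(4096, 4096)         # non-const global: its type can change at any time,
f() = sum(B)                 # so this call cannot be inferred and allocates on every use
const C = rand(4096, 4096)
g() = sum(C)                 # type is fixed, so this compiles to allocation-free code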

Matrices are not globals. I have initialised them and used them as shown below.

push!(LOAD_PATH,".");
using laplace2d

const n = 4096;
const m = 4096;

A,Anew = initialize(m,n);
println("Jacobi relaxation Calculation: ",n,"x",m," mesh");
@time Iterate!(A,Anew,m,n);

In it, I used the following function:

    function initialize(m,n)
        A      = zeros(Float64,n,m);
        A[1,:] =  ones(Float64,n  );
        Anew = deepcopy(A);
        return(A,Anew)
    end

A fixed number of 1000 iterations is executed every time.

Although I have a feeling that the definition of the arrays could be modified to suit my needs better, I don't know how to do that (or whether there is any scope for improvement in that direction).

And, why are there so many more allocations with the @avx versions than the @inbounds @fastmath versions?

Is this counting compilation?

As for initialize, do you really need Anew to be a copy of A? In the calcNext function, you're overwriting its contents without reading them. You also don't need to make a vector of ones:

function initialize(m,n)
    A      = zeros(Float64,n,m);
    A[1,:] .=  1
    Anew = similar(A);
    return(A,Anew)
end

I did not pay much attention to initialization since the bottleneck of the code was the calcNext! and copyto! functions. I am comparing my results with the Fortran code. The error in that code is calculated including the edges of the domain, which are never modified during the simulation, so using similar is not preferred here. I have incorporated the other changes.

I am not sure about this; can you tell me how to find out? I ran 2 simulations consecutively while using Revise and got the following results.
RUN 1: 36.588347 seconds (12.55 M allocations: 612.136 MiB, 0.32% gc time)
RUN 2: 30.291741 seconds (174 allocations: 8.188 KiB)

Also, it seems like this is getting more involved, so I have included the entire code below: jacobi.jl is the main program and laplace2d.jl is written as a module.

jacobi.jl

push!(LOAD_PATH,".");

using laplace2d

const n = 4096;
const m = 4096;

A,Anew = initialize(m,n);

println("Jacobi relaxation Calculation: ",n,"x",m," mesh");

@time Iterate!(A,Anew,m,n);

laplace2d.jl

module laplace2d

using LoopVectorization

export initialize
export Iterate!

    function initialize(m,n)

        A      = zeros(Float64,n,m);
        A[1,:] .=  1

        Anew = deepcopy(A);

        return(A,Anew)
    end

    function calcNext!(Anew,A,m,n)
        error = 0.0;

        @avx unroll=(1,1) for j=2:(m-1)
            for i = 2:(n-1)

                 Anew[i,j] = 0.25*( A[i+1, j ]
                                  + A[i-1, j ]
                                  + A[ i ,j-1]
                                  + A[ i ,j+1]    );

                error = max(error, abs( Anew[i,j] - A[i,j] ) );
            end
        end

        return(error);
    end

    function Iterate!(A,Anew,m,n)
        Error = 1.0;
        iter     = 0;

        tol   = 1.0e-6;
        iter_max = 10;
        while (Error>tol && iter<iter_max)

            Error = calcNext!(Anew,A,m,n);
            copyto!(A,Anew);

            if(mod(iter,100) == 0); println(iter,"\t",Error);  end

            iter += 1;
        end
    end
end

Output from Julia code:

Jacobi relaxation Calculation: 4096x4096 mesh
0       0.25
100     0.002397062376472081
200     0.0012043203299441085
300     0.000803706300640139
400     0.0006034607481846255
500     0.0004830019181236156
600     0.00040251372422966947
700     0.00034514687472469996
800     0.0003021170179446919
900     0.0002685514561578395
 30.395864 seconds (12.75 M allocations: 622.198 MiB, 0.40% gc time)

Output from Fortran code:

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0     0.25000000000000000000
  100     0.00239706248976290226
  200     0.00120432034600526094
  300     0.00080370629439130425
  400     0.00060346076497808099
  500     0.00048300193157047033
  600     0.00040251371683552861
  700     0.00034514686558395624
  800     0.00030211702687665820
  900     0.00026855146279558539
 completed in     28.006 seconds

With 4096x4096 matrices, the memory problems are going to be even worse than when they were 512x512: each Float64 matrix is now 128 MiB, far larger than any level of cache.

Another optimization is to drop the copies in the loop:

function Iterate!(A,Anew,m,n)
    Error = 1.0;
    iter     = 0;

    tol   = 1.0e-6;
    iter_max = 10;
    Aorig = A
    while (Error>tol && iter<iter_max)

        Error = calcNext!(Anew,A,m,n);
        Anew, A = A, Anew

        if(mod(iter,100) == 0); println(iter,"\t",Error);  end

        iter += 1;
    end
    A === Aorig || copyto!(Aorig, A)
    nothing
end

Why you should get rid of any unnecessary copyto!s: swapping the bindings is free, while each copy moves the full 128 MiB array:

julia> A = rand(4096, 4096); Anew = similar(A);

julia> @btime calcNext3!($Anew, $A)
  18.138 ms (0 allocations: 0 bytes)
 0.9740572038986998

julia> @btime copyto!($Anew, $A);
  13.360 ms (0 allocations: 0 bytes)

They're almost as slow as calcNext3! (the @avx unroll=(1,1) version).

FWIW, I get:

julia> A,Anew = initialize(m,n);

julia> @time Iterate!(A,Anew,m,n);
0       0.25
100     0.002397062376472081
200     0.0012043203299441085
300     0.000803706300640139
400     0.0006034607481846255
500     0.0004830019181236156
600     0.00040251372422966947
700     0.00034514687472469996
800     0.0003021170179446919
900     0.0002685514561578395
 18.105712 seconds (213 allocations: 9.109 KiB)

This was with iter_max = 1_000.


@Elrod that is just amazing!! I didn't expect to get even this far! Thanks a lot for supporting me all the way here! Now the Julia code is 40% faster than the Fortran code!

I have a question regarding running the code twice. The following is what I got.


julia> include("jacobi.jl")
[ Info: Precompiling laplace2d [top-level]
Jacobi relaxation Calculation: 4096x4096 mesh
0       0.25
100     0.002397062376472081
200     0.0012043203299441085
300     0.000803706300640139
400     0.0006034607481846255
500     0.0004830019181236156
600     0.00040251372422966947
700     0.00034514687472469996
800     0.0003021170179446919
900     0.0002685514561578395
 24.432541 seconds (12.70 M allocations: 619.979 MiB, 0.99% gc time)

julia> include("jacobi.jl")
Jacobi relaxation Calculation: 4096x4096 mesh
0       0.25
100     0.002397062376472081
200     0.0012043203299441085
300     0.000803706300640139
400     0.0006034607481846255
500     0.0004830019181236156
600     0.00040251372422966947
700     0.00034514687472469996
800     0.0003021170179446919
900     0.0002685514561578395
 17.869136 seconds (173 allocations: 7.859 KiB)

The number of allocations is lower only when I run the code more than once, and I am not clear on why that is so (you had asked earlier whether this was counting compilation; I did not quite follow that either). Why is there so much difference between the two executions? Also, does that mean that if I somehow compile the code and run it as an executable, I would get performance akin to the second execution?


The first time you use @avx, you need to compile not only that function, but LoopVectorization as well. This will take on the order of 5+ seconds.

Yes.
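If you want the first @time to look more like the second, one trick (a sketch, reusing the definitions above) is to warm up on a tiny problem so everything is compiled before you time the real run:

Awarm, Anewwarm = initialize(8, 8)   # small dummy problem with the same types
Iterate!(Awarm, Anewwarm, 8, 8)      # compiles Iterate!, calcNext!, and the @avx kernel
@time Iterate!(A, Anew, m, n)        # now measures (almost) only the computation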


I created another script, JacobiSysImage.jl, based on the PackageCompiler documentation and video.

Code for JacobiSysImage.jl

using PackageCompiler
create_sysimage(:LoopVectorization, sysimage_path="jacobi_sys_image.so", precompile_execution_file="jacobi.jl")

Then, with the help of a Stack Overflow discussion, I put together a two-line terminal invocation that runs the main code using the compiled packages:

$ julia JacobiSysImage.jl
$ julia -J jacobi_sys_image.so -q -e 'include("jacobi.jl")' 

I decreased the execution time further this way, and now the code is around 50% faster than the compiled Fortran code. I should also mention that PackageCompiler took a few minutes to create jacobi_sys_image.so.

The whole idea behind this was to execute the main code using two lines of terminal script, similar to what was done in Fortran.

Although there is an option to create an executable app with PackageCompiler, I did not find it easy enough to set up or worth the effort (correct me if I am wrong).

Also, is there no way to keep allocations minimal on the first execution of the code? I still get a large number of allocations when I execute the code (although the performance is similar to the second execution):
15.010896 seconds (1.29 M allocations: 64.140 MiB, 0.03% gc time)

I believe the functions you defined in the script are still being recompiled, but the library dependencies (in particular, LoopVectorization) don't have to be recompiled, making this much faster. There's a risk of some of LoopVectorization needing to be recompiled anyway, because it's currently passing gensymed variables as type parameters; I'll address that in a future release.

If you create an app, none of it should have to recompile.
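For reference, a rough sketch of the app route with PackageCompiler (project name and layout here are hypothetical):

using PackageCompiler
# "JacobiApp" must be a package directory with a Project.toml and a
# src/JacobiApp.jl defining julia_main()::Cint, which runs the solver.
create_app("JacobiApp", "JacobiAppCompiled")
# The compiled binary then runs as JacobiAppCompiled/bin/JacobiApp with no
# Julia compilation at startup.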