@stevengj I agree it's fast enough for practical purposes. I just wanted to be sure that I had tried out all possible options and achieved the maximum speedup I could get. It's just for my practice with the language.
@PetrKryslUCSD This is a great link you directed me to! In fact, I was trying this with NVIDIA codes for C and Fortran.
Just curious: Why is the registered LoopVectorization stuck at 0.6.30?
I tried these improvements in three places: a Windows 10 system, Ubuntu 20.04 under WSL-2 on the same Windows 10 system, and an Ubuntu 16.04 server. Specifically, I compared using `@inbounds @fastmath` against using `@avx`.
I was able to get improvements in Windows 10 as expected: my code ran 25% faster after using the `@avx` macro. However, surprisingly, I could not find any difference whatsoever in WSL-2 Ubuntu or on the Ubuntu server. How is this possible? (I may be wrong, but I have a feeling that Windows 10 is able to leverage these benefits of the Intel core while Ubuntu is not.)
Because FinEtools places an upper bound on LoopVectorization at "0.6". Without this upper bound, the package manager can select 0.8.20.
Could you provide `versioninfo()` for all three?
As this was the same system as the Windows benchmarks, what were the relative timings between them?
Also, you could try
function calcNext2!(Anew, A)
    error = zero(eltype(A))
    m, n = size(A)
    @avx unroll=(1,1) for j = 2:(m-1)   # unroll=(1,1): don't unroll either loop
        for i = 2:(n-1)
            # four-neighbor Jacobi average
            Anew[i,j] = 0.25*( A[i+1, j ]
                             + A[i-1, j ]
                             + A[ i ,j-1]
                             + A[ i ,j+1])
            error = max(error, abs(Anew[i,j] - A[i,j]))
        end
    end
    return error
end
to prevent unrolling the loops. This seems to help performance slightly.
It increases the number of instructions needed to evaluate the loop by about 30%, but it decreases the number of L1d misses, which in turn increases the instructions/clock.
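To compare against your current version yourself, something like this sketch works (`A` and `Anew` here are made-up test arrays):

using BenchmarkTools
A = rand(4096, 4096); Anew = similar(A);
@btime calcNext2!($Anew, $A);   # the unroll=(1,1) version above
# swap in your @inbounds @fastmath version here for comparison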
What I said earlier about the `@pstats` macro (which you can try via LinuxPerf on the Ubuntu server, assuming `/proc/sys/kernel/perf_event_paranoid` is <= 1, or that you have permission to run `sudo sh -c "echo 1 > /proc/sys/kernel/perf_event_paranoid"`) is that when you use `@avx`, the CPU only has to do a small fraction of the work it does without it. But the CPU then goes from being very busy to spending most of its time idle, doing nothing while waiting for memory to arrive.
I suspect that the hardware prefetcher doesn't know what to do with the memory access pattern when we don't unroll, and either gives up or (worse) prefetches data we don't need. But I'm no expert when it comes to this.
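If you want to look at this yourself, roughly how I run it (the event string is just my usual set; trim it to whatever counters your CPU exposes):

using LinuxPerf  # on the Ubuntu server: ] add LinuxPerf

# warm up once first so compilation isn't counted
calcNext2!(Anew, A)
@pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(L1-dcache-load-misses,L1-dcache-loads)" begin
    calcNext2!(Anew, A)
end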
Of course, I forgot that when I do `update`, it will not tell me that a package update is possible. This is why I originally considered the idea of forcing all releases to have upper bounds controversial: there is no good way to communicate available updates. One possible aid is to subscribe to the releases of dependencies on GitHub.
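One way to see what is holding a package back (a sketch from the Pkg REPL; on a 1.4-era Pkg, `st` itself does not flag held-back versions):

pkg> st LoopVectorization        # shows the installed version (0.6.30 here)
pkg> add LoopVectorization@0.8   # if a compat bound forbids this, the resolver
                                 # error names the package imposing the bound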
WSL-2 Ubuntu 20.04 in Windows 10:
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
JULIA_PKG_SERVER = pkg.juliacomputing.com
JULIA_DEPOT_PATH = /home/sambit98/.juliapro/JuliaPro_v1.4.2-1:/mnt/d/SAMBIT/JULIA/JuliaInstallation/Jn/JuliaPro-1.4.2-1/Julia/local/share/julia:/mnt/d/SAMBIT/JULIA/JuliaInstallation/JuliaPro-1.4.2-1/Ju/shlia/share/julia
Ubuntu 16.04 Server:
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Windows 10:
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
JULIA_EDITOR = "C:\Users\Ashok Kumar Mishra\AppData\Local\atom\app-1.49.0\atom.exe" -a
JULIA_NUM_THREADS = 4
The following are the times taken. I ran the code two or three times and took the time of the last execution in each case. I ran the Ubuntu sessions in a terminal and the Windows 10 session in Juno, keeping all other factors as consistent as possible.
Using @inbounds @fastmath
WSL-2 Ubuntu 20.04 : 38.616142 seconds (674.78 k allocations: 31.788 MiB, 0.07% gc time)
Server Ubuntu 16.04 : 34.896664 seconds (674.78 k allocations: 31.786 MiB, 0.05% gc time)
Windows 10 : 39.097357 seconds (672.78 k allocations: 31.679 MiB)
Using @avx
WSL-2 Ubuntu 20.04 : 42.330296 seconds (14.95 M allocations: 742.403 MiB, 0.48% gc time)
Server Ubuntu 16.04 : 35.750165 seconds (13.52 M allocations: 660.662 MiB, 0.34% gc time)
Windows 10 : 38.948415 seconds (13.47 M allocations: 658.412 MiB, 0.71% gc time)
I realised that I had incorrectly concluded that `@avx` makes a significant difference on Windows 10 alone: I was using Revise to change the function body and re-execute, which is what gave me the 25% speedup. So in the end, based on these runs, `@avx` slightly decreases the speed.
Using `@avx unroll=(1,1)`
WSL-2 Ubuntu 20.04: 39.749676 seconds (14.23 M allocations: 706.557 MiB, 0.47% gc time)
Server Ubuntu 16.04: 30.470257 seconds (12.75 M allocations: 622.198 MiB, 0.40% gc time)
Windows 10: 36.784031 seconds (12.71 M allocations: 619.726 MiB, 0.73% gc time)
So the last suggestion is the best, although it's still slower than the Fortran code, which takes 28 seconds on the Ubuntu server.
Based on https://www.youtube.com/watch?v=xPhnJCAkI4k, I think these issues will be gone by v1.5. I may be wrong, since as a newcomer I didn't understand much of the video.
That's a lot of allocations. Are your matrices non-const globals? Also, if you run to convergence, you may expect a runtime difference due to a different number of iterations between implementations.
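As a quick illustration of what a non-const global does (a made-up example, not your code):

using BenchmarkTools

B = rand(1000, 1000)         # non-const global: its type may change at any time,
                             # so functions reading it can't be statically compiled
const Bc = rand(1000, 1000)  # const global: the type is fixed

f() = sum(B)   # dynamic dispatch on every call -> allocations
g() = sum(Bc)  # fully inferred -> 0 allocations

@btime f();
@btime g();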
The matrices are not globals. I have initialised them and used them as shown below.
push!(LOAD_PATH,".");
using laplace2d
const n = 4096;
const m = 4096;
A,Anew = initialize(m,n);
println("Jacobi relaxation Calculation: ",n,"x",m," mesh");
@time Iterate!(A,Anew,m,n);
Therein I used the following function:
function initialize(m,n)
A = zeros(Float64,n,m);
A[1,:] = ones(Float64,n );
Anew = deepcopy(A);
return(A,Anew)
end
A fixed number of 1000 iterations is executed all the time.
Although I have a feeling that the way the arrays are defined could be adapted to my needs, I don't know how I should do that (or whether there is any scope for improvement in that direction).
And why are there so many more allocations with the `@avx` versions than with the `@inbounds @fastmath` versions? Is this counting compilation?
As for `initialize`, do you really need `Anew` to be a copy of `A`? In the `calcNext` function, you're overwriting its contents without reading them. You also don't need to make a vector of ones:
function initialize(m,n)
    A = zeros(Float64,n,m);
    A[1,:] .= 1          # broadcast in place; no temporary vector of ones
    Anew = similar(A);   # left uninitialized; calcNext! fills the interior
    return(A,Anew)
end
I did not pay much attention to initialization since the bottlenecks of the code were the `calc!` and `copy!` functions. I am comparing my results with the Fortran code, where the error is calculated considering the edges of the domain too, which are not modified throughout the simulation. Hence, using `similar` is not preferred here. I have incorporated the other changes.
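(Though if I did want to drop the full `deepcopy`, I suppose I could copy only the boundary, since that is the only part of `Anew` that is read before being written; an untested sketch with a hypothetical name:)

function initialize_nocopy(m,n)   # hypothetical variant, not my current initialize
    A = zeros(Float64,n,m);
    A[1,:] .= 1
    Anew = similar(A);
    Anew[1,:]   .= @view A[1,:]    # copy just the four edges so the fixed
    Anew[end,:] .= @view A[end,:]  # boundary matches A exactly
    Anew[:,1]   .= @view A[:,1]
    Anew[:,end] .= @view A[:,end]
    return(A,Anew)
end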
I am not sure about this; can you tell me how to find out? I ran two simulations consecutively with `using Revise` and got the following results.
RUN 1: 36.588347 seconds (12.55 M allocations: 612.136 MiB, 0.32% gc time)
RUN 2: 30.291741 seconds (174 allocations: 8.188 KiB)
Also, it seems like this is getting more involved, so I have included the entire code below: `jacobi.jl` is the main program and `laplace2d.jl` is written as a module.
jacobi.jl
push!(LOAD_PATH,".");
using laplace2d
const n = 4096;
const m = 4096;
A,Anew = initialize(m,n);
println("Jacobi relaxation Calculation: ",n,"x",m," mesh");
@time Iterate!(A,Anew,m,n);
laplace2d.jl
module laplace2d
using LoopVectorization
export initialize
export Iterate!
function initialize(m,n)
A = zeros(Float64,n,m);
A[1,:] .= 1
Anew = deepcopy(A);
return(A,Anew)
end
function calcNext!(Anew,A,m,n)
    error = 0.0::Float64;
    @avx unroll=(1,1) for j=2:(m-1)
        for i = 2:(n-1)
            Anew[i,j] = 0.25*( A[i+1, j ]
                             + A[i-1, j ]
                             + A[ i ,j-1]
                             + A[ i ,j+1] );
            error = max(error, abs( Anew[i,j] - A[i,j] ) );
        end
    end
    return(error);
end
function Iterate!(A,Anew,m,n)
    Error = 1.0::Float64;
    iter = 0;
    tol = 1.0e-6::Float64;
    iter_max = 10;
    while (Error>tol && iter<iter_max)
        Error = calcNext!(Anew,A,m,n);
        copyto!(A,Anew);
        if(mod(iter,100) == 0); println(iter,"\t",Error); end
        iter += 1;
    end
end
end
Output from Julia code:
Jacobi relaxation Calculation: 4096x4096 mesh
0 0.25
100 0.002397062376472081
200 0.0012043203299441085
300 0.000803706300640139
400 0.0006034607481846255
500 0.0004830019181236156
600 0.00040251372422966947
700 0.00034514687472469996
800 0.0003021170179446919
900 0.0002685514561578395
30.395864 seconds (12.75 M allocations: 622.198 MiB, 0.40% gc time)
Output from Fortran code:
Jacobi relaxation Calculation: 4096 x 4096 mesh
0 0.25000000000000000000
100 0.00239706248976290226
200 0.00120432034600526094
300 0.00080370629439130425
400 0.00060346076497808099
500 0.00048300193157047033
600 0.00040251371683552861
700 0.00034514686558395624
800 0.00030211702687665820
900 0.00026855146279558539
completed in 28.006 seconds
With 4096x4096 matrices, the memory problems are going to be even worse than when they were 512x512.
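For scale, a back-of-the-envelope check (the ~8 MiB L3 figure is an assumption for these CPUs):

julia> 4096 * 4096 * sizeof(Float64) / 2^20   # MiB per array
128.0

julia> 2 * 128                                 # both arrays, vs. ~8 MiB of L3
256

So essentially every sweep has to stream from main memory.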
Another optimization is to drop the copies in the loop:
function Iterate!(A,Anew,m,n)
    Error = 1.0::Float64;
    iter = 0;
    tol = 1.0e-6::Float64;
    iter_max = 10;
    Aorig = A                 # remember the caller's array
    while (Error>tol && iter<iter_max)
        Error = calcNext!(Anew,A,m,n);
        Anew, A = A, Anew     # swap buffers instead of copying 128 MiB
        if(mod(iter,100) == 0); println(iter,"\t",Error); end
        iter += 1;
    end
    A === Aorig || copyto!(Aorig, A)   # copy back only if the result ended up in the other buffer
    nothing
end
Why you should get rid of any unnecessary `copyto!`s:
julia> using BenchmarkTools

julia> A = rand(4096, 4096); Anew = similar(A);
julia> @btime calcNext3!($Anew, $A)
18.138 ms (0 allocations: 0 bytes)
0.9740572038986998
julia> @btime copyto!($Anew, $A);
13.360 ms (0 allocations: 0 bytes)
They're almost as slow as `calcNext3!` (the `@avx unroll=(1,1)` version).
FWIW, I get:
julia> A,Anew = initialize(m,n);
julia> @time Iterate!(A,Anew,m,n);
0 0.25
100 0.002397062376472081
200 0.0012043203299441085
300 0.000803706300640139
400 0.0006034607481846255
500 0.0004830019181236156
600 0.00040251372422966947
700 0.00034514687472469996
800 0.0003021170179446919
900 0.0002685514561578395
18.105712 seconds (213 allocations: 9.109 KiB)
This was with `iter_max = 1_000`.
@Elrod that is just amazing!! I didn't expect to get this far! Thanks a lot for supporting me all the way! Now the Julia code is 40% faster than the Fortran code!
I have a question about running the code twice. The following is what I got.
julia> include("jacobi.jl")
[ Info: Precompiling laplace2d [top-level]
Jacobi relaxation Calculation: 4096x4096 mesh
0 0.25
100 0.002397062376472081
200 0.0012043203299441085
300 0.000803706300640139
400 0.0006034607481846255
500 0.0004830019181236156
600 0.00040251372422966947
700 0.00034514687472469996
800 0.0003021170179446919
900 0.0002685514561578395
24.432541 seconds (12.70 M allocations: 619.979 MiB, 0.99% gc time)
julia> include("jacobi.jl")
Jacobi relaxation Calculation: 4096x4096 mesh
0 0.25
100 0.002397062376472081
200 0.0012043203299441085
300 0.000803706300640139
400 0.0006034607481846255
500 0.0004830019181236156
600 0.00040251372422966947
700 0.00034514687472469996
800 0.0003021170179446919
900 0.0002685514561578395
17.869136 seconds (173 allocations: 7.859 KiB)
The number of allocations is low only when I run the code more than once, and I am not clear on why that is so (you asked earlier whether the timing counts compilation, which I also didn't understand). Why is there so much difference between the two executions? Also, does that mean that if I somehow compiled the code and ran it as an executable, I would get performance akin to the second execution?
The first time you use `@avx`, you need to compile not only that function but LoopVectorization itself as well. This will take on the order of 5+ seconds.
Yes.
I created another script, `JacobiSysImage.jl`, based on the PackageCompiler documentation and video.
Code for JacobiSysImage.jl
using PackageCompiler
create_sysimage(:LoopVectorization, sysimage_path="jacobi_sys_image.so", precompile_execution_file="jacobi.jl")
Then, with the help of a Stack Overflow discussion, I put together a two-line terminal invocation that runs the main code using the compiled packages:
$ julia JacobiSysImage.jl
$ julia -J jacobi_sys_image.so -q -e 'include("jacobi.jl")'
I decreased the execution time further this way, and now the code is around 50% faster than the compiled Fortran code. I should also mention that PackageCompiler took a few minutes to create `jacobi_sys_image.so`.
The whole idea behind this was to execute the main code using two lines of terminal script, similar to what was done in Fortran.
Although there is an option to create an executable app with PackageCompiler, I did not find it easy enough to use, or worth the effort (correct me if I am wrong).
Also, is there no way to keep allocations minimal on the first execution of the code? I still get a large number of allocations when I first execute it (although the performance is similar to the second execution).
15.010896 seconds (1.29 M allocations: 64.140 MiB, 0.03% gc time)
I believe the functions you defined in the script are still being recompiled, but the library dependencies (in particular, LoopVectorization) don't have to be recompiled, making this much faster. There's a risk that parts of LoopVectorization need to be recompiled anyway, because it currently passes `gensym`ed variables as type parameters. I'll address that in a future release.
If you create an app, none of it should have to recompile.
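For reference, roughly what that looks like (the `JacobiApp` paths are placeholders; the app's package must define a `julia_main()::Cint` entry point, which the compiled binary calls):

# build_app.jl -- a sketch, not a drop-in script
using PackageCompiler
create_app("JacobiApp", "JacobiAppCompiled")
# afterwards, JacobiAppCompiled/bin/JacobiApp runs with no JIT warm-up
# for anything baked into the image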