Okay, tl;dr of this comment: I switched to an Intel computer, and still can’t reproduce. Julia is still about 50% faster than Numba, whether Julia is using MKL + libimf or OpenBLAS + openlibm.
My previous benchmarks were on a Ryzen (AMD) chip.
While I did build Julia from source, Julia does not actually fully benefit: LLVM 3.9.1 does not support Ryzen, so versioninfo() reports
LLVM: libLLVM-3.9.1 (ORCJIT, generic)
instead of
LLVM: libLLVM-3.9.1 (ORCJIT, zenver1)
Additionally, on master, I get spammed with “zenver1 not recognized” warnings.
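(For anyone checking their own setup, the JIT target is the second entry on the LLVM line of versioninfo(); as a rough sketch, I believe Sys.cpu_info() shows what the OS sees, independent of what LLVM actually targets:)
julia> versioninfo()            # LLVM line shows the JIT target, e.g. "(ORCJIT, generic)" vs "(ORCJIT, zenver1)"
julia> Sys.cpu_info()[1].model  # what the OS reports the CPU to be, regardless of the LLVM target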
But, other than that, I’m not sure what the implications actually are. When I start Julia with -O3, StaticArrays does get heavy SIMD (e.g., SMatrix{3,3} * SVector{3} takes around 2 ns with -O3, vs. around 6 ns with -O2).
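Roughly the kind of micro-benchmark I mean (a sketch using BenchmarkTools; I’ll post the real output later):
using StaticArrays, BenchmarkTools

A = @SMatrix rand(3, 3)  # stack-allocated 3x3 static matrix
x = @SVector rand(3)     # stack-allocated length-3 static vector

@btime $A * $x           # ~2 ns with -O3, closer to ~6 ns with -O2 on my machine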
I’ll update these with actual copy + pastes when I get home (and set up ssh again…).
I tried setting LLVM_VER = 4.0.0 in Make.user, but it wouldn’t build; it just hung on a patch. Maybe I can try going through the patches to see which are backports that are no longer necessary for 4.0.0… but unless someone tells me either “it’s easy, just do…” or that the benefit of having a supported processor is stupendous, I’ll just wait until Julia officially supports LLVM > 4.0.0.
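For reference, that attempt was just a one-line Make.user change (a sketch, nothing exotic):
# Make.user
LLVM_VER = 4.0.0   # build against LLVM 4.0.0 instead of the bundled 3.9.1; this hung while applying one of the patches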
Anyway, bringing up the processors again because I’m currently at my university computer, which is equipped with:
julia> versioninfo()
Julia Version 0.6.2
Commit d386e40* (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
WORD_SIZE: 64
BLAS: libmkl_rt
LAPACK: libmkl_rt
LIBM: libimf
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
Julia was built from source, and linked to MKL.
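If I remember right, the MKL build amounted to a couple of Make.user flags after sourcing Intel’s environment script; treat the exact variable names as approximate and check the build README:
# run before make:  source /opt/intel/bin/compilervars.sh intel64
# Make.user
USE_INTEL_MKL  = 1   # BLAS/LAPACK from MKL (the libmkl_rt above)
USE_INTEL_LIBM = 1   # Intel's libimf instead of openlibm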
I just downloaded and installed Anaconda. Figuring I can trust Intel when it comes to performance, I ran
conda config --add channels intel
before installing numba.
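i.e., roughly:
conda config --add channels intel
conda install numba    # should pull Intel's builds from that channel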
Results of running your scripts, Julia with MKL:
$ julia RBC.jl
2018-02-13T11:19:29.693
2018-02-13T11:19:39.048
0.9355
$ julia RBC.jl
2018-02-13T11:19:42.185
2018-02-13T11:19:51.531
0.9346
$ julia RBC.jl
2018-02-13T11:19:56.862
2018-02-13T11:20:06.229
0.9367000000000001
$ julia RBC.jl
2018-02-13T11:20:11.23
2018-02-13T11:20:20.589
0.9359
$ julia -O3 RBC.jl
2018-02-13T11:20:41.661
2018-02-13T11:20:51.045
0.9384
$ julia -O3 RBC.jl
2018-02-13T11:20:58.442
2018-02-13T11:21:08.098
0.9656
$ julia -O3 RBC.jl
2018-02-13T11:21:15.534
2018-02-13T11:21:25.519
0.9985
$ julia RBC.jl
2018-02-13T11:21:42.455
2018-02-13T11:21:51.993
0.9538
$ julia RBC.jl
2018-02-13T11:22:12.554
2018-02-13T11:22:21.909
0.9355
I also have:
julia> versioninfo()
Julia Version 0.6.2
Commit d386e40 (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
OpenBLAS and libopenlibm instead of MKL and libimf. The result?
$ /home/celrod/Documents/julia-OB/usr/bin/julia RBC.jl
2018-02-13T11:58:45.152
2018-02-13T11:58:54.44
0.9288
$ /home/celrod/Documents/julia-OB/usr/bin/julia RBC.jl
2018-02-13T11:59:00.121
2018-02-13T11:59:09.579
0.9458
$ /home/celrod/Documents/julia-OB/usr/bin/julia RBC.jl
2018-02-13T11:59:21.31
2018-02-13T11:59:30.652
0.9342
$ /home/celrod/Documents/julia-OB/usr/bin/julia RBC.jl
2018-02-13T11:59:35.405
2018-02-13T11:59:44.695
0.929
$ /home/celrod/Documents/julia-OB/usr/bin/julia RBC.jl
2018-02-13T11:59:48.826
2018-02-13T11:59:58.182
0.9356
No major difference whether we’re using the entirely open-source stack (OpenBLAS + openlibm) or MKL + libimf.
Python with Intel’s channel:
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:25:14.169929
2018-02-13 11:25:29.403137
1.5232970679178834
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:25:34.466773
2018-02-13 11:25:49.443639
1.4976639692205935
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:25:55.528475
2018-02-13 11:26:10.788838
1.5260133791249246
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:26:16.699890
2018-02-13 11:26:32.467552
1.5767432948108762
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:26:42.291484
2018-02-13 11:26:57.238478
1.494676438672468
Python without Intel’s channel:
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:55:01.450319
2018-02-13 11:55:16.455309
1.5004762928001583
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:55:23.097814
2018-02-13 11:55:38.407803
1.5309765067882837
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:55:43.886989
2018-02-13 11:55:59.518951
1.5631739235017448
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:56:39.370372
2018-02-13 11:56:54.729220
1.5358620896935462
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:56:58.620716
2018-02-13 11:57:13.877899
1.5256956543307751
No major difference depending on whether or not we’re using Intel’s channel.
RBC_CPP, g++ 7.2
$ g++-7 -Ofast -march=native RBC_CPP.cpp -o rbc_cpp && ./rbc_cpp
My check = 0.146549
Elapsed time is = 1.32139
$ g++-7 -fprofile-generate -Ofast -march=native RBC_CPP.cpp -o rbc_cpp && ./rbc_cpp
My check = 0.146549
Elapsed time is = 1.2594
$ ./rbc_cpp
My check = 0.146549
Elapsed time is = 1.2647
$ g++-7 -fprofile-use -Ofast -march=native RBC_CPP.cpp -o rbc_cpp && ./rbc_cpp
My check = 0.146549
Elapsed time is = 1.23953
$ ./rbc_cpp
My check = 0.146549
Elapsed time is = 1.2377
RBC_CPP, icc 18.0.0
$ icc -xHost -Ofast RBC_CPP.cpp -o irbc_cpp && ./irbc_cpp
My check = 0.146549
Elapsed time is = 0.665514
$ icc -prof-gen -xHost -Ofast RBC_CPP.cpp -o irbc_cpp && ./irbc_cpp
My check = 0.146549
Elapsed time is = 0.727978
$ ./irbc_cpp
My check = 0.146549
Elapsed time is = 0.734366
$ icc -prof-use -xHost -Ofast RBC_CPP.cpp -o irbc_cpp && ./irbc_cpp
My check = 0.146549
Elapsed time is = 0.636588
$ ./irbc_cpp
My check = 0.146549
Elapsed time is = 0.641704
RBC_CPP_2, g++ 7.2
$ g++-7 -Ofast -march=native RBC_CPP_2.cpp -o rbc_cpp_2 && ./rbc_cpp_2
My check = 0.146549
Elapsed time is = 1.30672 seconds.
$ g++-7 -fprofile-generate -Ofast -march=native RBC_CPP_2.cpp -o rbc_cpp_2 && ./rbc_cpp_2
My check = 0.146549
Elapsed time is = 1.23079 seconds.
$ ./rbc_cpp_2
My check = 0.146549
Elapsed time is = 1.22811 seconds.
$ g++-7 -fprofile-use -Ofast -march=native RBC_CPP_2.cpp -o rbc_cpp_2 && ./rbc_cpp_2
My check = 0.146549
Elapsed time is = 1.30294 seconds.
$ ./rbc_cpp_2
My check = 0.146549
Elapsed time is = 1.29535 seconds.
RBC_CPP_2, icc 18.0.0
$ icc -xHost -Ofast -std=c++11 RBC_CPP_2.cpp -o irbc_cpp_2 && ./irbc_cpp_2
My check = 0.146549
Elapsed time is = 2.31978 seconds.
$ icc -prof-gen -xHost -Ofast -std=c++11 RBC_CPP_2.cpp -o irbc_cpp_2 && ./irbc_cpp_2
My check = 0.146549
Elapsed time is = 4.41832 seconds.
$ ./irbc_cpp_2
My check = 0.146549
Elapsed time is = 4.41458 seconds.
$ icc -prof-use -xHost -Ofast -std=c++11 RBC_CPP_2.cpp -o irbc_cpp_2 && ./irbc_cpp_2
My check = 0.146549
Elapsed time is = 2.20696 seconds.
$ ./irbc_cpp_2
My check = 0.146549
Elapsed time is = 2.21374 seconds.
RBC_F90, gfortran 7.2
$ gfortran-7 -Ofast -march=native RBC_F90.f90 -o rbc_f && ./rbc_f
My check: 0.14654914390886931
Elapsed time is 1.32013893
$ gfortran-7 -fprofile-generate -Ofast -march=native RBC_F90.f90 -o rbc_f && ./rbc_f
My check: 0.14654914390886931
Elapsed time is 1.32421708
$ ./rbc_f
My check: 0.14654914390886931
Elapsed time is 1.36759603
$ gfortran-7 -fprofile-use -Ofast -march=native RBC_F90.f90 -o rbc_f && ./rbc_f
My check: 0.14654914390886931
Elapsed time is 1.27381504
$ ./rbc_f
My check: 0.14654914390886931
Elapsed time is 1.29750502
RBC_F90, ifort 18.0.0
$ ifort -xHost -Ofast RBC_F90.f90 -o irbc_f && ./irbc_f
My check: 0.146549143908869
Elapsed time is 0.6529130
$ ifort -prof-gen -xHost -Ofast RBC_F90.f90 -o irbc_f && ./irbc_f
My check: 0.146549143908869
Elapsed time is 0.9747370
$ ./irbc_f
My check: 0.146549143908869
Elapsed time is 0.8385390
$ ifort -prof-use -xHost -Ofast RBC_F90.f90 -o irbc_f && ./irbc_f
My check: 0.146549143908869
Elapsed time is 0.6932410
$ ./irbc_f
My check: 0.146549143908869
Elapsed time is 0.6723280
Intel was the clear winner for C++98 and Fortran, but lagged far behind for C++11 for some reason.
A rough hierarchy of results on this computer:
Intel’s compilers (minus C++11) > Julia > gcc > Numba > Intel’s C++11
Julia was equally fast with OpenBLAS + openlibm as with MKL + libimf, and in both cases roughly 50% faster than Numba. This is consistent with my results on the AMD computer.
However, Numba on my 3.60 GHz i7 was slower than on your 2.40 GHz i5.
FWIW, all my tests used Anaconda for Python 3.