OpenBLAS: Julia slower than R

pedro.rafae · March 7, 2019, 7:18pm

Dear, I am not a programmer for Julia and so most likely my doubt is due to the little knowledge I have of Julia.

On a computer, I tried to produce similar codes from the Julia and R languages and got a lower performance on Julia. The codes are below:

Code Julia:

M = rand(Float64, 5000, 5000);

function inversa_loop(matriz, n)
     for i = 1:n
        inv(matriz)
    end
end

@time inversa_loop(M, 10)

Time: 65.243 sec

Code R:

M <- matrix(runif(n = 5000^2, 0, 1), 5000, 5000)

inversa_loop <- function(matriz, n = 10){
  for (i in 1:n){
    solve(matriz)
  }
}

system.time(inversa_loop(M))

Time: 30.231 sec

Later I tried to improve the Julia code by declaring the variable types to the inversa_loop function to see if this would help improve performance. As we can see, yes, there was an improvement in performance considering the code below:

Modified Julia code:

M = rand(Float64, 5000, 5000);
 function inversa_loop(matriz::Array{Int64,2}, n::Integer)
     for i = 1:n
	 	inv(matriz)
	 end
 end
 
@time inversa_loop(M, 10)

Time: 43.654 sec

Then I noticed that maybe I’m doing an unfair comparison, since the solve function of R is implemented in C or Fortran and the inv is implemented in Julia. In this way, I can understand that purely Julia codes are generally more efficient than purely written R codes.

However, why common functions of linear algebra in Julia were not implemented using C/C ++? Would not that be a way to eliminate those cases where we have R functions running more efficiently than purely written codes in Julia? Perhaps there is a technical reason that justifies and I do not know. Maybe it’s just a matter of philosophy of language. Perhaps the reason is the belief that in the near future the LLVM machine will evolve to the point of eliminating those differences.

I keep thinking about the case of a language user who will repeat hundreds of thousands of times Julia codes where in these repetitions exist algebraic operations that in languages like R are implemented efficiently using languages such as C or C ++. I also know that I can call C / C ++ codes in Julia, but that would be something undesirable and cumbersome for something that is so simple in other languages since they already make use of efficient codes , such as the solve function in R.

Best regards.

stevengj · March 7, 2019, 7:30pm

You are timing the inversion of a 5000x5000 matrix — this O(n³) operation completely dominates every other operation — so the benchmark has nothing to do with language. It depends entirely on which linear-algebra library is linked and how many threads it is using.

That’s not correct. For double-precision calculations (Float64), Julia, R, Numpy, Matlab, etcetera all do matrix inversion with LAPACK, but there are different optimized versions of LAPACK (or the underlying BLAS library) that can be linked by all of these. Sometimes people link OpenBLAS, sometimes they link MKL, or sometimes another library — both Julia and R can be linked with various BLAS libraries with different tradeoffs. In your benchmark, you are mostly timing which BLAS library was linked. Also, different libraries use different numbers of threads by default.

In Julia, you can find out the BLAS library you are using via:

using LinearAlgebra
BLAS.vendor()

I’m not sure what the easiest way to determine this is for R.

Basically, if you are doing computations that are dominated by dense-matrix operations like solving Ax=b or Ax=λx, it doesn’t really matter what language you are using because every language uses LAPACK and can be configured link to the same BLAS libraries. But if you do enough computational work, eventually you are going to run into a problem where you have to write your own “inner loops” (performance-critical code), and that’s where language choice matters. To do a useful cross-language benchmark, you need to be timing inner-loop code, not calls to external libraries.

pedro.rafae · March 7, 2019, 7:33pm

Julia and R are using OpenBlas, in that case. I understand. I linkei R to use OpenBLAS as is the case of Julia here on my machine. Therefore, both languages are making use of OpenBLAS.

stillyslalom · March 7, 2019, 7:34pm

Also note that your type annotation is for an Int64 array, which means your narrower function definition won’t be called with a Float64 input array. Type annotations on function arguments aren’t necessary for performance in most cases.

pedro.rafae · March 7, 2019, 7:34pm

Julia:

julia> using LinearAlgebra

julia> BLAS.vendor()
:openblas

R

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libope

stevengj · March 7, 2019, 7:35pm

Maybe they are using different numbers of threads, then? You can control how many threads Julia’s BLAS uses via BLAS.set_num_threads(n); not sure how to do it in R.

(They are also probably linked to different builds/versions of OpenBLAS, but I’m not sure that should matter so much here.)

pedro.rafae · March 7, 2019, 7:38pm

In linux, I can control using the operating system itself by export OPENBLAS_NUM_THREADS=$nproc. In both cases I am making the same use of the number of threads.

stevengj · March 7, 2019, 7:38pm

(Type annotations on function arguments have zero impact on performance in any case, not just most cases. The annotations are merely a “filter” to determine which methods get called for which arguments.)

pedro.rafae · March 7, 2019, 7:39pm

On the machine in question, I have only one thread per core.

pedro.rafae · March 7, 2019, 7:42pm

Okay. So will there be another way to improve the code to the point of being more efficient or at least equivalent with the R code? Thanks.

stevengj · March 7, 2019, 7:48pm

This is generating an array whose entries are all 5000. The solve function therefore might be terminating early because the matrix is singular?

The analogue of Julia’s rand(5000,5000) in R would be

M <- matrix(runif(5000^2), 5000)

no?

stevengj · March 7, 2019, 8:01pm

Your R is linked with OpenBLAS in /opt/OpenBLAS, whereas Julia is probably linked with its own build of OpenBLAS. Is it possible that your library in /opt/OpenBLAS is compiled with NO_AFFINITY=0 (i.e. with thread affinity enabled)? Julia ~~is compiled with NO_AFFINITY=1~~ sets OPENBLAS_MAIN_FREE to disable thread affinity IIRC, which could affect performance in a benchmark like this.

You could use taskset in Linux to set CPU affinity in Julia.

Note also that you should probably run @time inversa_loop(M, 10) a couple of times in Julia as the first execution is also measuring some JIT compilation overhead. (If you use @btime from the BenchmarkTools package then it does timing measurements more carefully for you.)

kristoffer.carlsson · March 7, 2019, 8:07pm

If we want to be really accurate, this is not 100% true since type annotation does change some specialization heuristics:

github.com/JuliaCI/BenchmarkTools.jl

force specialization on arguments to core wrapper

JuliaCI:master ← JuliaCI:kc/force_spec_in_core

opened 02:26AM - 18 Oct 18 UTC

KristofferC

+5 -1

Sometimes defining a function as `f(a, b, c) = g(a, b, c)` might mean that the s…pecialization heuristic kicks in when the type of an argument is a `Function` or `Type`. This leads to issues like https://github.com/JuliaLang/julia/issues/28087 and https://github.com/JuliaCI/BenchmarkTools.jl/issues/71. One possible way around this is to force specialization using `where`. This might make sense since presumably, we want this core wrapper to be "as invisible as possible". Fixes https://github.com/JuliaCI/BenchmarkTools.jl/issues/71. ``` julia> b1 = @benchmarkable rand($Float64); tune!(b1); run(b1) BenchmarkTools.Trial: memory estimate: 0 bytes allocs estimate: 0 -------------- minimum time: 5.333 ns (0.00% GC) median time: 6.078 ns (0.00% GC) mean time: 6.379 ns (0.00% GC) maximum time: 23.800 ns (0.00% GC) -------------- samples: 10000 evals/sample: 1000 ```

github.com/JuliaLang/julia

Force specialization on the type argument of `_similar_for`

JuliaLang:master ← JuliaLang:mh/specialize_similar_for

opened 12:58PM - 10 Dec 18 UTC

martinholters

+9 -6

Looking into the `["misc", "iterators", "zip(1:1, 1:1, 1:1, 1:1)"] ` regression …reported in https://github.com/JuliaLang/julia/pull/30218#issuecomment-443746637, it looks like #28284 pushed the `HasShape` method of `_similar_for` beyond the inlining threshold for that case, causing the regression. Adding a `@_meta_inline` would directly combat that, but forcing specialization on the type argument is less invasive and still restores performance: ```julia julia> for N in (1,1000), M in (1,2,3,4) @btime collect(zip($(Iterators.repeated(1:N, M)...,)...)) end # master 30.213 ns (1 allocation: 96 bytes) 32.706 ns (1 allocation: 96 bytes) 38.512 ns (1 allocation: 112 bytes) 130.325 ns (5 allocations: 272 bytes) 1.069 μs (1 allocation: 7.94 KiB) 1.753 μs (1 allocation: 15.75 KiB) 2.367 μs (2 allocations: 23.52 KiB) 3.187 μs (6 allocations: 31.48 KiB) # this PR 30.425 ns (1 allocation: 96 bytes) 32.565 ns (1 allocation: 96 bytes) 37.667 ns (1 allocation: 112 bytes) 42.188 ns (1 allocation: 112 bytes) 1.048 μs (1 allocation: 7.94 KiB) 1.781 μs (1 allocation: 15.75 KiB) 2.359 μs (2 allocations: 23.52 KiB) 2.837 μs (2 allocations: 31.33 KiB) ``` I hope @nanosoldier `runbenchmarks(ALL, vs = ":master")` agrees. In that case, I guess we want this on 1.1 as well. I vaguely remember having noticed this when working on #29238 but decided it was an orthogonal change to be done in a separate PR and then forgot about it...

github.com/JuliaLang/julia

specialization on function argument not occuring as expected

opened 10:12PM - 27 Oct 16 UTC

closed 11:16AM - 23 Jun 20 UTC

Sacha0

doc

``` julia julia> VERSION v"0.6.0-dev.1079" julia> notricks(f, x) = broadcast(f,… x) notricks (generic function with 1 method) julia> forcespec{TF}(f::TF, x) = broadcast(f, x) forcespec (generic function with 1 method) julia> x = rand(10^2); julia> using BenchmarkTools julia> @benchmark notricks($abs, $x) BenchmarkTools.Trial: samples: 10000 evals/sample: 1 time tolerance: 5.00% memory tolerance: 1.00% memory estimate: 1.91 kb allocs estimate: 28 minimum time: 19.00 μs (0.00% GC) median time: 20.26 μs (0.00% GC) mean time: 22.96 μs (0.00% GC) maximum time: 295.82 μs (0.00% GC) julia> @benchmark forcespec($abs, $x) BenchmarkTools.Trial: samples: 10000 evals/sample: 849 time tolerance: 5.00% memory tolerance: 1.00% memory estimate: 912.00 bytes allocs estimate: 2 minimum time: 146.00 ns (0.00% GC) median time: 157.00 ns (0.00% GC) mean time: 201.21 ns (14.08% GC) maximum time: 1.63 μs (82.76% GC) ``` See https://github.com/JuliaLang/julia/pull/18975#discussion_r84537865 for the original discussion and ref. #19065. Best!

etc.

Nothing to think about while everyday coding and definitely not something that has to be brought up to new users of Julia, but might be worth knowing at least.

pedro.rafae · March 7, 2019, 8:12pm

That was a writing error only here. Sorry for the mistake. I will correct the above code. The time was computed considering M <- matrix(runif(n = 5000^2, 0, 1), 5000, 5000).

pedro.rafae · March 7, 2019, 8:16pm

I’ll have to check it out with more calmly.

bennedich · March 8, 2019, 5:06am

Echoing what Kristoffer said, and I think this is something that actually affects beginners every once in a while when they forget to interpolate their variables when benchmarking. Just the other day I encountered some confusion around code like the below:

inc_array_1(a::Vector{Float64}) = a .+= 1
inc_array_2(a) = a .+= 1

a = zeros(16);
@btime inc_array_1(a); # 7.267 ns (0 allocations: 0 bytes)
@btime inc_array_2(a); # 17.712 ns (0 allocations: 0 bytes)

For comparison (with interpolation):

@btime inc_array_1($a); # 4.809 ns (0 allocations: 0 bytes)
@btime inc_array_2($a); # 4.797 ns (0 allocations: 0 bytes)

Elrod · March 8, 2019, 6:30am

How many physical cores does your system have?
OpenBLAS with Julia is capped at 16 threads.

Watching top, I saw that R used >2000% CPU, while Julia used 800%. The number of physical CPU cores was 16, so I set BLAS.set_num_threads(16) which gave about a 20% performance increase.

R took about 10% longer than Julia using 8 threads.
export OPENBLAS_NUM_THREADS=16 and then launching R in the exact same terminal tab did not reduce the CPU use by R, so I do not think that worked.

Because of the cap on Julia’s OpenBLAS, I could not actually test running both with the same number of threads.

Because practically 100% of the time spent is spent by OpenBLAS, I’d expect equal performance if both were actually using the same number of threads.

I could edit that file and recompile for the sake of testing this, but given that my expectation is it will run slower (when set to using that many threads), I’m not exactly excited.

pedro.rafae · March 8, 2019, 10:07am

Elrod:

How many physical cores does your system have?
OpenBLAS with Julia is capped at 16 threads .

Watching top , I saw that R used >2000% CPU, while Julia used 800%. The number of physical CPU cores was 16, so I set BLAS.set_num_threads(16) which gave about a 20% performance increase.

R took about 10% longer than Julia using 8 threads.
export OPENBLAS_NUM_THREADS=16 and then launching R in the exact same terminal tab did not reduce the CPU use by R, so I do not think that worked.

Because of the cap on Julia’s OpenBLAS, I could not actually test running both with the same number of threads.

Because practically 100% of the time spent is spent by OpenBLAS, I’d expect equal performance if both were actually using the same number of threads.

I could edit that file and recompile for the sake of testing this, but given that my expectation is it will run slower (when set to using that many threads), I’m not exactly excited.

In my case, I have 4 physical cores, each of them supporting only one thread. Details of my CPU is below:


[pedro-de pedro]# lscpu
Arquitetura:                x86_64
Modo(s) operacional da CPU: 32-bit, 64-bit
Ordem dos bytes:            Little Endian
Tamanhos de endereço:       39 bits physical, 48 bits virtual
CPU(s):                     4
Lista de CPU(s) on-line:    0-3
Thread(s) per núcleo:       1
Núcleo(s) por soquete:      4
Soquete(s):                 1
Nó(s) de NUMA:              1
ID de fornecedor:           GenuineIntel
Família da CPU:             6
Modelo:                     60
Nome do modelo:             Intel(R) Core(TM) i5-4590S CPU @ 3.00GHz
Step:                       3
CPU MHz:                    798.238
CPU MHz máx.:               3700,0000
CPU MHz mín.:               800,0000
BogoMIPS:                   5988.80
Virtualização:              VT-x
cache de L1d:               32K
cache de L1i:               32K
cache de L2:                256K
cache de L3:                6144K
CPU(s) de nó0 NUMA:         0-3
Opções:                     fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l1d

export OPENBLAS_NUM_THREADS=4 makes me have less performance than consider export OPENBLAS_NUM_THREADS=1. I understand that export OPENBLAS_NUM_THREADS refers to the number of threads per core and not the number of physical colors. That way, I think the correct thing would be to consider export OPENBLAS_NUM_THREADS=1. Do not you agree?

pedro.rafae · March 8, 2019, 10:16am

I will try to compile the OpenBLAS library of R in the form of OpenBLAS used by Julia. So maybe I can do a fairer comparison. Best, I will try to link the R literally with OpenBLAS used by Julia in the directory below:

[pedro-de julia]# pwd
/usr/lib64/julia
[pedro-de julia]# ls
libamd.so    libccalltest.so        libcolamd.so  liblapack.so        libLLVM.so        libopenlibm.so      libpcre2-8.so.0          libsuitesparse_wrapper.so
libblas.so   libccalltest.so.debug  libdSFMT.so   libLLVM-6.0.1.so    libmpfr.so        libopenlibm.so.2    libpcre2-8.so.0.6.0      libumfpack.so
libcamd.so   libccolamd.so          libgit2.so    libLLVM-6.0.so      libmpfr.so.6      libopenlibm.so.2.5  libspqr.so               sys.so
libcblas.so  libcholmod.so          libgmp.so     libllvmcalltest.so  libmpfr.so.6.0.1  libpcre2-8.so       libsuitesparseconfig.so

I will report the result soon.

pedro.rafae · March 14, 2019, 1:29pm

I created a symbolic link of the libblas.so file, used by Julia in the /usr/lib64/julia directory for the libopenblas_haswellp-r0.3.6.dev.so file used by R and found in the /opt/OpenBLAS/lib directory.

It seems to me that now i am using the openblas library that is used by R. Well, still I noticed that the code Julia was 1/3 slower than the code R. I think that the basic functions of linear algebra and numerical optimization per Nelder-Mead, BFGS, L-BFGS B and Simulated Annealing should be implemented in languages such as C/C++ and thus could be used, without the need for major optimizations by users of Julia.

Does anyone who considers themselves an advanced Julia user succeed in producing a Julia code, similar to the one above, in that the inv function of Julia is more efficient than the solve function of R? Well, I’ll keep trying.

Note: To those who only come to read this comment in isolation, please read the previous ones. I understand that pure R codes tend to be slower than pure Julia codes. I also understand that it may be unfair to compare R functions that use C with Julia functions that use only Julia.

Topic		Replies	Views
LinearAlgebra.inv matrix inversion slower than MATLAB? Performance performance	21	2142	October 25, 2021
Performance inv & solve in Julia & R Performance	15	1422	August 11, 2020
Julia vs Fortran speed Performance fortran	5	1926	October 13, 2021
Julia is significantly slower (~18 x) than Matlab in vector and matrix algebra New to Julia	32	1877	June 25, 2023
Julia matrix-multiplication performance Performance linearalgebra	20	8629	October 30, 2022

OpenBLAS: Julia slower than R

Related topics