# BLAS: Julia slower than R

#1

Dear all, I am not a Julia programmer, so most likely my doubt is due to my limited knowledge of Julia.

On one computer, I tried to write equivalent code in Julia and in R, and I got lower performance from Julia. The codes are below:

Julia code:

```
M = rand(Float64, 5000, 5000);

function inversa_loop(matriz, n)
    for i = 1:n
        inv(matriz)
    end
end

@time inversa_loop(M, 10)
```

Time: 65.243 sec

R code:

```
M <- matrix(runif(n = 5000^2, 0, 1), 5000, 5000)

inversa_loop <- function(matriz, n = 10){
  for (i in 1:n){
    solve(matriz)
  }
}

system.time(inversa_loop(M))
```

Time: 30.231 sec

Later I tried to improve the Julia code by declaring the argument types of the `inversa_loop` function, to see if this would help performance. As we can see below, there was indeed an improvement:

Modified Julia code:

```
M = rand(Float64, 5000, 5000);

function inversa_loop(matriz::Array{Int64,2}, n::Integer)
    for i = 1:n
        inv(matriz)
    end
end

@time inversa_loop(M, 10)
```

Time: 43.654 sec

Then I noticed that maybe I'm making an unfair comparison, since R's `solve` function is implemented in C or Fortran while `inv` is implemented in Julia. In that light, I can accept that purely Julia code is generally more efficient than purely written R code.

However, why were common linear-algebra functions in Julia not implemented in C/C++? Wouldn't that eliminate those cases where R functions run more efficiently than code written purely in Julia? Perhaps there is a technical reason that I don't know. Maybe it's just a matter of language philosophy. Perhaps the reason is the belief that in the near future the LLVM machinery will evolve to the point of eliminating those differences.

I keep thinking about the case of a user who will run Julia code hundreds of thousands of times, where those repetitions contain algebraic operations that in languages like R are implemented efficiently in C or C++. I also know that I can call C/C++ code from Julia, but that would be undesirable and cumbersome for something that is so simple in other languages, since they already make use of efficient code, such as the `solve` function in R.

Best regards.

#2

You are timing the inversion of a 5000×5000 matrix. This O(n³) operation completely dominates every other operation, so the benchmark has nothing to do with language. It depends entirely on which linear-algebra library is linked and how many threads it is using.

That's not correct. For double-precision calculations (`Float64`), Julia, R, NumPy, Matlab, etcetera all do matrix inversion with LAPACK, but there are different optimized versions of LAPACK (or of the underlying BLAS library) that any of these can be linked against. Sometimes people link OpenBLAS, sometimes MKL, sometimes another library; both Julia and R can be linked with various BLAS libraries with different tradeoffs. In your benchmark, you are mostly timing which BLAS library was linked. Also, different libraries use different numbers of threads by default.

In Julia, you can find out the BLAS library you are using via:

```
using LinearAlgebra
BLAS.vendor()
```

I'm not sure what the easiest way to determine this is for R.

Basically, if you are doing computations that are dominated by dense-matrix operations like solving Ax=b or Ax=λx, it doesn't really matter what language you are using, because every language uses LAPACK and can be configured to link to the same BLAS libraries. But if you do enough computational work, eventually you are going to run into a problem where you have to write your own "inner loops" (performance-critical code), and that's where language choice matters. To do a useful cross-language benchmark, you need to be timing inner-loop code, not calls to external libraries.
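For example, a genuinely language-level comparison would time a hand-written loop rather than a library call. A minimal sketch (the function name and vector size here are just an illustration):

```julia
# A hand-written "inner loop": summing squares without calling any
# external library, so the timing reflects the language itself.
function sum_of_squares(v)
    s = zero(eltype(v))
    for x in v
        s += x * x
    end
    return s
end

v = rand(10_000)
sum_of_squares(v)          # first call includes JIT compilation
@time sum_of_squares(v)    # later calls time only the loop itself
```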

#3

Julia and R are both using OpenBLAS, in that case. I understand. I linked R against OpenBLAS, as is already the case for Julia here on my machine. Therefore, both languages are making use of OpenBLAS.

#4

Also note that your type annotation is for an `Int64` array, which means your narrower method definition won't be called with a `Float64` input array. Type annotations on function arguments aren't necessary for performance in most cases.
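To illustrate with a hypothetical minimal example: a method restricted to `Array{Int64,2}` simply does not match a `Float64` matrix, so Julia raises a `MethodError` instead of calling it:

```julia
# Hypothetical example: a method restricted to 2-D Int64 arrays.
descreve(m::Array{Int64,2}) = "integer matrix"

M = rand(Float64, 3, 3)       # this is a Matrix{Float64}
err = try
    descreve(M)               # no method matches this argument type
catch e
    e
end
println(err isa MethodError)  # prints "true": the annotation is a dispatch filter
```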

#5

Julia:

```
julia> using LinearAlgebra

julia> BLAS.vendor()
:openblas
```

R

```
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libope
```

#6

Maybe they are using different numbers of threads, then? You can control how many threads Julia's BLAS uses via `BLAS.set_num_threads(n)`; I'm not sure how to do it in R.

(They are also probably linked against different builds/versions of OpenBLAS, but I'm not sure that should matter so much here.)
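For example, a rough sketch of checking whether the thread count matters on a given machine (the matrix size and thread counts are arbitrary):

```julia
using LinearAlgebra

A = rand(1000, 1000)
inv(A)  # warm up: compile inv and initialize the BLAS threads

BLAS.set_num_threads(1)
t1 = @elapsed inv(A)

BLAS.set_num_threads(4)
t4 = @elapsed inv(A)

println("1 thread: $(round(t1; digits=3))s, 4 threads: $(round(t4; digits=3))s")
```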

#7

On Linux, I can control it from the operating system itself with `export OPENBLAS_NUM_THREADS=$nproc`. In both cases I am using the same number of threads.

#8

(Type annotations on function arguments have zero impact on performance in any case, not just most cases. The annotations are merely a "filter" to determine which methods get called for which arguments.)
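A quick way to convince yourself (hypothetical example): the annotated and unannotated versions return identical results, and Julia specializes both on the concrete argument type in the same way:

```julia
# Same body, with and without an argument-type annotation.
soma_tipada(v::Vector{Float64}) = sum(v)
soma_generica(v) = sum(v)

v = rand(1000)
println(soma_tipada(v) == soma_generica(v))  # prints "true"
```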

#9

On the machine in question, I have only one thread per core.

#10

Okay. So will there be another way to improve the code to the point of being more efficient or at least equivalent with the R code? Thanks.

#11

This is generating an array whose entries are all `5000`. The `solve` function might therefore be terminating early because the matrix is singular?

The analogue of Julia's `rand(5000,5000)` in R would be

```
M <- matrix(runif(5000^2), 5000)
```

no?

#12

Your R is linked with the OpenBLAS in `/opt/OpenBLAS`, whereas Julia is probably linked with its own build of OpenBLAS. Is it possible that your library in `/opt/OpenBLAS` is compiled with `NO_AFFINITY=0` (i.e. with thread affinity enabled)? Julia's OpenBLAS is compiled with `NO_AFFINITY=1` and sets `OPENBLAS_MAIN_FREE` to disable thread affinity, IIRC, which could affect performance in a benchmark like this.

You could use `taskset` in Linux to set CPU affinity in Julia.

Note also that you should probably run `@time inversa_loop(M, 10)` a couple of times in Julia, as the first execution also measures some JIT compilation overhead. (If you use `@btime` from the BenchmarkTools package, it does the timing measurements more carefully for you.)
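For example, the difference between the first and second `@time` is the compilation overhead (the matrix size here is arbitrary):

```julia
using LinearAlgebra

M = rand(500, 500)
@time inv(M)  # first call: includes JIT compilation of inv for this type
@time inv(M)  # second call: closer to the true runtime

# With BenchmarkTools installed, interpolate globals with $:
# using BenchmarkTools
# @btime inv($M);
```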

#13

If we want to be really accurate, this is not 100% true, since type annotations do change some specialization heuristics:

etc.

Nothing to think about in everyday coding, and definitely not something that has to be brought up to new users of Julia, but it might be worth knowing at least.

#14

That was just a writing error here. Sorry for the mistake; I will correct the code above. The time was computed with `M <- matrix(runif(n = 5000^2, 0, 1), 5000, 5000)`.

#15

I'll have to look at it more calmly.

#16

Echoing what Kristoffer said, and I think this is something that actually affects beginners every once in a while, when they forget to interpolate their variables while benchmarking. Just the other day I encountered some confusion around code like the following:

```
using BenchmarkTools

inc_array_1(a::Vector{Float64}) = a .+= 1
inc_array_2(a) = a .+= 1

a = zeros(16);
@btime inc_array_1(a); # 7.267 ns (0 allocations: 0 bytes)
@btime inc_array_2(a); # 17.712 ns (0 allocations: 0 bytes)
```

For comparison (with interpolation):

```
@btime inc_array_1($a); # 4.809 ns (0 allocations: 0 bytes)
@btime inc_array_2($a); # 4.797 ns (0 allocations: 0 bytes)
```

#17

How many physical cores does your system have?
OpenBLAS with Julia is capped at 16 threads.

Watching `top`, I saw that R used >2000% CPU, while Julia used 800%. The number of physical CPU cores was 16, so I set `BLAS.set_num_threads(16)` which gave about a 20% performance increase.

`export OPENBLAS_NUM_THREADS=16` and then launching R in the exact same terminal tab did not reduce R's CPU use, so I do not think that worked.

Because of the cap on Juliaâ€™s OpenBLAS, I could not actually test running both with the same number of threads.

Because practically 100% of the time is spent in OpenBLAS, I'd expect equal performance if both were actually using the same number of threads.

I could edit that file and recompile for the sake of testing this, but given that my expectation is that it will run slower (when set to use that many threads), I'm not exactly excited.

#18

In my case, I have 4 physical cores, each supporting only one thread. Details of my CPU are below:

```
[pedro-de pedro]# lscpu
Arquitetura:                x86_64
Modo(s) operacional da CPU: 32-bit, 64-bit
Ordem dos bytes:            Little Endian
Tamanhos de endereço:       39 bits physical, 48 bits virtual
CPU(s):                     4
Lista de CPU(s) on-line:    0-3
Núcleo(s) por soquete:      4
Soquete(s):                 1
Nó(s) de NUMA:              1
ID de fornecedor:           GenuineIntel
Família da CPU:             6
Modelo:                     60
Nome do modelo:             Intel(R) Core(TM) i5-4590S CPU @ 3.00GHz
Step:                       3
CPU MHz:                    798.238
CPU MHz máx.:               3700,0000
CPU MHz mín.:               800,0000
BogoMIPS:                   5988.80
Virtualização:              VT-x
cache de L1d:               32K
cache de L1i:               32K
cache de L2:                256K
cache de L3:                6144K
CPU(s) de nó0 NUMA:         0-3
Opções:                     fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l1d
```

`export OPENBLAS_NUM_THREADS=4` gives me worse performance than `export OPENBLAS_NUM_THREADS=1`. I understand that `OPENBLAS_NUM_THREADS` refers to the number of threads per core and not the number of physical cores. That way, I think the correct thing would be to use `export OPENBLAS_NUM_THREADS=1`. Don't you agree?

#19

I will try to compile the OpenBLAS library for R the same way as the OpenBLAS used by Julia, so maybe I can make a fairer comparison. Better yet, I will try to link R directly against the OpenBLAS used by Julia, in the directory below:

```
[pedro-de julia]# pwd
/usr/lib64/julia
[pedro-de julia]# ls
libamd.so    libccalltest.so        libcolamd.so  liblapack.so        libLLVM.so        libopenlibm.so      libpcre2-8.so.0          libsuitesparse_wrapper.so
libblas.so   libccalltest.so.debug  libdSFMT.so   libLLVM-6.0.1.so    libmpfr.so        libopenlibm.so.2    libpcre2-8.so.0.6.0      libumfpack.so
libcamd.so   libccolamd.so          libgit2.so    libLLVM-6.0.so      libmpfr.so.6      libopenlibm.so.2.5  libspqr.so               sys.so
libcblas.so  libcholmod.so          libgmp.so     libllvmcalltest.so  libmpfr.so.6.0.1  libpcre2-8.so       libsuitesparseconfig.so
```

I will report the result soon.

#20

I created a symbolic link from the `libblas.so` file used by Julia in the `/usr/lib64/julia` directory to the `libopenblas_haswellp-r0.3.6.dev.so` file used by R, found in the `/opt/OpenBLAS/lib` directory.

It seems to me that I am now using the OpenBLAS library that R uses. Still, I noticed that the Julia code was 1/3 slower than the R code. I think that the basic functions of linear algebra, and of numerical optimization by Nelder-Mead, BFGS, L-BFGS-B, and Simulated Annealing, should be implemented in languages such as C/C++, and thus could be used without the need for major optimizations by Julia users.

Can anyone who considers themselves an advanced Julia user produce Julia code, similar to the one above, in which Julia's `inv` function is more efficient than R's `solve` function? Well, I'll keep trying.

Note: to those who come to read this comment in isolation, please read the previous ones first. I understand that pure R code tends to be slower than pure Julia code. I also understand that it may be unfair to compare R functions that use C with Julia functions that use only Julia.