OpenBLAS: Julia slower than R

Let’s follow what inv does.

https://github.com/JuliaLang/julia/blob/95fcde9dda4261ebd9dd2eb694b46fc74e472ef6/stdlib/LinearAlgebra/src/dense.jl#L730

https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/lu.jl#L142

https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/lu.jl#L40

Ok, so first we create a copy of A and then compute the LU factorization via a call to LAPACK.getrf!(A).

And then:

https://github.com/JuliaLang/julia/blob/95fcde9dda4261ebd9dd2eb694b46fc74e472ef6/stdlib/LinearAlgebra/src/lu.jl#L411

we call LAPACK.getri! to get the inverse from the LU factorization.
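To make the sequence concrete, here is a minimal sketch of the same chain of calls (my_inv is a made-up name; this is a simplified illustration, not the actual stdlib code, which also handles special matrix types and more error checking):

using LinearAlgebra

# Simplified sketch of the call chain behind inv(A) for a dense Float64 matrix:
# copy, LU-factorize in place via dgetrf, then invert the factors via dgetri.
function my_inv(A::Matrix{Float64})
    B = copy(A)                                       # lu copies so A is not destroyed
    B, ipiv, info = LinearAlgebra.LAPACK.getrf!(B)    # LU factorization (dgetrf)
    info == 0 || throw(SingularException(info))
    return LinearAlgebra.LAPACK.getri!(B, ipiv)       # inverse from the LU factors (dgetri)
end

A = rand(100, 100)
my_inv(A) ≈ inv(A)   # true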

There is “no” Julia code running here; it all gets handed off to LAPACK. If we profile:

julia> Profile.print()
813 ./task.jl:259; (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})()
 813 ...build/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:117; macro expansion
  813 ...build/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
   813 ./boot.jl:328; eval(::Module, ::Any)
    813 ...hare/julia/stdlib/v1.1/LinearAlgebra/src/dense.jl:732; inv(::Array{Float64,2})
     646 .../share/julia/stdlib/v1.1/LinearAlgebra/src/lu.jl:410; inv!
      646 ...re/julia/stdlib/v1.1/LinearAlgebra/src/lapack.jl:978; getri!(::Array{Float64,2}, ::Array{Int64,1})
     167 .../share/julia/stdlib/v1.1/LinearAlgebra/src/lu.jl:142; lu
      167 .../share/julia/stdlib/v1.1/LinearAlgebra/src/lu.jl:142; lu
       167 ...share/julia/stdlib/v1.1/LinearAlgebra/src/lu.jl:142; #lu#107
        89 ./array.jl:308; copy
        78 ./none:0; #lu!
         78 ...e/julia/stdlib/v1.1/LinearAlgebra/src/lapack.jl:550; #lu!#103(::Bool, ::Function, ::Array{Float64,2}, ::Val{true})

we can see that we are inside getri! (LAPACK) for the majority of the time.
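For reference, a profile like the one above can be produced with something like the following (a sketch; the 5000×5000 size matches the benchmark discussed in this thread):

julia> using Profile, LinearAlgebra

julia> M = rand(5000, 5000);

julia> inv(M);   # warm up, so compilation is not profiled

julia> @profile inv(M);

julia> Profile.print()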


Can anyone who considers themselves an advanced Julia user produce Julia code, similar to the one above, in which Julia’s inv function is more efficient than R’s solve function? Well, I’ll keep trying.

I tried on two different computers. Julia was faster with your code on both, so I cannot reproduce your problem.
But because LAPACK dominated the runtime, that is just a statement about the quirks of the linear algebra libraries, not of the languages.

Also, I thought R uses a 32-bit BLAS while Julia uses a 64-bit one. Does swapping symlinks really work?


I wrote a tutorial on how to exchange the reference BLAS for an optimized implementation, for example OpenBLAS, for performing linear algebra. In R, yes, swapping the link works perfectly. If your machine has MKL, OpenBLAS, or some other BLAS implementation, following the procedures I documented will let you make the exchange in R.

See: Using OpenBLAS in R.

I think in your case the Julia code was more efficient because Julia is using OpenBLAS and R is using the reference BLAS. A default installation of R usually uses the reference BLAS.

However, if you change R’s BLAS to OpenBLAS, you will most likely see better performance from the R code.

In both cases (Julia and R), the comparisons I made used an OpenBLAS library compiled for a 64-bit architecture.

No, both Julia and R are using OpenBLAS in my case.

R’s reference BLAS is not multithreaded, for one thing.
Its performance is rather abysmal.
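As a rough illustration (machine-dependent, and it understates the gap, since the reference BLAS is also slower per thread), you can mimic the threading difference within Julia alone:

julia> using LinearAlgebra

julia> M = rand(2000, 2000);

julia> BLAS.set_num_threads(1);   # single-threaded, like R's reference BLAS

julia> @time inv(M);

julia> BLAS.set_num_threads(Sys.CPU_THREADS);   # back to multithreaded OpenBLAS

julia> @time inv(M);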

EDIT:

$ ldd /usr/lib/R/bin/exec/R | grep blas
	libblas.so.3 => /usr/lib/libblas.so.3 (0x00007f91054df000)
$ readlink /usr/lib/libblas.so.3
libopenblasp-r0.3.5.so

In R, run sessionInfo() and post the output.

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Antergos Linux

Matrix products: default
BLAS: /usr/lib/libopenblasp-r0.3.5.so
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.3

EDIT: I also have another computer with:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-generic-linux-gnu (64-bit)
Running under: Clear Linux OS

Matrix products: default
BLAS/LAPACK: /usr/lib64/haswell/avx512_1/libopenblas_skylakexp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.2

But last time I tried that R benchmark, the computer crashed. =/
I may have to fiddle with the overclock settings. So I’m not exactly sure how they will perform relative to one another on this system. I’ll report on that later.

Do you mind expanding on why the latter set of results is faster? I am a beginner and even after looking at what I think is the relevant section of the documentation, I don’t understand why the second set would be different.


With the BenchmarkTools macros, when you don’t interpolate, all the arguments are treated as global variables. When interpolating, they’re treated as though they’re local.

Functions operating on non-constant global variables have to do dynamic dispatch, because the compiler cannot assume the type of a non-constant global stays fixed.

julia> using BenchmarkTools

julia> inc_array_1(a::Vector{Float64}) = a .+= 1
inc_array_1 (generic function with 1 method)

julia> inc_array_2(a) = a .+= 1
inc_array_2 (generic function with 1 method)

julia> a = zeros(32);

julia> const consta = a;

julia> @btime inc_array_1(a);
  5.015 ns (0 allocations: 0 bytes)

julia> @btime inc_array_2(a);
  10.762 ns (0 allocations: 0 bytes)

julia> @btime inc_array_1($a);
  3.890 ns (0 allocations: 0 bytes)

julia> @btime inc_array_2($a);
  3.890 ns (0 allocations: 0 bytes)

julia> @btime inc_array_1(consta);
  3.890 ns (0 allocations: 0 bytes)

julia> @btime inc_array_2(consta);
  3.758 ns (0 allocations: 0 bytes)

julia> @btime inc_array_1($consta);
  3.690 ns (0 allocations: 0 bytes)

julia> @btime inc_array_2($consta);
  3.748 ns (0 allocations: 0 bytes)

I believe that in your case the solve() function is using the reference LAPACK and not OpenBLAS. In my case I have:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.6.dev.so

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=pt_BR.UTF-8    
 [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=pt_BR.UTF-8   
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.2

When you have free time, try compiling OpenBLAS on your machine. From what I saw, you use Antergos, so the Arch tutorial will work for your GNU/Linux distribution.

See: Compiling OpenBLAS and Linking with R

Best regards.

Instructions for compiling R, OpenBLAS and linking R with OpenBLAS (GNU/Linux)

DEPENDENCIES: make, cmake, gcc, gcc-fortran, and tk.

Compiling OpenBLAS

First, download the R and OpenBLAS (Open Optimized BLAS Library) source code. In the directory containing the downloaded file, perform the following steps.

tar -zxvf OpenBLAS*
cd OpenBLAS*
make -j $(nproc)
sudo make install
export LD_LIBRARY_PATH=/opt/OpenBLAS/lib/

or

git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS*
make -j $(nproc)
sudo make install
export LD_LIBRARY_PATH=/opt/OpenBLAS/lib/

Note: -j $(nproc) makes the compilation run faster by using all of your CPU’s cores. To see the number of cores, run: nproc.

Compiling R with OpenBLAS

After compiling OpenBLAS, download the R source code. It is not necessary to compile R itself to make use of OpenBLAS, but compiling the language may bring some benefits, which may be insignificant depending on what you do in R.

Note: In my operating system, Arch Linux, OpenBLAS was installed in the /opt directory. Search for the OpenBLAS installation directory in your GNU/Linux distribution.

In the directory where R was downloaded, do the following:

tar -zxvf R*
cd R-* && ./configure --enable-R-shlib --enable-threads=posix --with-blas="-lopenblas -L/opt/OpenBLAS/lib -I/opt/OpenBLAS/include -m64 -lpthread -lm"
make -j $(nproc)
sudo make install

Most likely the OpenBLAS library will now be linked to R. To check, run sessionInfo() in R. Something like the output below should appear:

Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.6.dev.so

If linking does not occur, follow the steps outlined in the code below.

We need to link R with the libopenblas_* file created when compiling the OpenBLAS library. In my case, the file is libopenblas_haswellp-r0.2.20.so. Look for it in /opt/OpenBLAS/lib or in the directory where OpenBLAS was installed on your GNU/Linux system. Also look for the directory of the libRblas.so file, found in the R installation directory. In Arch, this directory is /usr/local/lib64/R/lib.

cd /usr/local/lib64/R/lib
mv libRblas.so libRblas.so.keep
ln -s /opt/OpenBLAS/lib/libopenblas_haswellp-r0.2.20.so libRblas.so

Start an R session and run sessionInfo(). You should see something like:

Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.6.dev.so

No: the default Julia OpenBLAS build is ABI-incompatible with R, so it is impossible to simply move files around to make R use Julia’s BLAS or vice versa.

In particular, even on a 64-bit build, R uses the standard BLAS interface, which uses the 32-bit int type for sizes, which limits you to linear algebra on vectors/matrices with no more than 2³¹–1 elements per dimension. (There is a bigalgebra package for R to work around this limitation.)

In contrast, the OpenBLAS that Julia links to is specially built so that it uses 64-bit integers on 64-bit machines. To avoid accidentally linking this 64-bit BLAS to other libraries that expect a 32-bit interface, Julia’s OpenBLAS renames the symbols in the BLAS and LAPACK libraries. (You can double-check your Julia build by looking at LinearAlgebra.BlasInt, which is Int64 if Julia was built with the 64-bit BLAS interface.)
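For example, on a default 64-bit Julia build (note that BLAS.vendor() exists on the Julia 1.x versions discussed in this thread, though it was deprecated in later releases):

julia> using LinearAlgebra

julia> LinearAlgebra.BlasInt   # Int64 means the 64-bit (ILP64) BLAS interface
Int64

julia> LinearAlgebra.BLAS.vendor()   # :openblas64 is the renamed 64-bit OpenBLAS build
:openblas64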

Because of this, it is impossible for R and the default Julia to share a BLAS. You may think that this is what you did by moving links around, but you are wrong.

The only way to link Julia and R to the same BLAS is to recompile Julia to use a 32-bit BLAS interface (pass USE_BLAS64=0 to make IIRC), and then possibly you need to recompile R as well to make sure it points to the new library.


As the comment preceding this one explained, that doesn’t work. But stevengj gives suggestions on how to use the same BLAS libraries.

As for why my LAPACK was slower with R than Julia, I already suggested a simple explanation:
I was testing on a 16 core / 32 thread computer. OpenBLAS through Julia used 16 threads, while R showed > 2000% CPU. I am (again) rather confident that my R is using OpenBLAS for LAPACK.

export OPENBLAS_NUM_THREADS=16 from the command line didn’t work, but I just tried this suggestion:

require(inline)
openblas.set.num.threads <- cfunction( signature(ipt="integer"),
      body = 'openblas_set_num_threads(*ipt);',
      otherdefs = c ('extern void openblas_set_num_threads(int);'),
      libargs = c ('-L/opt/openblas/lib -lopenblas'),
      language = "C",
      convention = ".C"
)
openblas.set.num.threads(16)
> system.time(inversa_loop(M))
   user  system elapsed 
185.036   7.342  18.111

versus what I had with 32 threads:

> system.time(inversa_loop(M))
   user  system elapsed 
453.946  14.618  22.254

In Julia, after setting 16 threads (while the default was 8), I get < 17 seconds, so Julia is still faster. I am skeptical of the suggestion that my R may not be using OpenBLAS for LAPACK.
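For concreteness, the Julia-side thread count was set like this (a sketch; timing output omitted, and the 5000×5000 matrix matches the benchmark in this thread):

julia> using LinearAlgebra, BenchmarkTools

julia> BLAS.set_num_threads(16);   # one thread per physical core

julia> M = rand(5000, 5000);

julia> @btime inv($M);   # interpolate M to avoid the global-variable penalty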

Meanwhile, my other computer is still crashing when trying to take the inverse of a 5000 x 5000 matrix, in either R or Julia.

export OPENBLAS_NUM_THREADS=16 does not seem correct, since OPENBLAS_NUM_THREADS should be passed the number of threads per core, not the total number of threads. For example, in my case I have:

[pedro@pedro-avell ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
Stepping:            3
CPU MHz:             1596.519
CPU max MHz:         3500.0000
CPU min MHz:         800.0000
BogoMIPS:            4990.74
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l1d

So, in my case I have 2 threads per core. It would not be correct to do export OPENBLAS_NUM_THREADS=8.

See: Matrix multiplications: stack overflow & crash! · Issue #3256 · JuliaLang/julia · GitHub

See: Lapack tests slower with OPENBLAS_NUM_THREADS>1 · Issue #192 · xianyi/OpenBLAS · GitHub

For this case, I might consider export OPENBLAS_NUM_THREADS=2. However, it is advisable to consider export OPENBLAS_NUM_THREADS=1.

That did not work either. It still used 32 threads on the 16 core/32 thread system. Restricting to 16 threads (1 per core) gives the best BLAS performance in my tests.

Running the code shown earlier was able to restrict R to 16 threads.

A tutorial was here; it has been revised and now appears earlier in the thread.


Julia comes with OpenBLAS by default, so perhaps you want to elaborate a bit here?


I preferred to compile the OpenBLAS library separately, since I want this library to be used by other applications. Also, what matters is setting OPENBLAS_DYNAMIC_ARCH=0 and MARCH=native to solve the problem. For Arch Linux users and derived distributions, simply do:

yaourt -S openblas-lapack --noconfirm 
sudo pacman -S julia

Nothing prevents your operating system from already having a compiled OpenBLAS library. So it is possible, and normal, for a user who wants to compile the language not to compile OpenBLAS again. In systems maintenance, we often already have a well-tuned build of OpenBLAS and/or its derivatives.

Also, sometimes we want to test the behavior of the language with several versions of OpenBLAS. I think this is very common.

I think you want make -j $(nproc); compare

$ echo $nproc

$ echo $(nproc)
20

Yep, adjusted. Thanks.