OpenBLAS: Julia slower than R

I wrote a tutorial on how to swap the reference BLAS for an optimized implementation, for example OpenBLAS, for performing linear algebra. In R, the swap works perfectly: if your machine has MKL, OpenBLAS, or some other BLAS implementation, following the procedures I documented will let you make the exchange in R.

See: Using OpenBLAS in R.

I think in your case the Julia code was more efficient because Julia is using OpenBLAS while R is using the reference BLAS. A default installation of R usually uses the reference BLAS.

However, if you swap R's BLAS for OpenBLAS, you will most likely see better performance from the R code.

In both cases (Julia and R), the comparisons I made consider an OpenBLAS library compiled for a 64-bit architecture.

No, both Julia and R are using OpenBLAS in my case.

R’s reference BLAS is not multithreaded, for one thing.
Its performance is rather abysmal.
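
If you want to see how much the threading alone matters, LinearAlgebra.peakflops gives a rough measure of dense matmul throughput from within Julia. A minimal sketch; the thread counts are placeholders to adjust for your CPU:

using LinearAlgebra

# Rough double-precision GEMM throughput (flops) at two thread counts.
BLAS.set_num_threads(1)
single = LinearAlgebra.peakflops(4000)

BLAS.set_num_threads(4)   # placeholder; use your physical core count
multi = LinearAlgebra.peakflops(4000)

println("1 thread: ", single, " flops; 4 threads: ", multi, " flops")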

EDIT:

$ ldd /usr/lib/R/bin/exec/R | grep blas
	libblas.so.3 => /usr/lib/libblas.so.3 (0x00007f91054df000)
$ readlink /usr/lib/libblas.so.3
libopenblasp-r0.3.5.so

In R, run sessionInfo() and post the output.

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Antergos Linux

Matrix products: default
BLAS: /usr/lib/libopenblasp-r0.3.5.so
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.3

EDIT: I also have another computer with:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-generic-linux-gnu (64-bit)
Running under: Clear Linux OS

Matrix products: default
BLAS/LAPACK: /usr/lib64/haswell/avx512_1/libopenblas_skylakexp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.2

But last time I tried that R benchmark, the computer crashed. =/
I may have to fiddle with the overclock settings. So I’m not exactly sure how they will perform relative to one another on this system. I’ll report on that later.

Do you mind expanding on why the latter set of results is faster? I am a beginner and even after looking at what I think is the relevant section of the documentation, I don’t understand why the second set would be different.

With the BenchmarkTools macros, when you don’t interpolate, all the arguments are treated like global variables. When interpolating, they’re treated as though they’re local.

Functions operating on non-constant global variables have to do dynamic dispatches.

julia> using BenchmarkTools

julia> inc_array_1(a::Vector{Float64}) = a .+= 1
inc_array_1 (generic function with 1 method)

julia> inc_array_2(a) = a .+= 1
inc_array_2 (generic function with 1 method)

julia> a = zeros(32);

julia> const consta = a;

julia> @btime inc_array_1(a);
  5.015 ns (0 allocations: 0 bytes)

julia> @btime inc_array_2(a);
  10.762 ns (0 allocations: 0 bytes)

julia> @btime inc_array_1($a);
  3.890 ns (0 allocations: 0 bytes)

julia> @btime inc_array_2($a);
  3.890 ns (0 allocations: 0 bytes)

julia> @btime inc_array_1(consta);
  3.890 ns (0 allocations: 0 bytes)

julia> @btime inc_array_2(consta);
  3.758 ns (0 allocations: 0 bytes)

julia> @btime inc_array_1($consta);
  3.690 ns (0 allocations: 0 bytes)

julia> @btime inc_array_2($consta);
  3.748 ns (0 allocations: 0 bytes)

I believe that in your case the solve() function is using LAPACK and not OpenBLAS. In my case I have:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.6.dev.so

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=pt_BR.UTF-8    
 [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=pt_BR.UTF-8   
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.2

When you have free time, try compiling OpenBLAS on your machine. From what I saw, you use Antergos, so the Arch tutorial will work for your GNU/Linux distribution.

See: Compiling OpenBLAS and Linking with R

Best regards.

Instructions for compiling R, OpenBLAS and linking R with OpenBLAS (GNU/Linux)

DEPENDENCIES: make, cmake, gcc, gcc-fortran and tk.

Compiling OpenBLAS

First download the R and OpenBLAS (Open Optimized BLAS Library) source code from the OpenBLAS site. In the directory containing the file, perform the following steps.

tar -zxvf OpenBLAS*
cd OpenBLAS*
make -j $nproc
sudo make install
export LD_LIBRARY_PATH=/opt/OpenBLAS/lib/

or

git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS*
make -j $nproc
sudo make install
export LD_LIBRARY_PATH=/opt/OpenBLAS/lib/

Note: This will make the compilation run faster by using all the cores of your CPU. To find the number of cores, run: nproc.

Compiling R with OpenBLAS

After compiling OpenBLAS, download the R source code. Compiling R itself is not necessary to make use of OpenBLAS, but it may bring some benefits, possibly insignificant ones depending on what is being done in R.

Note: In my operating system, Arch Linux, OpenBLAS was installed in the /opt directory. Search for the OpenBLAS installation directory in your GNU/Linux distribution.

In the directory where R was downloaded, do the following:

tar -zxvf R*
cd R-* && ./configure --enable-R-shlib --enable-threads=posix --with-blas="-lopenblas -L/opt/OpenBLAS/lib -I/opt/OpenBLAS/include -m64 -lpthread -lm"
make -j $nproc
sudo make install

Most likely the OpenBLAS library will now be linked to R. To check, run sessionInfo() in R. Something like the output below should appear:

Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.6.dev.so

If linking does not occur, follow the steps outlined in the code below.

We need to link R to the libopenblas_* file created when compiling the OpenBLAS library. In my case, the file is libopenblas_haswellp-r0.2.20.so. Look for it in /opt/OpenBLAS/lib, or wherever OpenBLAS was installed on your GNU/Linux system. Also find the directory containing the libRblas.so file in the R installation; in Arch, this directory is /usr/local/lib64/R/lib.

cd /usr/local/lib64/R/lib
mv libRblas.so libRblas.so.keep
ln -s /opt/OpenBLAS/lib/libopenblas_haswellp-r0.2.20.so libRblas.so

Start an R session and run sessionInfo(). You should see something like:

Matrix products: default
BLAS/LAPACK: /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.6.dev.so

The default Julia OpenBLAS build is ABI-incompatible with R, so it is impossible to simply move files around to make R use Julia’s BLAS or vice versa.

In particular, even on a 64-bit build, R uses the standard BLAS interface, which uses the 32-bit int type for sizes, which limits you to linear algebra on vectors/matrices with no more than 2³¹–1 elements per dimension. (There is a bigalgebra package for R to work around this limitation.)

In contrast, the OpenBLAS that Julia links to is specially built so that it uses 64-bit integers on 64-bit machines. To avoid accidentally linking this 64-bit BLAS to other libraries that expect a 32-bit interface, Julia’s OpenBLAS renames the symbols in the BLAS and LAPACK libraries. (You can double-check your Julia build by looking at LinearAlgebra.BlasInt, which is Int64 if Julia was built with the 64-bit BLAS interface.)
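
For example, on a stock 64-bit Julia build:

julia> using LinearAlgebra

julia> LinearAlgebra.BlasInt
Int64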

Because of this, it is impossible for R and the default Julia to share a BLAS. You may think that this is what you did by moving links around, but you are wrong.

The only way to link Julia and R to the same BLAS is to recompile Julia to use a 32-bit BLAS interface (pass USE_BLAS64=0 to make IIRC), and then possibly you need to recompile R as well to make sure it points to the new library.

As the comment preceding this one explained, that doesn’t work. But stevengj gives suggestions on how to use the same BLAS libraries.

As for why my LAPACK was slower with R than Julia, I already suggested a simple explanation:
I was testing on a 16 core / 32 thread computer. OpenBLAS through Julia used 16 threads, while R showed > 2000% CPU. I am (again) rather confident that my R is using OpenBLAS for LAPACK.

export OPENBLAS_NUM_THREADS=16 from the command line didn’t work, but I just tried this suggestion:

require(inline)
openblas.set.num.threads <- cfunction(signature(ipt = "integer"),
      body = 'openblas_set_num_threads(*ipt);',
      otherdefs = c('extern void openblas_set_num_threads(int);'),
      libargs = c('-L/opt/openblas/lib -lopenblas'),
      language = "C",
      convention = ".C"
)
openblas.set.num.threads(16)
> system.time(inversa_loop(M))
   user  system elapsed 
185.036   7.342  18.111

versus what I had with 32 threads:

> system.time(inversa_loop(M))
   user  system elapsed 
453.946  14.618  22.254

In Julia, after setting 16 threads (while the default was 8), I get < 17 seconds, so Julia is still faster. I am skeptical of the suggestion that my R may not be using OpenBLAS for LAPACK.
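
For reference, capping the thread count on the Julia side takes a single call, something like the sketch below (the rand matrix stands in for the one used in these benchmarks):

using LinearAlgebra

M = rand(5000, 5000)      # placeholder for the 5000 x 5000 matrix above
BLAS.set_num_threads(16)  # one OpenBLAS thread per physical core
@time inv(M)              # analogous to R's solve(M)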

Meanwhile, my other computer is still crashing when trying to take the inverse of a 5000 x 5000 matrix, in either R or Julia.

export OPENBLAS_NUM_THREADS=16 does not seem correct, since OPENBLAS_NUM_THREADS should be set to the number of threads per core, not the total number of threads. For example, in my case I have:

[pedro@pedro-avell ~]$ lscpu
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
Address sizes:        39 bits physical, 48 bits virtual
CPU(s):               8
On-line CPU(s) list:  0-7
Thread(s) per core:   2
Core(s) per socket:   4
Socket(s):            1
NUMA node(s):         1
Vendor ID:            GenuineIntel
CPU family:           6
Model:                60
Model name:           Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
Stepping:             3
CPU MHz:              1596.519
CPU max MHz:          3500.0000
CPU min MHz:          800.0000
BogoMIPS:             4990.74
Virtualization:       VT-x
L1d cache:            32K
L1i cache:            32K
L2 cache:             256K
L3 cache:             6144K
NUMA node0 CPU(s):    0-7
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l1d

So, in my case I have 2 threads per core. It would not be correct to do export OPENBLAS_NUM_THREADS=8.

See: https://github.com/JuliaLang/julia/issues/3256

See: https://github.com/xianyi/OpenBLAS/issues/192

For this case, I might consider export OPENBLAS_NUM_THREADS=2. However, it is advisable to consider export OPENBLAS_NUM_THREADS=1.
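
From the Julia side, a quick way to inspect these counts is the sketch below; it assumes the Hwloc.jl package is installed for the physical-core query:

using Hwloc

logical  = Sys.CPU_THREADS              # logical CPUs, counting hyperthreads
physical = Hwloc.num_physical_cores()   # physical cores, via Hwloc.jl
println("logical = ", logical, ", physical = ", physical)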

That did not work either. It still used 32 threads on the 16 core/32 thread system. Restricting to 16 threads (1 per core) gives the best BLAS performance in my tests.

Running the code shown earlier was able to restrict R to 16 threads.

A tutorial was originally posted here; it has since been revised and appears further down in the thread.

Julia comes with OpenBLAS by default, so perhaps you want to elaborate a bit here?

I preferred to compile the OpenBLAS library separately, since I want this library to be used by other applications. Also, what matters is setting OPENBLAS_DYNAMIC_ARCH=0 and MARCH=native to solve the problem. For Arch Linux users and derived distributions, simply do:

yaourt -S openblas-lapack --noconfirm 
sudo pacman -S julia

Nothing prevents your operating system from already having a compiled OpenBLAS library. So it is possible, and normal, for a user who wants to compile the language not to compile the OpenBLAS library again. In systems maintenance, we often already have a well-tuned build of OpenBLAS and/or its derivatives.

For testing, for example, we may want to see how the language behaves with several versions of OpenBLAS. I think this is very common.

I think you want make -j $(nproc); compare

$ echo $nproc

$ echo $(nproc)
20

Yep. adjusted. Thanks.

As you might know, Julia is not really good at handling non-constant global variables; that is why we interpolate when benchmarking a function that takes a global variable as input. The difference between

@btime inc_array_1(a)

and

@btime inc_array_1($a)

is that in the latter you are interpolating a, so the benchmark treats it as a local variable and avoids timing the penalty of accessing a non-constant global. In any case, just interpolate input variables when benchmarking.
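
If you would rather not reference a global at all, the setup keyword of BenchmarkTools is another option. A minimal sketch:

using BenchmarkTools

# `v` is created fresh for each sample and is local to the benchmark,
# so there is no global-variable penalty and nothing to interpolate.
@btime sum(v) setup = (v = rand(1000))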

Dear colleagues, for anyone interested in linking the Julia language with a version of OpenBLAS compiled and installed separately from Julia, consider following the tutorial below. I still cannot explain why simply installing Julia, and letting it build OpenBLAS automatically, is 3 to 4 times slower on my machine. However, since I prefer a separate installation of OpenBLAS anyway, as I use it for several other applications, my problem is solved.

Instructions for compiling Julia language with OpenBLAS (GNU/Linux)

IMPORTANT: Be sure to remove all versions of BLAS, LAPACK and OpenBLAS installed in /usr/lib and /usr/lib64.

DEPENDENCIES:

  • git
  • make
  • cmake
  • gcc
  • gcc-fortran
  • patch

Important: I will assume throughout that the Julia project has been cloned into the directory $HOME/Downloads. I will also consider the /opt directory as the installation directory for the OpenBLAS library and for the Julia language. You can choose a different directory if you prefer.

Compiling OpenBLAS

First download the Julia and OpenBLAS (Open Optimized BLAS Library) source code. In the directory containing the files, perform the following steps.

Note:

  1. Simply invoking make (or gmake on BSD) will detect the CPU automatically. To set a specific target CPU, use make TARGET=xxx, e.g. make TARGET=HASWELL. The full target list is in the file TargetList.txt.
  2. This will make the compilation run faster by using all the cores of your CPU. To find the number of cores, run: nproc. The default installation directory is /opt/OpenBLAS.

cd $HOME/Downloads
tar -zxvf OpenBLAS*
cd OpenBLAS*
make -j $(nproc) 
sudo make install PREFIX=/opt/OpenBLAS

or

cd $HOME/Downloads
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS*
git checkout v0.3.5
make -j $(nproc) 
sudo make install PREFIX=/opt/OpenBLAS

Modifying the directory /opt/OpenBLAS (Soft Links)

In order for the Julia compilation to link correctly against the OpenBLAS library installed under /opt, we have to create some soft links in the /opt/OpenBLAS/lib directory.

Some of the soft links were already created when the OpenBLAS library was installed. To create the remaining symbolic links, do the following:

cd /opt/OpenBLAS/lib
sudo ln -sf libopenblas_haswellp-r0.3.5.so libblas.so
sudo ln -sf libopenblas_haswellp-r0.3.5.so libcblas.so
sudo ln -sf libopenblas_haswellp-r0.3.5.so liblapack.so
sudo cp -a lib* /usr/lib64
sudo cp -a lib* /usr/lib

Note:

1 - Make sure that there is no installed version of BLAS, LAPACK or OpenBLAS in /usr. The command cp -a lib* will copy the compiled OpenBLAS library files, along with the symbolic links just created, so they can be used throughout the system.

2 - Note that the libopenblas_haswellp-r0.3.5.so file may have a different name on your machine, depending on the OpenBLAS version and your computer’s architecture. It usually has a name of the form libopenblas_xxx. If so, adjust the file name in the commands above.

Cloning the Julia Project

First, clone the Julia project from GitHub. With git installed and configured, do:

cd $HOME/Downloads && git clone git://github.com/JuliaLang/julia.git
cd julia

After all the cloned project files have been downloaded, check out the version you want to compile, for example v1.1.0. To see the available versions, list the tags of the cloned project (git tag -l).

git checkout v1.1.0

Creating the Make.user file

Next, the OpenBLAS library in the /opt/OpenBLAS/lib/ directory must be added to the LD_LIBRARY_PATH environment variable. On Linux, LD_LIBRARY_PATH is a colon-separated set of directories where libraries are searched first, before the default set of directories. This makes the Julia compilation consider the OpenBLAS library in /opt/OpenBLAS/lib/. In the cloned directory, create the Make.user file with the following content:

cd $HOME/Downloads/julia
MARCH=native
LDFLAGS=-Wl,-rpath,/usr/lib64
LDFLAGS+=-Wl,-rpath,/usr/lib
LDFLAGS+=-Wl,-rpath,/opt/OpenBLAS/lib
OPENBLAS_DYNAMIC_ARCH=0
USE_SYSTEM_BLAS=1
USE_SYSTEM_LAPACK=1

or

cd $HOME/Downloads/julia
echo "USE_SYSTEM_XXX=1
MARCH=native
LDFLAGS=-Wl,-rpath,/usr/lib64
LDFLAGS+=-Wl,-rpath,/usr/lib
LDFLAGS+=-Wl,-rpath,/opt/OpenBLAS/lib
OPENBLAS_DYNAMIC_ARCH=0
USE_SYSTEM_BLAS=1
USE_SYSTEM_LAPACK=1" > Make.user

Note: Other paths of libraries of interest can be added to the Make.user file by doing LDFLAGS+=-Wl,-rpath,/path/of/library.

Now, in the cloned Julia directory, with the version of interest checked out, compile the language:

cd $HOME/Downloads/julia
make -j $(nproc)

echo "prefix=/opt/julia" >> Make.user
cd /opt 
sudo mkdir julia 
cd $HOME/Downloads/julia
sudo make install
sudo ln -sf /opt/julia/bin/julia /usr/bin
julia
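
To double-check that the freshly built Julia really picked up the OpenBLAS from /opt, you can list the shared libraries loaded into the session, for example:

using LinearAlgebra, Libdl

# Paths of the loaded shared libraries that mention "openblas";
# this should point into /opt/OpenBLAS/lib if the linking worked.
filter(lib -> occursin("openblas", lowercase(lib)), Libdl.dllist())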

Note: In my tests, the procedure of compiling OpenBLAS separately and later compiling the Julia language provided greater computational efficiency.

This tutorial is archived at: https://github.com/prdm0/compiling_julia/blob/master/README.md.

Best regards.
