QR decomposition in julia 2x slower than octave

question

#1

I noticed that QR decomposition is 2x slower in julia than octave. For example on my computer I get

julia> A = rand(5000, 5000); @time qr(A);
 27.075522 seconds (37 allocations: 765.687 MiB, 0.83% gc time)

and

>> A=rand(5000, 5000);
>> tic; qr(A); toc
Elapsed time is 8.35741 seconds.
>>

The results are more or less the same if repeated more times.

Anyone has an idea why is that? I thought both julia and octave use BLAS/Lapack, so both methods should take roughly the same amount of time.


#2

What does your versioninfo() say? Did you rebuild the system image if you used the generic binary?


#3

I suspect they may be doing different things. I don’t know what qr method Octave calls when no output arguments are provided, but my guess is that it’s just doing the 1st phase (computing the Householder reflectors and upper-triangular part) and not the 2nd (building Q). This would roughly match qrfact in Julia. On my machine, I get:

julia> @time qrfact(A);
  5.906331 seconds (25.99 k allocations: 194.912 MiB, 2.13% gc time)

julia> @time qr(A);
 16.202900 seconds (93.73 k allocations: 770.718 MiB, 1.62% gc time)

#4
julia> A = rand(5000, 5000); @time qr(A);
 14.143329 seconds (294.06 k allocations: 781.888 MiB, 0.90% gc time)

julia> versioninfo()
Julia Version 0.7.0-DEV.4690
Commit 78c7d87369* (2018-03-23 22:25 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 1

julia> A = rand(5000, 5000); @time qrfact(A);
  5.583176 seconds (2.32 k allocations: 193.608 MiB, 1.85% gc time)

julia>

versus

>> A=rand(5000, 5000);
>> tic; qr(A); toc
Elapsed time is 4.189508 seconds.
>> version

ans =

    '9.4.0.813654 (R2018a)'

#5

That would explain the difference. If you call the function qr in octave with both return values [Q, R] = qr(A), it takes more time, since it has to generate Q

>> A = rand(5000); tic; [Q,R] = qr(A); toc
Elapsed time is 17.8421 seconds.
>> tic; qr(A); toc
Elapsed time is 9.22192 seconds.

#6

Not so for Matlab:

>> A=rand(5000, 5000);
>> tic; qr(A); toc
Elapsed time is 4.189508 seconds.
>> version

ans =

    '9.4.0.813654 (R2018a)'

>> tic; [q,r]=qr(A); toc
Elapsed time is 5.785541 seconds.

#7

MATLAB uses MKL. For an out of the box comparison there it would be interesting to test JuliaPro+MKL against MATLAB.


#8

Still, there is a big difference in JuliaPro (MKL):

julia> A = rand(5000, 5000); @time qr(A);
  6.561612 seconds (37 allocations: 765.687 MiB, 2.20% gc time)

julia> versioninfo()
Julia Version 0.6.0
Commit 903644385b* (2017-06-19 13:05 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  BLAS: mkl_rt
  LAPACK: mkl_rt
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

and MATLAB:

>> tic, A=rand(5000, 5000); [Q,R] = qr(A); toc
Elapsed time is 2.760503 seconds.

#9

Have you tried using qrfact instead of qr?
In my case the performance of qrfact(A) is approx. 2.5X faster than qr(A), for A = rand(5000, 5000).


#10

I get similar results:


julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY HASWELL)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> using BenchmarkTools

julia> A = rand(5000, 5000);

julia> @btime qr($A)
  9.746 s (33 allocations: 765.69 MiB)

julia> @btime qrfact($A)
  4.266 s (17 allocations: 193.48 MiB)

Julia is running in Virtualbox. Native windows Matlab 2018a:

>> tic; A=rand(5000, 5000); [Q,R] = qr(A); toc
Elapsed time is 3.738918 seconds.

#11

Use the same computer. Timings mean nothing when the same computer isn’t used. VirtualBox is a free shared cloud resource: I would expect it to not have the best performance.


#12

It is the same computer. Virtualbox is running locally.


#13

Oh my bad. For some reason I read JuliaBox…


#14

Heh, that would indeed be a meaningless comparison :slight_smile:


#15

What about Julia(Pro) with MKL? MATLAB is MKL.


#16

I hadn’t realized that there’s a free version of JuliaPro… I’ll need to try it out. If nobody does it first, I’ll post my results in a few days.


#17

With MKL under Windows:

versioninfo()
Julia Version 0.6.2
Commit d386e40c17* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz
  WORD_SIZE: 64
  BLAS: mkl_rt
  LAPACK: mkl_rt
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, westmere)

Julia:

using BenchmarkTools
@btime qr($A)
14.230 s (33 allocations: 765.69 MiB)
@btime qrfact($A)
5.645 s (17 allocations: 193.48 MiB)

Matlab:

tic; A=rand(5000, 5000); [Q,R] = qr(A); toc
Elapsed time is 8.418957 seconds.

#18

In my experience the QR routines benefit from tuning. In particular, for sizes like 5000x5000, hyperthreading is counterproductive and the default block size is too small. So setting BLAS.set_num_threads(nr_of_physical_cores) and calling the LAPACK routines to manually reset the block size (default is 36):

julia> A=rand(5000,5000);

julia> A1=copy(A); @time F=Base.LinAlg.LAPACK.geqrt!(A1,256);
  1.460697 seconds (14 allocations: 19.532 MiB)

julia> FF=Base.LinAlg.QRCompactWY(F...);

# this finishes computing Q:
julia> B=eye(A); @time Base.LinAlg.LAPACK.gemqrt!('L','N',FF.factors,FF.T,B);
  2.057016 seconds (7 allocations: 190.735 MiB, 0.56% gc time)

julia> @time x=qr(A); # for comparison
  7.118840 seconds (35 allocations: 765.687 MiB, 2.22% gc time)

julia> vecnorm(x[1]-B) # consistency check
5.419168097206542e-13

This is about as fast as Matlab on a comparable machine. (Aside: these times are from a quiet system; I saw wild variation when other programs were minimally active.)

Here I am using Julia 0.6.2 with OpenBLAS.