QR decomposition in julia 2x slower than octave

I noticed that QR decomposition is 2x slower in julia than octave. For example on my computer I get

julia> A = rand(5000, 5000); @time qr(A);
 27.075522 seconds (37 allocations: 765.687 MiB, 0.83% gc time)

and

>> A=rand(5000, 5000);
>> tic; qr(A); toc
Elapsed time is 8.35741 seconds.
>>

The results are more or less the same if repeated more times.

Anyone has an idea why is that? I thought both julia and octave use BLAS/Lapack, so both methods should take roughly the same amount of time.

1 Like

What does your versioninfo() say? Did you rebuild the system image if you used the generic binary?

I suspect they may be doing different things. I don’t know what qr method Octave calls when no output arguments are provided, but my guess is that it’s just doing the 1st phase (computing the Householder reflectors and upper-triangular part) and not the 2nd (building Q). This would roughly match qrfact in Julia. On my machine, I get:

julia> @time qrfact(A);
  5.906331 seconds (25.99 k allocations: 194.912 MiB, 2.13% gc time)

julia> @time qr(A);
 16.202900 seconds (93.73 k allocations: 770.718 MiB, 1.62% gc time)
2 Likes
julia> A = rand(5000, 5000); @time qr(A);
 14.143329 seconds (294.06 k allocations: 781.888 MiB, 0.90% gc time)

julia> versioninfo()
Julia Version 0.7.0-DEV.4690
Commit 78c7d87369* (2018-03-23 22:25 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 1

julia> A = rand(5000, 5000); @time qrfact(A);
  5.583176 seconds (2.32 k allocations: 193.608 MiB, 1.85% gc time)

julia>

versus

>> A=rand(5000, 5000);
>> tic; qr(A); toc
Elapsed time is 4.189508 seconds.
>> version

ans =

    '9.4.0.813654 (R2018a)'

That would explain the difference. If you call the function qr in octave with both return values [Q, R] = qr(A), it takes more time, since it has to generate Q

>> A = rand(5000); tic; [Q,R] = qr(A); toc
Elapsed time is 17.8421 seconds.
>> tic; qr(A); toc
Elapsed time is 9.22192 seconds.

Not so for Matlab:

>> A=rand(5000, 5000);
>> tic; qr(A); toc
Elapsed time is 4.189508 seconds.
>> version

ans =

    '9.4.0.813654 (R2018a)'

>> tic; [q,r]=qr(A); toc
Elapsed time is 5.785541 seconds.

MATLAB uses MKL. For an out of the box comparison there it would be interesting to test JuliaPro+MKL against MATLAB.

3 Likes

Still, there is a big difference in JuliaPro (MKL):

julia> A = rand(5000, 5000); @time qr(A);
  6.561612 seconds (37 allocations: 765.687 MiB, 2.20% gc time)

julia> versioninfo()
Julia Version 0.6.0
Commit 903644385b* (2017-06-19 13:05 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  BLAS: mkl_rt
  LAPACK: mkl_rt
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

and MATLAB:

>> tic, A=rand(5000, 5000); [Q,R] = qr(A); toc
Elapsed time is 2.760503 seconds.

Have you tried using qrfact instead of qr?
In my case the performance of qrfact(A) is approx. 2.5X faster than qr(A), for A = rand(5000, 5000).

I get similar results:


julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY HASWELL)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> using BenchmarkTools

julia> A = rand(5000, 5000);

julia> @btime qr($A)
  9.746 s (33 allocations: 765.69 MiB)

julia> @btime qrfact($A)
  4.266 s (17 allocations: 193.48 MiB)

Julia is running in Virtualbox. Native windows Matlab 2018a:

>> tic; A=rand(5000, 5000); [Q,R] = qr(A); toc
Elapsed time is 3.738918 seconds.

Use the same computer. Timings mean nothing when the same computer isn’t used. VirtualBox is a free shared cloud resource: I would expect it to not have the best performance.

It is the same computer. Virtualbox is running locally.

1 Like

Oh my bad. For some reason I read JuliaBox…

Heh, that would indeed be a meaningless comparison :slight_smile:

What about Julia(Pro) with MKL? MATLAB is MKL.

I hadn’t realized that there’s a free version of JuliaPro… I’ll need to try it out. If nobody does it first, I’ll post my results in a few days.

With MKL under Windows:

versioninfo()
Julia Version 0.6.2
Commit d386e40c17* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz
  WORD_SIZE: 64
  BLAS: mkl_rt
  LAPACK: mkl_rt
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, westmere)

Julia:

using BenchmarkTools
@btime qr($A)
14.230 s (33 allocations: 765.69 MiB)
@btime qrfact($A)
5.645 s (17 allocations: 193.48 MiB)

Matlab:

tic; A=rand(5000, 5000); [Q,R] = qr(A); toc
Elapsed time is 8.418957 seconds.

In my experience the QR routines benefit from tuning. In particular, for sizes like 5000x5000, hyperthreading is counterproductive and the default block size is too small. So setting BLAS.set_num_threads(nr_of_physical_cores) and calling the LAPACK routines to manually reset the block size (default is 36):

julia> A=rand(5000,5000);

julia> A1=copy(A); @time F=Base.LinAlg.LAPACK.geqrt!(A1,256);
  1.460697 seconds (14 allocations: 19.532 MiB)

julia> FF=Base.LinAlg.QRCompactWY(F...);

# this finishes computing Q:
julia> B=eye(A); @time Base.LinAlg.LAPACK.gemqrt!('L','N',FF.factors,FF.T,B);
  2.057016 seconds (7 allocations: 190.735 MiB, 0.56% gc time)

julia> @time x=qr(A); # for comparison
  7.118840 seconds (35 allocations: 765.687 MiB, 2.22% gc time)

julia> vecnorm(x[1]-B) # consistency check
5.419168097206542e-13

This is about as fast as Matlab on a comparable machine. (Aside: these times are from a quiet system; I saw wild variation when other programs were minimally active.)

Here I am using Julia 0.6.2 with OpenBLAS.

4 Likes