Pseudo-inverse of large matrix very slow

Hi, first time posting here. I need to compute the pseudo-inverse of a large (2504 × 2502) matrix as part of an algorithm I am developing in Python (solving Ax = b). In Python, the pseudo-inverse is a clear bottleneck in terms of computation time, so I decided to look at Julia as an alternative, to check if it could improve the times.

However, when benchmarking against Python I am actually seeing slower times in Julia. I read that LinearAlgebra uses OpenBLAS, so I installed the MKL.jl package, which enforces MKL usage instead of OpenBLAS, but even so, I seem to get slower computation times than with Python/NumPy.

Python:

import numpy as np
A = np.random.rand(2504, 2502)

# in a separate notebook cell (%%timeit must be the first line of a cell):
%%timeit -r 5 -n 10
A_pinv = np.linalg.pinv(A)

leads to:

6.59 s ± 630 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

And Julia:

using LinearAlgebra
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 800  # so it runs all samples and evals
# BLAS.vendor() outputs :mkl
A = rand(2504, 2502)
@benchmark LinearAlgebra.pinv(x) setup=(x=$A) samples=10 evals=5

leads to:

BenchmarkTools.Trial: 
  memory estimate:  382.53 MiB
  allocs estimate:  29
  --------------
  minimum time:     7.398 s (0.61% GC)
  median time:      7.974 s (0.73% GC)
  mean time:        8.687 s (0.69% GC)
  maximum time:     10.871 s (0.65% GC)
  --------------
  samples:          10
  evals/sample:     5

I also ran this more than once to make sure it wasn't including "loading time" for the benchmarking tools.
Any clues why this is so slow? Should I be using another package or function?

Probably it's using a different number of threads?

You can call LinearAlgebra.BLAS.set_num_threads for OpenBLAS in Julia. I think with MKL you have to set the MKL_NUM_THREADS environment variable (before launching Julia)?
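For example, something along these lines (a minimal sketch; the thread count of 4 is arbitrary, and BLAS.get_num_threads assumes Julia 1.6 or later):

using LinearAlgebra

BLAS.set_num_threads(4)   # OpenBLAS thread count can be changed at runtime
BLAS.get_num_threads()    # confirm the setting (Julia >= 1.6)

For MKL, something like export MKL_NUM_THREADS=4 in the shell before starting Julia should do it.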

PS. If you are solving a least-squares or similar non-invertible system, in most cases you should use pivoted QR rather than a pseudo-inverse, via A \ b … or equivalently by QR = qr(A, Val(true)) followed by QR \ b (if you want to re-use the factorization for multiple right-hand sides). See also this post. (If you have an underdetermined solve and want the minimum-norm solution, you can use the LQ factorization instead.)
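For example, a minimal sketch of the factor-once, solve-many pattern (the sizes just mirror the ones in your question; on Julia 1.7+ the pivoted form is spelled qr(A, ColumnNorm())):

using LinearAlgebra

A = rand(2504, 2502)
b = rand(2504)

x = A \ b             # one-shot least-squares solve (pivoted QR for rectangular A)

F = qr(A, Val(true))  # column-pivoted QR, computed once
x1 = F \ b            # reuse the factorization ...
x2 = F \ rand(2504)   # ... for another right-hand side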

Switching languages will not help you here. Everyone is using the same dense linear-algebra libraries (LAPACK with some backend, typically either OpenBLAS or MKL). The only way to improve your performance if you are limited by LAPACK is by changing your algorithm (e.g. to use QR rather than SVD, or doing something even more clever).

Julia helps when you need to write your own performance-critical inner-loop code.

11 Likes

Thanks for the reply!
Indeed, you are correct about using QR factorization to solve the system; it is much faster. Thank you for the suggestion.

However, I'm still confused about why the Julia code is slower than NumPy/Python. In my Julia notebook, I tried setting

ENV["MKL_NUM_THREADS"] = 16
ENV["JULIA_NUM_THREADS"] = 16

And even set LinearAlgebra.BLAS.set_num_threads(16) after loading the packages to make sure it is set. However, I'm not actually sure MKL's thread count is being affected by this. This is my versioninfo() after doing the above:

Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-1030NG7 CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, icelake-client)
Environment:
  JULIA_NUM_THREADS = 16

However, when running the previous example again, I still get approximately the same results. Maybe this is not the correct way of setting the number of threads for MKL? I searched but could not find much.

1 Like

You have to set environment variables like JULIA_NUM_THREADS before launching Julia. (In an IJulia notebook, you do this by calling IJulia.installkernel to install a new kernel specification with the environment variables you want.) Also, note that JULIA_NUM_THREADS is for native Julia threading, which is disjoint from OpenBLAS threading (for now).
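For example (a minimal sketch; the kernel name and thread counts are placeholders, and env is the documented installkernel keyword):

using IJulia

# create a kernel spec that launches Julia with the desired environment
installkernel("Julia 16 threads",
              env = Dict("JULIA_NUM_THREADS" => "16",
                         "MKL_NUM_THREADS"   => "16"))

Then pick the new kernel in Jupyter and restart the notebook.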

See also the discussion here.

4 Likes

Thanks, it seems to work now, although it is still around the same speed as Python/NumPy.

Yes, I think this is to be expected.

3 Likes

However, the speed of the same code in MATLAB is faster, by about 10–100 times.

Did you try using MKL first?

No, I just used Distributed and gave it 8 threads. Can you give me a good example that is as fast as or faster than MATLAB?

Which package is this? Could you post the code and the benchmark results?

Well then, do using MKL first on v1.7
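If it loads, you can verify which backend is active with something like this (BLAS.get_config assumes Julia 1.7+, where the libblastrampoline mechanism is used):

using MKL
using LinearAlgebra

BLAS.get_config()   # should list an MKL library among the loaded backends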

1 Like

Sorry, I misremembered. I didn't use the Distributions or Distributed packages.
My test code is as follows:

# in cmd
set JULIA_NUM_THREADS=6
julia # start julia
using LinearAlgebra
const steps = 1000
@time Threads.@threads for a in 1:steps
    A = randn(1000, 1000)
    A = A[:, 1:400]
    C = pinv(A)
end

Or would writing it as a function improve the speed?

using LinearAlgebra
function pinvf2()
    Threads.@threads for a in 1:steps
        local A, C
        A = randn(1000, 1000)
        A = A[:, 1:400]
        C = pinv(A)
    end
    # return C
end

const steps = 1000
@time  pinvf2()

Can you give me a good example that is as fast as or faster than MATLAB?

Did you… try with using MKL on v1.7?

1 Like

I am just a basic-level user, but I would recommend writing your test like this:

using MKL, LinearAlgebra, BenchmarkTools

function pinvf2(steps)
    for _ in 1:steps
        A = randn(1000,1000)
        pinv(view(A,:,1:400))
    end
end

steps = 100
@btime pinvf2($steps)  #  5.221 s (3100 allocations: 2.54 GiB)

My Julia is set up to run with 8 threads, and I believe these are automatically used by LinearAlgebra in the computation of pinv. I have removed the multi-threading of the for loop for simplicity.
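If you want to check that, something like this should work (BLAS.get_num_threads assumes Julia 1.6 or later):

using LinearAlgebra

BLAS.get_num_threads()   # number of threads the BLAS backend will use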

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, icelake-client)
Environment:
  JULIA_PKG_USE_CLI_GIT = true
  JULIA_STACKFRAME_FUNCTION_COLOR = blue
  JULIA_WARN_COLOR = cyan
  JULIA_EDITOR = code.cmd -g
  JULIA_NUM_THREADS = 8

I think the bottom line is @stevengj's answer further above in this thread.

Please search in Discourse for questions like yours using the tag MATLAB.

No, when I try to install MKL, it reports errors.

Then make a new thread about that error if you actually want people to be able to help you with it.

I posted an issue on GitHub.

julia> using Pkg; Pkg.add("MKL")
    Updating registry at `C:\Users\xuanQS\.julia\registries\General.toml`
   Resolving package versions...
  No Changes to `C:\Users\xuanQS\.julia\environments\v1.7\Project.toml`
  No Changes to `C:\Users\xuanQS\.julia\environments\v1.7\Manifest.toml`

julia> using MKL
[ Info: Precompiling MKL [33e6dc65-8f57-5167-99aa-e5a354878fb2]
ERROR: LoadError: InitError: could not load library "C:\Users\xuanQS\.julia\artifacts\28c8373199bf79bd94ce76e3f45eeaef6d9c1c47\bin\mkl_core.1.dll"
The specified module could not be found.
Stacktrace:
  [1] dlopen(s::String, flags::UInt32; throw_error::Bool)
    @ Base.Libc.Libdl .\libdl.jl:117
  [2] dlopen(s::String, flags::UInt32)
    @ Base.Libc.Libdl .\libdl.jl:117
  [3] macro expansion
    @ C:\Users\xuanQS\.julia\packages\JLLWrappers\bkwIo\src\products\library_generators.jl:54 [inlined]
  [4] __init__()
    @ MKL_jll C:\Users\xuanQS\.julia\packages\MKL_jll\kG4RZ\src\wrappers\x86_64-w64-mingw32.jl:10
  [5] _include_from_serialized(path::String, depmods::Vector{Any})
    @ Base .\loading.jl:768
  [6] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String)
    @ Base .\loading.jl:854
  [7] _require(pkg::Base.PkgId)
    @ Base .\loading.jl:1097
  [8] require(uuidkey::Base.PkgId)
    @ Base .\loading.jl:1013
  [9] require(into::Module, mod::Symbol)
    @ Base .\loading.jl:997
 [10] include
    @ .\Base.jl:418 [inlined]
 [11] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
    @ Base .\loading.jl:1318
 [12] top-level scope
    @ none:1
 [13] eval
    @ .\boot.jl:373 [inlined]
 [14] eval(x::Expr)
    @ Base.MainInclude .\client.jl:453
 [15] top-level scope
    @ none:1
during initialization of module MKL_jll
in expression starting at C:\Users\xuanQS\.julia\packages\MKL\HNCoo\src\MKL.jl:1
ERROR: Failed to precompile MKL [33e6dc65-8f57-5167-99aa-e5a354878fb2] to C:\Users\xuanQS\.julia\compiled\v1.7\MKL\jl_A636.tmp.
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:33
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, ignore_loaded_modules::Bool)
   @ Base .\loading.jl:1466
 [3] compilecache(pkg::Base.PkgId, path::String)
   @ Base .\loading.jl:1410
 [4] _require(pkg::Base.PkgId)
   @ Base .\loading.jl:1120
 [5] require(uuidkey::Base.PkgId)
   @ Base .\loading.jl:1013
 [6] require(into::Module, mod::Symbol)
   @ Base .\loading.jl:997

My Julia version is 1.7.

Thanks for your help. Can you test how much time it takes when steps is 1000? In MATLAB and NumPy, it's about 60 s.

@xuanquansheng, here is the result of the requested benchmark using @btime:

steps = 1000
@btime pinvf2($steps)
  47.334 s (31000 allocations: 25.39 GiB)

Thanks again. The speed seems to be the same as NumPy. However, the MKL package may not be so stable, and perhaps using view and writing the code as a function are techniques for speed here, while in NumPy you just write a for loop.