# Pseudo-inverse of large matrix very slow

Hi, first time posting here. I need to compute the pseudo-inverse of a large matrix (2504 × 2502) as part of an algorithm I am developing in Python (solving Ax = b). The pseudo-inverse is a clear bottleneck in terms of computation time, so I decided to look at Julia as an alternative, to check if it could improve the times.

However, when benchmarking against Python I am actually seeing slower times with Julia. I read that LinearAlgebra uses OpenBLAS, so I installed the MKL.jl package, which enforces MKL usage instead of OpenBLAS, but even so, I seem to get slower computation times than with python/numpy.

Python:

```python
import numpy as np
A = np.random.rand(2504, 2502)
```
```python
%%timeit -r 5 -n 10
A_pinv = np.linalg.pinv(A)
```

leads to:

```
6.59 s ± 630 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
```

And Julia:

```julia
using LinearAlgebra
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 800  # so it runs all samples and evals
# BLAS.vendor() outputs :mkl
A = rand(2504, 2502)
```
```julia
@benchmark LinearAlgebra.pinv(x) setup=(x=$A) samples=10 evals=5
```

leads to:

```
BenchmarkTools.Trial:
  memory estimate:  382.53 MiB
  allocs estimate:  29
  --------------
  minimum time:     7.398 s (0.61% GC)
  median time:      7.974 s (0.73% GC)
  mean time:        8.687 s (0.69% GC)
  maximum time:     10.871 s (0.65% GC)
  --------------
  samples:          10
  evals/sample:     5
```

I also ran this more than once to make sure it wasn't including "loading time" for the benchmarking tools.
Any clues why this is so slow? Should I be using another package or function?

Probably itās using a different number of threads?

You can call `LinearAlgebra.BLAS.set_num_threads` for OpenBLAS in Julia. I think with MKL you have to set the `MKL_NUM_THREADS` environment variable (before launching Julia)?

PS. If you are solving a least-squares or similar non-invertible system, in most cases you should use a pivoted QR factorization rather than a pseudo-inverse, via `A \ b` … or equivalently by `QR = qr(A, Val{true}())` followed by `QR \ b` (if you want to re-use the factorization for multiple right-hand sides). See also this post. (If you have an underdetermined solve and want the minimum-norm solution, you can use the LQ factorization instead.)
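To see the point in numpy terms as well (my sketch, not code from the thread): a direct least-squares solve produces the same solution as explicitly forming the pseudo-inverse and multiplying, without ever materializing `pinv(A)`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 120))  # tall, full-column-rank system
b = rng.standard_normal(200)

# Solve min ||Ax - b|| directly via LAPACK's least-squares driver ...
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# ... versus explicitly forming the pseudo-inverse (SVD-based) and multiplying.
x_pinv = np.linalg.pinv(A) @ b

print(np.allclose(x_lstsq, x_pinv))  # True: same solution, less work
```

The same reasoning carries over to `A \ b` in Julia, which dispatches to a factorization-based solve instead of computing an explicit pseudo-inverse.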

Switching languages will not help you here. Everyone is using the same dense linear-algebra libraries (LAPACK with some backend, typically either OpenBLAS or MKL). The only way to improve your performance if you are limited by LAPACK is by changing your algorithm (e.g. to use QR rather than SVD, or doing something even more clever).
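To make the "change your algorithm" point concrete: `pinv` is built on the SVD, the most expensive of the standard dense factorizations, which is why switching to QR helps. A small numpy sketch (my illustration) of that equivalence:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 40))

# pinv(A) is V * diag(1/s) * U^T, computed from the thin SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
pinv_from_svd = Vt.T @ np.diag(1.0 / s) @ U.T

print(np.allclose(pinv_from_svd, np.linalg.pinv(A)))  # True
```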

Julia helps when you need to write your own performance-critical inner-loop code.


Thanks for the reply!
Indeed, you are correct about using the QR factorization to solve the system; it is much faster. Thank you for the suggestion.

However, Iām still confused why the Julia code is slower than numpy/python. In my Julia notebook, I tried setting

```julia
ENV["MKL_NUM_THREADS"] = 16
ENV["JULIA_NUM_THREADS"] = 16
```

And even set `LinearAlgebra.BLAS.set_num_threads(16)` after loading the packages to make sure it is set. However, I'm not actually sure the MKL thread count is being affected by any of this. This is my `versioninfo()` after doing the above:

```
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-1030NG7 CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, icelake-client)
Environment:
  JULIA_NUM_THREADS = 16
```

However, when running the previous example again, I still get approximately the same results. Maybe this is not the correct way of setting the number of threads for MKL? I searched but could not find much.


You have to set environment variables like `JULIA_NUM_THREADS` before launching Julia. (In an IJulia notebook, you do this by calling `IJulia.installkernel` to install a new kernel specification with the environment variables you want.) Also, note that the `JULIA_NUM_THREADS` is for native Julia threading, which is disjoint from OpenBLAS threading (for now).
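In shell terms (a sketch, using the variable names discussed above), that means exporting the variables in the shell that launches Julia, not assigning `ENV[...]` inside an already-running session:

```shell
# Export thread-count variables *before* launching Julia; setting
# ENV["..."] inside a running session is too late for these.
export JULIA_NUM_THREADS=16   # native Julia threads
export MKL_NUM_THREADS=16     # MKL's BLAS thread pool (when using the MKL backend)
# julia                       # then launch Julia from this same shell
echo "$JULIA_NUM_THREADS"
```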

See also the discussion here.


Thanks, it seems to work now, although it is still around the same speed as python/numpy.

Yes, I think this should be expected:


However, the speed of the same code in `matlab` is faster, by about 10–100 times.

Did you try `using MKL` first?

No, I just used Distribution and gave it 8 threads. Can you give me a good example that is as fast as or faster than matlab?

Which package is this? Could you post the code and the benchmark results?

Well then, do `using MKL` first on v1.7


Sorry, I misremembered. I didn't use the `Distributions` pkg or `Distributed` pkg.
My test code is as follows:

```
# in cmd
set JULIA_NUM_THREADS=6
julia # start julia
```
```julia
using LinearAlgebra
const steps = 1000
@time Threads.@threads for a in 1:steps
    A = randn(1000, 1000)
    A = A[:, 1:400]
    C = pinv(A)
end
```

Or maybe writing it as a function would improve the speed?

```julia
using LinearAlgebra

function pinvf2()
    Threads.@threads for a in 1:steps
        local A, B, C
        A = randn(1000, 1000)
        A = A[:, 1:400]
        C = pinv(A)
    end
    # return C
end

const steps = 1000
@time pinvf2()
```

Can you give me a good example that is as fast as or faster than matlab?

Did youā¦ try with `using MKL` on v1.7?


I am just a basic-level user, but I would recommend writing your test like this:

```julia
using MKL, LinearAlgebra, BenchmarkTools

function pinvf2(steps)
    for _ in 1:steps
        A = randn(1000, 1000)
        pinv(view(A, :, 1:400))
    end
end

steps = 100
@btime pinvf2($steps)  #  5.221 s (3100 allocations: 2.54 GiB)
```

My Julia is set up to run with 8 threads, and I believe these are automatically used by `LinearAlgebra` in the computation of `pinv`. I have removed the multi-threading of the for loops for simplicity.
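For a like-for-like comparison against numpy, the equivalent Python loop would look roughly like this (my sketch of the benchmark, shown here with a tiny `steps` value so it finishes quickly):

```python
import numpy as np

def pinvf2(steps):
    """numpy analogue of the Julia benchmark loop above."""
    for _ in range(steps):
        A = np.random.randn(1000, 1000)
        # Slice to 1000x400 and take the pseudo-inverse, as in the Julia code.
        P = np.linalg.pinv(A[:, :400])
    return P.shape

print(pinvf2(2))  # shape of the last pseudo-inverse: (400, 1000)
```

Timing this with `%timeit` (or `time.perf_counter`) at the same `steps` should give a fair numpy-vs-Julia comparison, since both end up in the same LAPACK SVD routine.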

```
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, icelake-client)
Environment:
  JULIA_PKG_USE_CLI_GIT = true
  JULIA_STACKFRAME_FUNCTION_COLOR = blue
  JULIA_WARN_COLOR = cyan
  JULIA_EDITOR = code.cmd -g
  JULIA_NUM_THREADS = 8
```

I think the bottom line is @stevengj's answer further above in this thread.

Please search in Discourse for questions like yours using the tag `MATLAB`.

No, when I try to install `MKL`, it reports errors.

Then make a new thread about that error if you actually want people to be able to help you with it.

I posted an issue on GitHub:

```
julia> using Pkg; Pkg.add("MKL")
    Updating registry at `C:\Users\xuanQS\.julia\registries\General.toml`
   Resolving package versions...
  No Changes to `C:\Users\xuanQS\.julia\environments\v1.7\Project.toml`
  No Changes to `C:\Users\xuanQS\.julia\environments\v1.7\Manifest.toml`

julia> using MKL
[ Info: Precompiling MKL [33e6dc65-8f57-5167-99aa-e5a354878fb2]
ERROR: LoadError: InitError: could not load library "C:\Users\xuanQS\.julia\artifacts\28c8373199bf79bd94ce76e3f45eeaef6d9c1c47\bin\mkl_core.1.dll"
The specified module could not be found.
Stacktrace:
  [1] dlopen(s::String, flags::UInt32; throw_error::Bool)
    @ Base.Libc.Libdl .\libdl.jl:117
  [2] dlopen(s::String, flags::UInt32)
    @ Base.Libc.Libdl .\libdl.jl:117
  [3] macro expansion
    @ C:\Users\xuanQS\.julia\packages\JLLWrappers\bkwIo\src\products\library_generators.jl:54 [inlined]
  [4] __init__()
    @ MKL_jll C:\Users\xuanQS\.julia\packages\MKL_jll\kG4RZ\src\wrappers\x86_64-w64-mingw32.jl:10
  [5] _include_from_serialized(path::String, depmods::Vector{Any})
    @ Base .\loading.jl:768
  [6] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String)
    @ Base .\loading.jl:854
  [7] _require(pkg::Base.PkgId)
    @ Base .\loading.jl:1097
  [8] require(uuidkey::Base.PkgId)
    @ Base .\loading.jl:1013
  [9] require(into::Module, mod::Symbol)
    @ Base .\loading.jl:997
 [10] include
    @ .\Base.jl:418 [inlined]
 [11] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
    @ Base .\loading.jl:1318
 [12] top-level scope
    @ none:1
 [13] eval
    @ .\boot.jl:373 [inlined]
 [14] eval(x::Expr)
    @ Base.MainInclude .\client.jl:453
 [15] top-level scope
    @ none:1
during initialization of module MKL_jll
in expression starting at C:\Users\xuanQS\.julia\packages\MKL\HNCoo\src\MKL.jl:1
ERROR: Failed to precompile MKL [33e6dc65-8f57-5167-99aa-e5a354878fb2] to C:\Users\xuanQS\.julia\compiled\v1.7\MKL\jl_A636.tmp.
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:33
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, ignore_loaded_modules::Bool)
   @ Base .\loading.jl:1466
 [3] compilecache(pkg::Base.PkgId, path::String)
   @ Base .\loading.jl:1410
 [4] _require(pkg::Base.PkgId)
   @ Base .\loading.jl:1120
 [5] require(uuidkey::Base.PkgId)
   @ Base .\loading.jl:1013
 [6] require(into::Module, mod::Symbol)
   @ Base .\loading.jl:997
```

My Julia version is 1.7.

Thanks for your help. Can you test how much time it takes when `steps` is 1000? In matlab and numpy, it's about 60 s.

@xuanquansheng, here is the result of the requested benchmark using `@btime`:

```julia
steps = 1000
@btime pinvf2($steps)
  47.334 s (31000 allocations: 25.39 GiB)
```

Thanks again. The speed seems to be the same as `numpy`; however, the `MKL` pkg may not be so stable, and maybe using `view` and writing the code as a function are techniques needed for speed in Julia, while in `numpy` you can just write a plain for loop.