Hi, first time posting here. I need to compute the pseudo-inverse of a large matrix (2504 × 2502) as part of an algorithm I am developing in Python (solving Ax = b). In Python, the pseudo-inverse is a clear bottleneck in terms of computation time, so I decided to look at Julia as an alternative, to check whether it could improve the times.
However, when benchmarking against Python I am actually seeing slower times with Julia. I read that LinearAlgebra uses OpenBLAS, so I installed the MKL.jl package, which enforces MKL usage instead of OpenBLAS, but even so I seem to get slower computation times than with Python/NumPy.
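To double-check that MKL was actually active, I inspected the loaded BLAS backend. A minimal check (BLAS.get_config() requires Julia 1.7 or newer; on older versions BLAS.vendor() serves the same purpose):
using MKL           # load MKL.jl so it replaces OpenBLAS
using LinearAlgebra
BLAS.get_config()   # Julia >= 1.7: lists the loaded BLAS/LAPACK libraries
# BLAS.vendor()     # older versions: should return :mkl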
Python:
import numpy as np
A = np.random.rand(2504, 2502)
%%timeit -r 5 -n 10
A_pinv = np.linalg.pinv(A)
leads to:
6.59 s ± 630 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
And Julia:
using LinearAlgebra
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 800 # so it runs all samples and evals
# BLAS.vendor() outputs :mkl
A = rand(2504, 2502)
@benchmark pinv($A)
BenchmarkTools.Trial:
memory estimate: 382.53 MiB
allocs estimate: 29
--------------
minimum time: 7.398 s (0.61% GC)
median time: 7.974 s (0.73% GC)
mean time: 8.687 s (0.69% GC)
maximum time: 10.871 s (0.65% GC)
--------------
samples: 10
evals/sample: 5
I also ran this more than once to make sure it wasn't including "loading time" for the benchmarking tools.
Any clues why this is so slow? Should I be using another package or function?
Probably it's using a different number of threads?
You can call LinearAlgebra.BLAS.set_num_threads for OpenBLAS in Julia. I think with MKL you have to set the MKL_NUM_THREADS environment variable (before launching Julia)?
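For example (a sketch; the right thread count depends on your machine):
using LinearAlgebra
BLAS.set_num_threads(8)   # takes effect immediately for OpenBLAS
BLAS.get_num_threads()    # query the current setting (Julia >= 1.6)
# For MKL, set MKL_NUM_THREADS in the shell before starting Julia,
# e.g.: export MKL_NUM_THREADS=8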
PS. If you are solving a least-squares or similar non-invertible system, in most cases you should use pivoted QR rather than a pseudo-inverse, via A \ b … or equivalently by QR = qr(A, Val{true}()) followed by QR \ b (if you want to re-use the factorization for multiple right-hand sides). See also this post. (If you have an underdetermined solve and want the minimum-norm solution, you can use the LQ factorization instead.)
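A minimal sketch of that (matrix size taken from the question; qr(A, Val{true}()) is the column-pivoted QR mentioned above):
using LinearAlgebra
A = rand(2504, 2502)
b = rand(2504)
x = A \ b                # one-shot least-squares solve via pivoted QR
F = qr(A, Val{true}())   # factor once (column-pivoted QR) ...
x1 = F \ b               # ... and re-use for multiple right-hand sides
x2 = F \ rand(2504)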
Switching languages will not help you here. Everyone is using the same dense linear-algebra libraries (LAPACK with some backend, typically either OpenBLAS or MKL). The only way to improve your performance if you are limited by LAPACK is by changing your algorithm (e.g. to use QR rather than SVD, or doing something even more clever).
Julia helps when you need to write your own performance-critical inner-loop code.
I even called LinearAlgebra.BLAS.set_num_threads(16) after loading the packages to make sure it was set. However, I'm not actually sure MKL's thread count is affected by this. This is my versioninfo() after doing the above:
However, when running the previous example again, I still get approximately the same results. Maybe this is not the correct way of setting the number of threads for MKL? I searched but could not find much.
You have to set environment variables like JULIA_NUM_THREADS before launching Julia. (In an IJulia notebook, you do this by calling IJulia.installkernel to install a new kernel specification with the environment variables you want.) Also, note that JULIA_NUM_THREADS is for native Julia threading, which is disjoint from OpenBLAS threading (for now).
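For example (a sketch; the kernel name here is arbitrary):
using IJulia
# Install a Jupyter kernel spec that starts Julia with the desired
# environment variables already set:
installkernel("Julia (8 threads)", env=Dict("JULIA_NUM_THREADS" => "8"))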
Sorry, I misremembered. I didn't use the Distributions package or the Distributed package.
My test code is as follows:
# in cmd
set JULIA_NUM_THREADS=6
julia # start Julia
using LinearAlgebra
const steps = 1000
@time Threads.@threads for a in 1:steps
    A = randn(1000, 1000)
    A = A[:, 1:400]   # keep the first 400 columns
    C = pinv(A)       # pseudo-inverse (computed via SVD)
end
Or would writing it as a function improve the speed?
using LinearAlgebra
const steps = 1000
function pinvf2()
    Threads.@threads for a in 1:steps
        A = randn(1000, 1000)
        A = A[:, 1:400]   # keep the first 400 columns
        C = pinv(A)       # loop-local, so each thread gets its own copy
    end
end
@time pinvf2()
Can you give me a good example where the speed is faster than, or the same as, MATLAB?
I am just a basic-level user, but I would recommend writing your test like this:
using MKL, LinearAlgebra, BenchmarkTools
function pinvf2(steps)
    for _ in 1:steps
        A = randn(1000, 1000)
        pinv(view(A, :, 1:400))   # a view avoids copying the 400 columns
    end
end
steps = 100
@btime pinvf2($steps) # 5.221 s (3100 allocations: 2.54 GiB)
My Julia is set up to run with 8 threads, and I believe these are automatically used by LinearAlgebra in the computation of pinv. I have removed the multi-threading of the for loops for simplicity.
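(You can verify the BLAS thread count from within Julia, assuming Julia 1.6 or newer:)
using LinearAlgebra
BLAS.get_num_threads() # number of threads the BLAS library will use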
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core™ i7-1065G7 CPU @ 1.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, icelake-client)
Environment:
JULIA_PKG_USE_CLI_GIT = true
JULIA_STACKFRAME_FUNCTION_COLOR = blue
JULIA_WARN_COLOR = cyan
JULIA_EDITOR = code.cmd -g
JULIA_NUM_THREADS = 8
Thanks again. The speed now seems about the same as NumPy. However, the MKL package may not be so stable, and it seems that using view and writing the code as a function are techniques needed for speed in Julia, while in NumPy you can just write a plain for loop.