Multithreading a loop with ARPACK eigs()

I’m attempting to multithread a calculation in which I loop over a function that calls on ARPACK’s eigs(). However, whenever I insert the Threads.@threads macro in front of the loop I either get an error or Julia crashes altogether. The following MWE reproduces this behavior on my system:

using Arpack
Threads.@threads for i in 1:10
    λ, ϕ = eigs(rand(500, 500), maxiter = 1000)
end

If it makes things any clearer, I get the following stacktrace:

ERROR: TaskFailedException:
ARPACKException: unspecified ARPACK error: -9999
Stacktrace:
 [1] aupd_wrapper(::Type{T} where T, ::Arpack.var"#matvecA!#24"{Array{Float64,2}}, ::Arpack.var"#18#25", ::Arpack.var"#19#26", ::Int64, ::Bool, ::Bool, ::String, ::Int64, ::Int64, ::String, ::Float64, ::Int64, ::Int64, ::Array{Float64,1}) at C:\Users\Julius\.julia\packages\Arpack\o35I5\src\libarpack.jl:76
 [2] _eigs(::Array{Float64,2}, ::LinearAlgebra.UniformScaling{Bool}; nev::Int64, ncv::Int64, which::Symbol, tol::Float64, maxiter::Int64, sigma::Nothing, v0::Array{Float64,1}, ritzvec::Bool) at C:\Users\Julius\.julia\packages\Arpack\o35I5\src\Arpack.jl:181
 [3] #eigs#10 at C:\Users\Julius\.julia\packages\Arpack\o35I5\src\Arpack.jl:46 [inlined]
 [4] #eigs#9 at C:\Users\Julius\.julia\packages\Arpack\o35I5\src\Arpack.jl:45 [inlined]
 [5] macro expansion at .\REPL[2]:2 [inlined]
 [6] (::var"#2#threadsfor_fun#3"{UnitRange{Int64}})(::Bool) at .\threadingconstructs.jl:61
 [7] (::var"#2#threadsfor_fun#3"{UnitRange{Int64}})() at .\threadingconstructs.jl:28
Stacktrace:
 [1] wait(::Task) at .\task.jl:267
 [2] top-level scope at .\threadingconstructs.jl:69

The ARPACKException changes from time to time, I’ve also seen 1 and 3. I get the sense ARPACK is not suitable for multithreading like this, but I don’t really understand why, and I’m curious if there’s a workaround.

This is using Julia 1.4.2 and Arpack 0.4.0, running on a Windows system.

My understanding is that Julia’s threading system doesn’t necessarily play nicely with others, which isn’t so surprising given all that threads have to orchestrate.

Is using a different parallel approach an option for you? I would recommend trying the above with Distributed instead. That should work fine.

The underlying implementation is not thread safe, so undefined behaviour is to be expected. I highly recommend @stabbles ArnoldiMethod.jl as an alternative.

Thank you for your suggestions. I’ve tried to set it up using Distributed. However, if I run Julia using julia -p auto (and check that numworkers() returns 4), and then execute the following:

@time @sync @distributed for i in 1:10
	λ, ϕ = eigs(rand(500, 500), maxiter = 1000)
end

I don’t get any speedup whatsoever compared to running it without the @distributed macro. I verified that Distributed does work by doing a dummy test where I evaluate sum(rand(Bool, 1000000)) up to a thousand times. Then I get a factor of two improvement compared to serialized execution.

eigs is itself multithreaded. So in a multicore machine all the processors will be used even without using julia’s Threads or Distributed.

Using Distributed, it is possible that in each julia process, BLAS is spawing its own threads which can compete for resources with the BLAS threads from the other processes. You can try to make
@everywhere BLAS.set_num_threads(1),
so that in each julia process there is only one BLAS thread. Even then, it is not guaranteed that you will see an improvement: it will depend on the size of the matrices and how many you are diagonalizing. In some cases, it might be better to not use Distributed and have eigs use all the cores of your machine.

2 Likes

To be more precise, it is the ARPACK callback routines which are multithreaded (using the normal Julia infrastructure, incl. broadcast and BLAS) - the underlying library is not. As you point out, the consequences of this are problem size dependent. Profiling is required, but roughly speaking:

  • For large matrices, the optimum approach will be just to call eigs on each matrix in turn, allowing BLAS to saturate the processor. This could be combined with a distributed approach over multiple machines.

  • For small matrices, where initialisation and ARPACK itself constitute proportionally more work, then one can either take the distributed approach discussed (limit BLAS to single core, distribute over multiple processes and loop over a subset), or use a fully multithreaded library such as that I linked to previously. I have found the latter approach to be very effective when looking for a small number of eigenvalues (which I assume is the intent of the OP given the call to eigs).

If the sample code is truly representative of the problem (e.g., dense matrices of size 500 x 500), and a GPU is available, it might also be worth profiling a full dense factorisation using the batched routines provided by CuSolver. I am unsure if there is a simple interface provided by CUDA.jl but the routines are wrapped and can be called with a little work.

3 Likes

Thank you for all the suggestions.

@samuelpowell my matrix is actually quite large, 213444x213444 elements, albeit sparse. With your and @Bruno_Amorim’s remarks I think it makes sense that there was no improvement in performance. I have tried the ArnoldiMethod.jl package you mentioned, and this actually improved the performance by quite a bit, so thank you for the recommendation!

1 Like