Question on the behaviour and performance of computing the determinant

Confusing behaviour

I’ll jump directly into the behaviour I’m observing; please see below for context (under the other header). The direct questions can be found in bold.

I have a function that computes the determinant as part of it:

function diag_penalty(m)
    # isnzero() is my own check for whether a value is "numerically zero"
    if isnzero(abs(det(m)))
        return 10
    end
    return 0
end

When I run my program using this function, the execution takes around 1 minute and 18 seconds (±2 s), as measured by a progress bar from ProgressMeter (this function is called 2805 times during that execution).

Taking a look at htop during the execution, I notice that the program uses all CPU threads. Threads.nthreads() shows 1, so: How is det() using all available threads, and how can I disable it?

I verify that it is det() causing this behaviour by switching the relevant line so that the function instead becomes

function diag_penalty(m)
    if any(isnzero.(eigvals(m)))
        return 10
    end
    return 0
end

My program then finishes in 20 seconds (±1s) measured in the same way as before, and looking at htop the execution happens on one thread.

It seems odd that the eigenvalue calculation is faster than the determinant calculation, especially since the determinant appears parallelized. How can a calculation using det() be slower than eigvals() here?

My suspicion that this is indeed strange is confirmed by comparing the two in a Jupyter notebook:

using LinearAlgebra
mats = [ rand(22,22) for _ in 1:200000 ]
@time for m in mats
    det(m)
end
@time for m in mats
    eigvals(m)
end

  9.433334 seconds (1.20 M allocations: 836.174 MiB, 1.39% gc time)
 17.810404 seconds (3.20 M allocations: 7.546 GiB, 1.40% gc time)

So the determinant is indeed faster here, but not in my program. I verify with htop that the det() part is computed on all available threads. What can I be doing wrong in my program that makes a computation that was basically twice as fast become much more than four times as slow?

Context

I’m not completely sure which parts of my whole program are relevant, and I have not yet worked out a minimal example, but I will sketch the structure here for context. Questions and suggestions are welcome to help pinpoint the issue.

I’m designing a fitness function to be used with BlackBoxOptim. The fitness function is divided into what I have been calling “penalties”, and the above diag_penalty() is one of many. The structure of the fitness function is basically:

function calc_fit(x, parameters; kwargs)
    m = prep(x)                     # prepare input
    weights, ... = parameters       # weights and other parameters handed over to calc_fit()
    penalties = []                  # compute and store all penalties
    push!(penalties, diag_penalty(m))
    push!(penalties, diff_penalty(m))
    ...
    return penalties                # fitness generally weighted sum of penalties, 
                                    # but one of the kwargs used by aforementioned 
                                    # program returns the penalties instead
end

The program mentioned above does not use BlackBoxOptim, but just reads in a bunch of inputs to the fitness function from a file and evaluates the fitness function on all of them (17745 different inputs). The inputs are the same every time, so diag_penalty() is called an equal number of times in both tests. For the purpose of this post, I also evaluated the fitness function on all inputs before executing the rest of the program, to make sure that the extra time is not somehow going into Julia having to compile det().

The matrices are such that the if in diag_penalty() is triggered equally often in each test, hence whatever the program does afterwards is independent of which version of diag_penalty() I’m using.

Edit: fixed the number of inputs the test is run on.

I don’t have a good answer to your overall problem, but for the specific question of how det() is using all available threads:

The answer is that det ultimately calls the OpenBLAS library to compute the determinant (unless you built Julia against a different BLAS library such as MKL), and OpenBLAS is threaded. By default it will use up to as many threads as you have physical cores. You can force it to run single-threaded by running

LinearAlgebra.BLAS.set_num_threads(1)

It seems like there’s a problem with det when running multithreaded with small matrices. Benchmarking det on a 1000×1000 random matrix I see acceptable parallel scaling, but with small matrices like the 22×22 random ones in your test I see pretty much linearly negative parallel scaling on my 10-core desktop machine. It makes sense that threading wouldn’t give a speedup for such small matrices, but I’m surprised it hurts performance so badly.

Anyway, I would re-run your tests with LinearAlgebra.BLAS.set_num_threads(1) and see if you get performance more in line with your expectations.
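For reference, this is roughly how I’d check and pin the BLAS thread count before re-running (assuming a Julia version recent enough to have BLAS.get_num_threads; on older versions only the setter is available):

using LinearAlgebra

BLAS.get_num_threads()   # how many threads OpenBLAS is currently using
BLAS.set_num_threads(1)  # force single-threaded BLAS for the small 22×22 matrices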


I don’t think this is the best way to do benchmarking. You should try BenchmarkTools for this:

julia> using BenchmarkTools

julia> @btime det(M) setup=(M=rand(22,22));
  7.983 μs (3 allocations: 4.22 KiB)

julia> @btime eigvals(M) setup=(M=rand(22,22));
  84.274 μs (14 allocations: 39.52 KiB)

I’m seeing a factor of 10 difference here.


It depends on the data types. It could be that your determinant calculation is using some data-type where it falls through to the generic linear-algebra fallback (which works for any type, but is not as fast as LAPACK for Float64), whereas for eigenvalues it always converts to floating-point and calls LAPACK/OpenBLAS.

It’s hard to say without a working example that illustrates your problem.

Checking whether a determinant is nonzero looks pretty suspicious to me from a numerical standpoint, by the way. The matrix could be singular but have a nonzero determinant due to roundoff error (or vice versa), and in general it’s rarely useful to have a sharp distinction between exactly singular and nearly singular (badly conditioned) matrices. Could you explain more about what problem you are solving and why a determinant arises?
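To illustrate the pitfall (a toy example of mine, not from your code): a matrix can be numerically singular while its determinant is an ordinary nonzero number, and a simple rescaling changes the determinant by orders of magnitude without changing how singular the matrix is.

using LinearAlgebra

A = [1.0 1.0; 1.0 1.0 + 1e-14]   # numerically singular for most practical purposes
det(A)          # ≈ 1e-14, which a threshold test might still call "nonzero"
det(1e3 .* A)   # ≈ 1e-8, same matrix up to scaling, very different determinant
cond(A)         # ≈ 4e14, the condition number flags the near-singularity directly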


First of all, thanks to everyone chipping in. I will reply to your individual comments below, but since I had several questions, I will first outline what is going on here. TL;DR at the bottom.

The long story

It boils down to comparing three computations:

  • determinant on multiple threads
  • determinant on single thread
  • eigenvalues (not parallelized)

The matrices relevant for me are 22×22, so not too large I suppose. My expectation was that these would perform, fastest to slowest, in the order listed above, that is:

  1. determinant on multiple
  2. determinant on single
  3. eigenvalues

What I was observing was this order instead:

  1. determinant on single
  2. eigenvalues
  3. determinant on multiple

Before my question here, I did not know how to test “determinant on single”, but thanks to @Pbellive’s reply I could test it with

LinearAlgebra.BLAS.set_num_threads(1)

The same reply also gives me the answer to why “determinant on multiple” performs so badly: my matrices are too small to take advantage of the parallelization.

The remaining central question would then be why “determinant on multiple” beats “eigenvalues” in the test I did in the Jupyter notebook mentioned in my post, but not with my saved matrices.

The only difference between the two tests is the matrices, nothing else. Which is a bit of a bummer, because it meant there was no chance of you being able to answer this, so I feel a bit guilty for taking up your time. Still, this is the lesson I drew from it, so continue reading if you want to know.

The 17745 matrices I have saved and tested on have integer entries -1, 0, and 1. I redid my test after @stevengj’s comment,

It depends on the data types. It could be that your determinant calculation is using some data-type where it falls through to the generic linear-algebra fallback (which works for any type, but is not as fast as LAPACK for Float64), whereas for eigenvalues it always converts to floating-point and calls LAPACK/OpenBLAS.

to make sure that they were converted to Float64 before doing my test, and no luck there. Somehow, “eigenvalues” were still faster.

I tried to emulate my matrices more closely by replacing the rand() sampling with

convert.(Array{Float64,2}, [ round.(Int, 2 .* rand(22,22) .- 1) for _ in 1:200000 ])

Still no luck: “eigenvalues” are still much slower for these matrices compared to “determinant on multiple”. Another property of my matrices is that they have a lot of zeroes (the mean number of non-zero entries in my collection of 17745 matrices was ≈37.4), so I emulated this property by instead sampling a normal distribution with small variance:

using Distributions   # needed for Normal
convert.(Array{Float64,2}, [ round.(Int, rand(Normal(0, .283), (22,22))) for _ in 1:200000 ])

For this sample, I observed the same ordering as with my saved matrices!

So what happens is that as the number of zeroes increases, the eigenvalue calculation gets faster, eventually winning over “determinant on multiple”.
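For anyone who wants to reproduce this quickly, here is a compact sketch of the comparison (my own condensed version, using BenchmarkTools instead of the @time loops above; exact numbers will depend on the machine and BLAS thread count):

using LinearAlgebra, BenchmarkTools

dense   = rand(22, 22)                                   # essentially no zero entries
sparsey = Float64.(round.(Int, 0.283 .* randn(22, 22)))  # mostly zeroes, like my saved matrices

@btime det($dense)
@btime det($sparsey)
@btime eigvals($dense)
@btime eigvals($sparsey)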

TL;DR

conclusion

The calculation of eigenvalues can be faster than the multithreaded determinant calculation if the matrix has a lot of zeroes. The calculation of the determinant for my rather small matrices is fastest with a single thread in all tests.

summary of questions & answers

How is det() using all available threads and how can I disable it?

LinearAlgebra.BLAS.set_num_threads(1), thanks to @Pbellive.

How can a calculation using det() be slower than eigvals() here? What can I be doing wrong in my program that makes a computation that was basically twice as fast become much more than four times as slow?

Turns out nothing was wrong. My matrices just happened to be such that this occurred; many zeroes speed up the computation of eigenvalues.

Edit: Hmm… replying on this site apparently does not work as I hoped. Sorry for the mess…

Thanks for this, definitely the answer to that question, and it helped me progress.

As you can see in my large summary of the resolved problem, it was not the whole story, but indeed this helped.

Thank you.

I was indeed not too careful about this, but it did not turn out to be my issue. Good to know for the future, though.

Yes. After posting I wanted to abandon the use of det() anyway for this reason – I still wanted to figure out the performance issue I was seeing, though.

I wanted to check the diagonalizability of a matrix, so using rank(), det(), or eigvals() on the matrix of eigenvectors were some of the options we were exploring. The determinant was particularly dangerous since the eigenvectors are normalized, so the determinant is naturally small and hard to distinguish from a numerical zero. And isnzero(), I realize now, is a very bad name for my function that checks whether something appears to be a numerical zero (and not “is not zero”).
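For completeness, isnzero() is something along these lines (a hypothetical sketch, not the actual code; the tolerance is illustrative):

# Hypothetical sketch of the (badly named) numerical-zero check:
isnzero(x; atol = 1e-10) = abs(x) <= atol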

Thanks. I will take this into consideration if I want to make proper benchmarks in the future.

Having a sharp distinction between diagonalizable and non-diagonalizable (defective) is rarely useful. Even matrices that are nearly defective (where the eigenvector matrix is very ill-conditioned) are problematic to diagonalize. If the eigenvector matrix is ill-conditioned, then eigenvectors and eigenvalues are extremely sensitive to small perturbations and they are rarely useful ways of looking at the problem, as well as leading to numerically inaccurate algorithms. (Trefethen has a book on this topic.)

You can use cond to check the condition number of the eigenvector matrix, which is a better metric (than rank, det, or eigvals) for determining how close to defective the problem is. However, if you have matrices that may be nearly defective, you may need to re-think your whole approach in order to avoid diagonalization entirely.
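For example, something like this sketch (near_defective and the threshold are illustrative names/values of mine, not a standard API):

using LinearAlgebra

# Gauge how close m is to defective via the conditioning of its eigenvector matrix.
function near_defective(m; tol = 1e8)
    V = eigvecs(m)        # matrix of right eigenvectors
    return cond(V) > tol  # a huge condition number means "nearly defective"
end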

(You haven’t said what underlying problem you are solving, so it is hard to say more.)


Thanks, I will take a look at cond.

Unfortunately, forgetting about it is not an option.

What we are trying to find is an integer matrix n such that m = n * d * n' * d (with d some given integer matrix) has the smallest possible abs(tr(m)/2), given a number of constraints on m (and on a matrix similar to m). This is the one-sentence summary of the problem we are looking at; the theory behind it is in [1]. One of these conditions is diagonalizability of m.
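In code terms, the quantity we are minimizing is roughly the following (a sketch only; the constraints are omitted):

using LinearAlgebra

# Objective for given integer matrices n and d (constraints on m not shown):
objective(n, d) = abs(tr(n * d * n' * d) / 2)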

We have had great success finding such solutions using differential evolution (via BlackBoxOptim.jl) and other search methods. Some of the conditions we have to obey for m we know we will never be able to satisfy exactly, but we can verify the search results in Mathematica. We simply need the search to weed out as many bad candidates as possible, as fast as possible, without throwing out any good candidates.

[1] arXiv:1003.4867, F-Theory and the Landscape of Intersecting D7-Branes