Understanding what -t does

I am running a single-threaded program and it executes ~10% faster when I run it with -t 2 compared to when I run with -t 1, i.e. when I allow two threads instead of one. The program does not explicitly refer to anything in Base.Threads. Looking at CPU usage, it seems that only one core is used in both cases. What could explain the 10% difference in performance? Does using “-t 2” mean that a second core is used by Julia for other small tasks that I did not notice? It is important for me to restrict all computation to one core.

Hard to say, but maybe something you use (from Base or other libraries) looks at the number of threads and makes use of it. How long does the program usually take? (If it takes very little time, the difference could be noise.) Can you watch a fast-refreshing usage monitor while it runs (or save its output to a log), to check whether it really does not spawn a second thread?

How are you measuring performance? Are you including the time to launch julia?

I use CPUTime and the @CPUelapsed macro to measure the call that I use as benchmark. This measured call is single-threaded. So time to launch is not included. Time to compile is not an issue: repeated calls in the same session all lead to the same observation.

Can you use @btime and friends to measure the main function within your script?

My understanding is that @btime measures wall time, not CPU time, so it does not really measure the CPU budget used by a program, but rather how long the program took to run given the other programs competing with it for resources. Correct me if I am wrong. I want to measure the CPU budget used by a program (I do not understand why wall time would be used for benchmarking, but that is another discussion).

Because things like OpenBLAS can multi-thread internally even if your Julia is single-threaded (-t 1), which makes CPU time appear many times longer (it is summed across threads).

Thanks, that is the kind of thing I am wondering about. Looking at my program, I’m not sure where there could be multi-threading. Is there any documentation on how to track that? More generally, is there no way at all to tell Julia to stick to one thread?

If two cores are used at 100% for 1s, then I want to measure 2s. I believe @CPUelapsed measures that correctly for me. Again, feel free to correct me.
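
For anyone who wants to see the two measurements side by side, here is a minimal sketch. It assumes the third-party CPUTime package is installed; busy is a made-up single-threaded workload, not the code from this thread. For a single-threaded function on an idle machine the two numbers should be close; if a library spawns extra threads, the CPU time can exceed the wall time.

```julia
using CPUTime  # third-party package, assumed to be installed

# Hypothetical single-threaded workload, for illustration only.
function busy(n)
    s = 0.0
    for i in 1:n
        s += sqrt(i)
    end
    return s
end

busy(10)  # run once so compilation is not included in the timings

wall = @elapsed busy(10^8)     # wall-clock seconds (Base)
cpu  = @CPUelapsed busy(10^8)  # CPU seconds for the process (CPUTime)

println("wall time: ", wall, " s")
println("CPU time:  ", cpu, " s")
```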

I used @btime as you suggested, 3 times for each setting. With one thread the times are 1.695, 1.697 and 1.691. With two threads the times are 1.676, 1.676 and 1.672. Not quite 10% but still a small difference.
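
To put a number on that, here is a quick sketch that compares the means of the six timings quoted above. It only does the arithmetic; with three samples per setting it says nothing about whether the gap is statistically meaningful.

```julia
using Statistics

# Timings quoted in the post above.
t1 = [1.695, 1.697, 1.691]  # -t 1
t2 = [1.676, 1.676, 1.672]  # -t 2

# Relative speedup of -t 2 over -t 1, based on the means.
rel = (mean(t1) - mean(t2)) / mean(t1)

println("mean with -t 1: ", round(mean(t1); digits = 4))
println("mean with -t 2: ", round(mean(t2); digits = 4))
println("relative difference: ", round(100 * rel; digits = 2), " %")
```

That comes out to a bit over 1%, so much smaller than the 10% seen with @CPUelapsed.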

Yes, start Julia with a single thread. But external libraries may still spawn multiple threads, which is the case for OpenBLAS for example, as already said above. Without seeing the code, it’s hard to guess what’s happening.
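
As a rough first check, you can ask Julia and the bundled BLAS how many threads each is configured to use. This is only a sketch: it covers Julia's own threads and the BLAS library, and tells you nothing about other external libraries that may create threads on their own.

```julia
using LinearAlgebra

# Threads available to Julia's own scheduler (set with -t / JULIA_NUM_THREADS).
julia_threads = Threads.nthreads()

# Threads the BLAS library (OpenBLAS by default) is configured to use.
blas_threads = BLAS.get_num_threads()

println("Julia threads: ", julia_threads)
println("BLAS threads:  ", blas_threads)
```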

Thanks. Here is the code, it is a basic implementation of the Edmonds-Karp algorithm for solving the maximum flow problem. However, even if there are multiple threads, shouldn’t CPUTime report the same numbers since the total CPU effort is the same? Actually I would expect at least as much with two threads as with one.

using DataStructures  # provides Deque

function edmondskarp(C::MT, F::MT, n::Int, s::Int, t::Int) where MT <: Matrix
    totalflow = zero(Float64)
    moreflow = true
    Q = Deque{Int}()
    pred = [ -1 for i ∈ 1:n ]
    for i ∈ 1:n, j ∈ 1:n
        F[i,j] = zero(Float64)
    end
    while moreflow
        # reset predecessors
        for i ∈ 1:n
            pred[i] = -1
        end
        push!(Q, s)
        while ! isempty(Q)
            cur = popfirst!(Q)
            for j in 1:n
                j == cur && continue
                if pred[j] == -1 && j ≠ s && C[cur, j] > F[cur, j]
                    pred[j] = cur
                    push!(Q, j)
                end
            end
        end
        # did we find an augmenting path?
        if pred[t] ≠ -1
            df = typemax(Float64)
            i, j = pred[t], t
            while i ≠ -1
                if df > C[i,j] - F[i,j]
                    df = C[i,j] - F[i,j]
                end
                i, j = pred[i], i
            end
            i, j = pred[t], t
            while i ≠ -1
                F[i,j] += df
                i, j = pred[i], i
            end
            totalflow += df
        else
            moreflow = false
        end
    end
    totalflow
end

It should be good practice not to do that, right? The library should take the number of threads given to Julia and use at most that many. Of course, that is in the hands of the library developer and cannot be enforced (I guess) if foreign code is used.

I’m not sure we’re talking about the same thing. I was referring to external shared binary libraries, like OpenBLAS, which are completely independent from Julia’s internal threading model.

So, how does one set the number of threads for OpenBLAS? And, in general, the total number of threads my Julia code uses? Like the OP, I sometimes run things that I need to ensure only run on n threads (say, on a shared server).

Specifically for OpenBLAS, I think:

LinearAlgebra.BLAS.set_num_threads(4)

?
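
A slightly fuller sketch of the same call, pinning BLAS to a single thread and reading the setting back. The value 1 is just the choice that matches the OP's one-core requirement; this only affects BLAS, not other libraries.

```julia
using LinearAlgebra

# Restrict the BLAS library to a single thread.
BLAS.set_num_threads(1)

# Read the setting back to confirm it took effect.
blas_threads = BLAS.get_num_threads()
println("BLAS threads after pinning: ", blas_threads)
```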

Yes, yes. But there is a Julia front-end to it, which defines (in this case) or should be able to define the number of threads used by OpenBLAS. That parameter should be set to the number of threads of Julia as default, in my opinion (in every package that defines such an interface).

In general external libraries are free to do whatever they want and Julia has no control over them. In this specific case, OpenBLAS happens to let you control the number of threads to use and Julia has an interface to that.

Sure, what I am saying is that most of the time, if not always, multi-threaded packages have some parameter that sets the number of threads to be used, and that Julia interfaces to those packages should be written so that this parameter defaults to the number of threads available to Julia. As I mentioned, you cannot enforce that from Julia, but it would be good practice to develop interfaces with that in mind.
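
To make the suggestion concrete, here is a sketch of what such a default could look like, using BLAS as the stand-in for the wrapped library. ThreadAwareWrapper is a hypothetical module name; this is an illustration of the convention, not how LinearAlgebra or any existing package actually behaves.

```julia
# Hypothetical wrapper package, for illustration only.
module ThreadAwareWrapper

using LinearAlgebra

# Runs when the module is loaded: default the library's thread count
# to the number of threads Julia was started with.
function __init__()
    BLAS.set_num_threads(Threads.nthreads())
end

end # module

using LinearAlgebra
println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())
```

The user can still override the default afterwards by calling the library's own setter.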

Back to my original question, how can -t 2 lead to lower CPU time (as measured using CPUTime) than -t 1? Shouldn’t it be at least as much? I feel I am missing something here.

It would be helpful to provide a MWE. I tried your function with random matrices but could not observe anything.