Parallelization of long- and short-running tasks

Assume you have a long-running, “expensive” function, e.g.,

function longrun()
    dim = 20000
    A = rand(dim, dim) / dim
    A^256
end

which may take a few minutes to complete, and a function that is short-running and latency-critical, e.g.,

using Dates # required to print inter-iteration times
function heartbeat()
    dim = 1000
    A = rand(dim, dim) / dim

    t = now()
    while true
        A^256
        println("alive: $(now() - t)")
        t = now()
    end
end

whose individual iterations in the while loop typically take less than a second to complete (if run separately).

I’m calling these functions, after starting Julia with julia -t auto, as follows:

routines = [longrun, heartbeat]

@sync for routine in routines
    Threads.@spawn routine()
end

As I have plenty of resources, I’d naively expect the heartbeat function to print continuously, independently of the long-running function.

This is however not the case. The long-running function seems to block the heartbeat function almost entirely:

alive: 447 milliseconds
starting exponentiation
alive: 391 milliseconds
exponentiation done
alive: 136699 milliseconds
alive: 205 milliseconds
alive: 265 milliseconds

I’m aware that this is a known problem (e.g., see How to perform ongoing background work with a real time ui - #4 by tkf), but I’m not sure how to tackle it best (without introducing dedicated processes). I’ve also tried reducing the number of BLAS threads, or using a Julia-native matrix multiplication library instead (Octavian.jl), but the issue remained.

I suspect that using two different thread pools (as introduced with 1.9) may be an option, but I fail to see how exactly.

Any suggestions would be more than welcome. I’m using Julia v1.9 on Linux.

Here’s the full code (including a failed attempt at using 1.9’s interactive feature):

using Dates

function longrun()
    dim = 20000
    A = rand(dim, dim) / dim

    println("starting exponentiation")
    A^256
    println("exponentiation done")
end

function heartbeat()
    dim = 1000
    A = rand(dim, dim)/dim

    t = now()
    while true
        A^256

        println("alive: $(now()-t)")
        t = now()
    end
end

@sync begin
    Threads.@spawn :interactive heartbeat()
    Threads.@spawn longrun()
end

You are right that using the interactive threadpool is the solution. What does Threads.nthreads(:interactive) give you? I think the problem is that if you have export JULIA_NUM_THREADS=X defined, that (I think mistakenly) sets the number of interactive threads to 0. This can be fixed with export JULIA_NUM_THREADS=X,1, but I will make an issue for this.

2 Likes

Thank you for the prompt reply.

I’ve been starting Julia via

julia --threads 8,8

which yields (as expected)

Threads.nthreads(:interactive) = 8
Threads.nthreads(:default) = 8

Hence, there seem to be several interactive threads available!

The other thing that might be a problem here is that you are trying to do a significant amount of work on the interactive thread. Part of the design of the interactive threadpool is that, for it to stay interactive, you aren’t supposed to do significant amounts of computation on it. If you replace your heartbeat function with

function heartbeat()
    t = now() 
    while (true)
        sleep(.3)
        println("alive: $(now()-t)")
        t = now()
    end
end

it works as expected. To log more useful information, you probably want to set up a channel that the worker thread pushes to at regular intervals; the interactive thread then just reads the results from the worker thread and prints them.
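For what it’s worth, a minimal sketch of that pattern could look as follows (the names worker and printer, the channel size, and the bounded iteration count are just for illustration):

using Dates

const status = Channel{String}(32)

function worker()
    dim = 1000
    A = rand(dim, dim) / dim
    for i in 1:10
        A^256                                    # heavy work stays on a :default thread
        put!(status, "iteration $i at $(now())")
    end
    close(status)                                # ends the printer's loop
end

function printer()
    for msg in status                            # blocks until a message arrives
        println("alive: $msg")
    end
end

@sync begin
    Threads.@spawn :interactive printer()
    Threads.@spawn worker()
end

Because the interactive task only blocks on take! inside the channel iteration and never computes anything itself, it should stay responsive regardless of what the worker does.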

1 Like

Thank you very much once more for your answer.

I confirm that removing any matrix exponentiation from the heartbeat function solves the issue (even without any interactive threads!).

However, even with a small matrix exponentiation in the heartbeat function, there is complete thread blocking. Specifically, I’ve tested with a matrix of size 100×100, whose exponentiation takes less than 1 millisecond. I’ve tested both with and without interactive threads.

My use case is as follows: In the context of a microservice, I have a few routines with a “cycle time” of between 1 and 100 ms, some of which involve number crunching. In addition, I have one routine with a cycle time of a few minutes, with some heavy, monolithic number crunching (exponentiation of a matrix of size 10,000 × 10,000). This is why, in my example, both the heartbeat and the longrun function involve some linear algebra.

Is there a way of avoiding thread starvation within a single process in such a use case (not necessarily involving “interactive threads”), or are separate processes the only way to go?

OpenBLAS appears to have some global structures (buffers, IIRC) managed by locks, which prevent progress in some nested threading situations like yours. You might try using MKL instead, but it may take some environment variables or C calls to get it right. @carstenbauer may know more.
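(For reference, and assuming MKL.jl is installed, switching backends should just be a matter of loading it before doing any linear algebra, since libblastrampoline forwards to the most recently loaded backend:)

using MKL            # load before heavy LinearAlgebra use; swaps the BLAS backend
using LinearAlgebra

BLAS.get_config()    # should now report MKL as the active LBT backend

# Optionally reduce BLAS-internal threading to limit contention with Julia tasks:
BLAS.set_num_threads(1)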

(In the long run perhaps some of us can find resources to help the OpenBLAS team attack this challenging issue).

Thank you very much. I had also suspected it was related to BLAS, but then replaced the matrix exponentiation with a Julia-native implementation using Octavian.jl.

The problem is slightly alleviated, but clearly still persists:

using Octavian
using Dates

function longrun()
    dim = 10000
    A = rand(dim, dim) / dim

    println("starting exponentiation")

    for i in 1:3
        A = matmul(A, A)
    end

    println("exponentiation done")
end

function heartbeat()
    dim = 1000
    A = rand(dim, dim)/dim

    t = now()
    while true
        for i in 1:3
            A = matmul(A, A)
        end

        println("alive: $(now()-t)")
        t = now()
    end
end

@show Threads.nthreads(:interactive)
@show Threads.nthreads(:default)

@sync begin
    Threads.@spawn :interactive heartbeat()
    Threads.@spawn longrun()
end

yields

$ julia --threads 8,8 mt.jl
Threads.nthreads(:interactive) = 8
Threads.nthreads(:default) = 8
alive: 167 milliseconds
alive: 45 milliseconds
alive: 217 milliseconds
alive: 43 milliseconds
starting exponentiation
alive: 235 milliseconds
alive: 503 milliseconds
alive: 13166 milliseconds
alive: 38 milliseconds
alive: 44671 milliseconds
alive: 303 milliseconds
exponentiation done
alive: 18345 milliseconds
alive: 28 milliseconds
alive: 93 milliseconds
alive: 26 milliseconds

So, the issue may not be related exclusively to OpenBLAS, and at this moment, I continue to believe that separate processes are the only way to go.

(I had also played with the single-threaded matrix multiplication provided by Octavian.jl, matmul!(…, nthreads=1), and if I remember correctly, even then the issue persisted.)

I’m able to grapple with this idea at a high level, but I’d like to understand it more deeply. Are you familiar with the thread scheduling code? Would you be able to point me toward the code that I can read in order to understand scheduling at a more granular level?

The code here is mostly in task.c, but the code isn’t that important for understanding here. The way to use this is that the interactive task should only be used for very quick computations; if you need to do something that takes a while, spawn a regular task to do the work.
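For instance (a hypothetical sketch, keeping the matrix size from above and bounding the loop for brevity), the heartbeat task could delegate each heavy step to the :default pool and merely wait on the result:

using Dates

function heartbeat()
    dim = 1000
    A = rand(dim, dim) / dim

    t = now()
    for _ in 1:5    # bounded for the sketch; the original uses while true
        # The heavy computation runs on a :default thread; the interactive
        # task only blocks in fetch, so it stays responsive.
        fetch(Threads.@spawn :default A^256)
        println("alive: $(now() - t)")
        t = now()
    end
end

wait(Threads.@spawn :interactive heartbeat())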

1 Like