Bug in sleep() function - main thread work affecting sleep duration on running tasks

The main thread is part of the interactive threadpool, so the best advice is perhaps to schedule computationally intensive tasks on the default threadpool instead (they can migrate within that threadpool to improve load balance, but won’t block the interactive one).
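To make that advice concrete, here is a minimal sketch (assuming Julia 1.9+, where `Threads.@spawn` accepts a pool symbol and `--threads N,M` requests N default plus M interactive threads):

```julia
# Launch with e.g. `julia --threads 4,2` → 4 default + 2 interactive threads.
using Base.Threads

# Computationally intensive work goes to the :default pool,
# so it cannot tie up the interactive pool (which contains the main thread):
heavy = Threads.@spawn :default sum(rand(10^7))

# Latency-sensitive work stays on the :interactive pool:
Threads.@spawn :interactive @info "still responsive"

fetch(heavy)
```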

2 Likes

You are right - however, in the current scenario (MWE running with nthreads > 1), the actual behavior looks as if a task spawned on a free interactive thread is migrating to the main thread, where the intensive work is being done (I tested it even with 12 threads - same result). Obviously, I am not saying that is literally what happens.

I am pretty confident that in the current MWEs' scenarios (both the one in the OP and the server-related one) it is not just a misuse/misinterpretation of the @spawn concept.

I think we agree that not having control over where a task will run doesn’t imply that the task will migrate to the busiest thread when it can migrate to any of the other 10 free threads. In the MWEs' context, although migration is not actually happening, the responsiveness of the tasks running on a thread other than the main thread actually depends on the main thread being free (however, keep in mind the MWE context - I am not claiming that every task behaves this way).

1 Like

I agree with this - however, if I have 12 interactive threads and my task can migrate to any of them… I wouldn’t expect the responsiveness of that task to depend on the main thread.

I agree that we can do some common-sense reconfiguration of the MWE and keep everything responsive. But again, the issue is that certain types of tasks depend on the main thread (regardless of running on the :interactive or :default threadpool). We can work around the issue, but it still brings with it a pretty extreme set of restrictions on how we should design programs in Julia.

1 Like

I think, at this point, it is not so relevant if we can or cannot find a workaround for the specific issue illustrated by the MWEs (we can).

The general advice of keeping the main thread for trivial computations only might be best practice, but we can still get into serious trouble.

Consider the following MWE:

using Base.Threads, Dates
using HTTP

function sleeper(s, c)
    counter = 0
    @time while counter < c
        sleep(s)
        counter += 1
    end
end

for _ in 1:1000
    @spawn :interactive sleeper(0.01, 1000)
end

readline()

You’ll notice that, with the particular parameters I used above, the average time for running one while loop is about 13.2 seconds (expected: ~10 seconds). And GC time cannot explain the additional 30+%.

That additional time comes from the `sleep` battle on the main thread (notice that I do not run any computation on the main thread).

Spawn 10,000 sleeping tasks, and the expected 10 seconds skyrockets to over 20 (and 5% GC time cannot explain a 2x penalty).

Now imagine that HTTP.jl is spawning new tasks for each connection: those tasks are doing exactly this kind of main-thread-dependent work.

So no - keeping the main thread free of serious computation will help, but the results still look very sad.

You can have a big CPU with lots of threads and Julia will get worse as the number of threads increases.

On my machine, running the example above with 10,000 tasks and 4 threads results in around 20 seconds per task. However, if I run the exact same code with 12 interactive threads, the time per task jumps to over 30 seconds.

A kind of “give it more resources so you can get worse results” situation.

I underline the fact that the main thread is not running any computation other than initially spawning the tasks.

And here is an even crazier thing: you actually need to downgrade your server resources if the number of connections/users increases - because you will get a less responsive server as you scale up the number of threads.

Have you opened an issue on the repo with this as the minimal reproducible example yet? I think it’s quite important that this is getting tracked there.

5 Likes

Trying to catch up - if I’m understanding this correctly, Julia’s scheduler only runs on the main task, so the issue is that while the main task is too busy to schedule, tasks in the queue stall waiting for a thread? I never knew much about the implementation details, and the docs don’t really lay this stuff out.

Again, I don’t quite know the implementation details, but IIRC from the 1.9 highlights, this might be a problem? Before 1.9, every thread was “default”, and an available thread took whatever task was at the front of the queue. Now there’s 2 pools of threads, default vs interactive, and you schedule a task to either pool. AFAIK there’s nothing different about the 2 pools, fewer tasks in a pool is just more responsive so you can separate out the few you need.

Spawning 1000 tasks into the interactive pool seems like it’d defeat the purpose. I think that’s why (and I can replicate it) the @time reports increase with number of tasks. I don’t think it’s unexpected for 1000 0.01s sleeps to end up taking >10s; sleep is documented to block the task for a specified time, that does not imply the task starts up after that time, only that it’s ready to take the next opportunity. The more often tasks yield back to the queue (good for interactivity) and the more tasks in the queue (bad for interactivity), the more work the scheduler does to find the next ready task, and that adds up. Replacing sleeper(0.01, 1000) with sleeper(10.0, 1) brings the @time from 11.852s to 10.014s because it does less of that work.

You should also say how many default and interactive threads your Julia process was set to, nthreads(:default) and nthreads(:interactive).
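For reference, the per-pool counts can be queried directly (a small sketch; `nthreads(::Symbol)` takes the pool name in Julia 1.9+):

```julia
using Base.Threads

# Report how many threads each pool was started with,
# e.g. via `julia --threads 4,2` (4 default + 2 interactive).
@info "thread pools" nthreads(:default) nthreads(:interactive)
```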

Not quite - imagine that everything is scheduled and the only “work” done by the tasks is to sleep repeatedly for short periods of time. The problem is that the sleep function (actually its low-level implementation) needs to access the main thread - and now you have a large number of tasks trying to acquire the same lock at the same time (I think it is a spin lock). This is also why increasing the number of threads produces even worse results.

The MWE can also be changed by adding a Ref{Bool} and only switching it to true after we know that all the tasks have been scheduled (and by doing this, we get even worse results: before, the initial tasks actually had less competition, since there were still some tasks left to be scheduled).

I think this also answers this: “that does not imply the task starts up after that time, only that it’s ready to take the next opportunity.”

Spawning on :default actually produces 3-5x worse results than on :interactive.

Actually, you can decrease the number of tasks and only shorten the sleep interval, and you still get the same penalty.

I tried various combinations of :default/:interactive (with max 12 threads) - the best results were achieved by a high number of interactive threads.

I agree that this is the case in general - but I don’t observe it here. Especially given that if I keep the tasks + yielding constant and only increase the available threads, things get worse. Also, keeping the tasks + threads constant but increasing the yields does not improve interactivity - on the contrary.

Maybe I’ll find the time to actually produce some graphs on all this (also comparing plain yield vs sleep).
Also - please be aware that doing actual work for the desired interval instead of sleep produces predictable results - and the scheduler has the same to-do list (e.g., finding the next task).

1 Like

Another interesting experiment is to mix “real” workers with sleepers.

And the result is revealing: the same number of yields - but the real workers are doing well. The sleep users… not so well:

If the real problem is related to the scheduler work, you would also expect to see the other yielders being impacted. But we cannot see that.

using Base.Threads, Dates
using HTTP

@info "Started with $(nthreads()) threads"

const ison = Ref{Bool}(false)

function sleeper(s, c)
    while !ison[]
        sleep(1)
    end
    counter = 0
    @time while counter < c
        sleep(s)
        counter += 1
    end
end

function worker(s, c)
    while !ison[]
        sleep(1)
    end
    counter = 0
    x = 0.0
    @time while counter < c
        t = now()
        while (now() - t) < Millisecond(s * 1000)
            x += rand()
        end
        yield()
        counter += 1
    end
    @info "worker finished."
end

for i in 1:1000
    if i % 100 == 0
        @spawn :interactive worker(0.01, 1000)
    else
        @spawn :interactive sleeper(0.01, 1000)
    end
end


sleep(10)

ison[] = true

readline()
1 Like

Was this the benchmark with @spawn (which defaults to :default) or @spawn :interactive? This needs clarification, given the various benchmarks showing up.

Not sure why this is, maybe with more threads yielding the scheduler has to do more work? I’d expect the @time to be worse as tasks wait more but the overall time (maybe add a big @time around everything) to be less as more tasks run at the same time.

Makes sense - more yields is just one condition for more interactivity, and it’s not an independent factor. When many tasks have to compete for a few threads, more yields just means more waiting.

I think it’s because workers are treated differently from sleepers. When a thread finds a worker in the queue after its 1 s big sleep, it can just take it. When a thread finds a sleeper after its 1 s big sleep, it has to check whether its little iterative sleeps are over, and if not, it pushes it to the back of the queue and checks the next one.

All the reported values were with :interactive (unless clearly stated that :default was used).

I think this premise/hypothesis is invalidated by the fact that when mixing “real” workers and “sleepers”, only the sleepers suffer the penalty (despite identical yield stats).

I tried to keep all things equal when comparing the “sleepers” and “workers” in the previous post (the video one). See the benchmark designed for the mixed scenario alone (my previous post).

Yeah, that’s the one I’m talking about. There look to be too many tasks to be worth printing a message before and after delineating when a specific task starts and ends, but I think it would be informative for both sleeper and worker if you printed an end message including the counter. Since each iteration yields once before incrementing the counter, the counter indicates how many times a task yielded. That could let you estimate how often a task ends up back on the queue and maybe what it does there.

Printing mostly breaks the experiment (I tried it before) - because it increases the time from about 30 seconds to about 500 seconds - and you see stuff like this:

sleeper 719 with counter 975
sleeper 798 with counter 997
sleeper 836 with counter 968
sleeper 306 with counter 977
sleeper 833 with counter 994
sleeper 262 with counter 995
sleeper 244 with counter 996
sleeper 56 with counter 991
sleeper 65 with counter 986
sleeper 523 with counter 985
sleeper 514 with counter 960

Not very revealing in my opinion.

My mistake, I misread the worker as being time-limited, but both versions have a counter maximum of c. It does seem strange that you’re printing counter numbers less than c=1000, though - were you printing inside the loop?

So to rectify my misconception, this is the difference between sleeper and worker.

        sleep(s)

        t = now()
        while (now() - t) < Millisecond(s * 1000)
            x += rand()
        end
        yield()

Sleeper yields to the queue and will decline threads for s seconds. Worker stays on a thread crunching numbers for s seconds before yielding to the queue. I think this does account for the difference. The workers come only every 100 iterations, but the scheduler blows through all the sleepers to the next worker, and the workers end up occupying the threads, stalling the sleepers’ increments.

That is not the case - you get the same outcome for the sleepers regardless of whether workers are present.

I only added workers to the mix to show that they are not affected by the same disease - their presence does not affect the sleeper outcome.

Also, if I add only one worker to the mix (spawned at the end as 1000th task), I get something like this:

10.039689 seconds (2.04 M allocations: 60.618 MiB)
worker 1000 finished with counter 1000.

and 6 seconds later... the sleepers

 16.523893 seconds (4.00 M allocations: 118.911 MiB, 0.06% compilation time)
 16.530893 seconds (4.01 M allocations: 119.007 MiB, 0.06% compilation time)
 16.567969 seconds (4.01 M allocations: 119.084 MiB, 0.06% compilation time)
 16.599408 seconds (4.02 M allocations: 119.301 MiB, 0.06% compilation time)
 16.565056 seconds (4.01 M allocations: 119.060 MiB, 0.06% compilation time)
 16.553888 seconds (4.01 M allocations: 118.957 MiB, 0.06% compilation time)
 16.591849 seconds (4.02 M allocations: 119.249 MiB, 0.06% compilation time)

No, there’s no dedicated scheduler task. It’s the yielding task that pops a new task from the queue as it yields. Explained in this blog post: Announcing composable multi-threaded parallelism in Julia (caveat that things may have changed since 2019, I don’t know any more about this than what I read in docs and blog posts).

2 Likes

Not yet - because there are multiple facets to it, I want to present a compelling and reproducible scenario (it turns out simple is harder than just putting together lots of examples).

I’ll do it at the start of the next week.

1 Like

To clarify, does “running on the main thread” mean Threads.threadid() == 1? Another possibility is Base.current_task() == Base.roottask.
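The two checks are genuinely different, since a task can migrate across threads while its identity stays fixed; a quick diagnostic sketch (note that `Base.roottask` is internal, so this is for debugging only):

```julia
using Base.Threads

# Which OS thread is executing this code *right now* (can change across yields):
on_thread_1 = threadid() == 1

# Whether we are the original root task (stable for the task's lifetime):
is_root = current_task() === Base.roottask
```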

This makes sense, but maybe the main-stalling-scheduler phenomenon is coming from this mentioned earlier?

But that goes into ccalled code so no idea how to profile what goes on there.

I have a suggestion: you’ve been occupying a thread by looping x += rand() and checking a timer in your worker tasks, but maybe you could simplify it with Libc.systemsleep? I’m not quite sure whether there’s some gotcha with multiple threads, but my understanding is that it occupies a thread for the specified amount of time, too.
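For illustration, a hypothetical variant of the earlier `worker` along these lines (assuming `Libc.systemsleep`, which blocks the OS thread without yielding, is an acceptable stand-in for s seconds of computation):

```julia
# Sketch: replace the now()/rand() busy-loop with Libc.systemsleep.
function worker_systemsleep(s, c)
    counter = 0
    @time while counter < c
        Libc.systemsleep(s)  # occupies the current thread for s seconds, no yielding
        yield()              # then return control to the scheduler, as the original worker did
        counter += 1
    end
end
```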

Maybe this wouldn’t be necessary for the test, but is it at all possible to emulate Base.sleep in pure Julia? AFAIK the idea is that the scheduler pushes the task to the back of the queue if the specified time has not passed - couldn’t we make the scheduler check now() - t < ms? Not that I know how that would even work…
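As a rough sketch of what such an emulation could look like: Base.sleep is, to my understanding, already built on a libuv timer, so the closest "pure Julia" equivalent just waits on a Timer rather than having the scheduler poll now() - t:

```julia
# Approximate stand-in for Base.sleep(s): block this task until a
# one-shot Timer fires, letting other tasks run in the meantime.
mysleep(s) = wait(Timer(s))
```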

I think I’m missing something.

One Thread

$ time julia test.jl
[ Info: worker 1 executing on thread 1 
[ Info: sleeper 1 executing on thread 1 
  9.999948 seconds
[ Info: worker 1 finished. result: 2.625177517116156e7
 10.011311 seconds (110 allocations: 7.266 KiB)
[ Info: sleeper 1 finished
[ Info: worker 1 executing on thread 1 
[ Info: sleeper 2 executing on thread 1 
  9.999098 seconds[ Info: mainspin 2 executing on thread 1 

[ Info: worker 1 finished. result: 2.629067438458048e7
  4.999161 seconds
[ Info: mainspin 2 finished 
 10.006247 seconds (31.71 k allocations: 2.278 MiB, 0.58% compilation time)
[ Info: sleeper 2 finished

real	0m40.593s
user	0m25.567s
sys	0m0.249s

Eight Threads

[ Info: worker 1 executing on thread 5 
[ Info: sleeper 1 executing on thread 2 
  9.999632 seconds (62.01 k allocations: 4.402 MiB)
[ Info: worker 1 finished. result: 2.602267984094845e7
 10.001198 seconds (318 allocations: 18.039 KiB, 0.16% compilation time)
[ Info: sleeper 1 finished
[ Info: worker 1 executing on thread 2 
[ Info: sleeper 2 executing on thread 8 
[ Info: mainspin 2 executing on thread 1 
  9.999569 seconds (74.30 k allocations: 5.268 MiB, 1.14% compilation time)
 10.057717 seconds (31.54 k allocations: 2.268 MiB, 1.14% compilation time)
  4.999752 seconds (24 allocations: 2.383 KiB)
[ Info: worker 1 finished. result: 1.6082932726529311e7
[ Info: sleeper 2 finished
[ Info: mainspin 2 finished 

real	0m20.730s
user	0m22.084s
sys	0m3.818s

I think the OP is trying to tell me that the 1 thread case takes 44 seconds on some system, but I’m counting at most 41 seconds by stop watch.

I’m running this on one of my older systems.

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD FX(tm)-8350 Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, bdver1)
  Threads: 1 on 8 virtual cores

Edit: I got the same results with Julia 1.9.2

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD FX(tm)-8350 Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, bdver1)
  Threads: 1 on 8 virtual cores
1 Like

…and I can replicate these timings on a single thread. The code in the OP now was based on a commented edit I made to the original code; there must’ve been a change somewhere that affected the timings, but the comments weren’t updated.