Bug in sleep() function - main thread work affecting sleep duration on running tasks

I updated the OP with a version improved by @Benny, but I forgot to adapt the sleeper function to the version that actually showcases the issue (the parameters @Benny proposed happened to avoid it by coincidence).

I will not go into more detail here - please check the updated OP MWE in a few minutes.

Sorry about that.

Here is the version with the busy work replaced by Libc.systemsleep. Spoiler: it behaves the same - main-thread activity still throws off the sleep function (the exact same outcome as the OP MWE version).

using Base.Threads

@info "started with $(nthreads()) threads"
@info "interactive threadpool: $(nthreads(:interactive))"
@info "default threadpool: $(nthreads(:default))"
println(repeat("*", 40))

function mainspin(s, c; withyield=false)
    # simulates main-thread work: c rounds of Libc.systemsleep(s),
    # which blocks the OS thread without yielding to the scheduler;
    # withyield=true adds a yield() between rounds
    @info "mainspin executing on thread $(threadid()) "
    counter = 0
    while counter < c
        Libc.systemsleep(s)
        counter += 1
        if withyield
            yield()
        end
    end
end

function sleeper(s, c, id)
    # sleeps s seconds, c times; sleep() waits on a libuv timer,
    # so it depends on the event loop being serviced
    @info "sleeper $id executing on thread $(threadid()) "
    counter = 0
    @time "sleeper $id" while counter < c
        sleep(s)
        counter += 1
    end
end

function libcworker(s, c, id)
    # control worker: blocks the thread with Libc.systemsleep(s) and
    # yields once per round, mimicking the sleeper's yield pattern
    # without relying on a timer
    @info "worker $id executing on thread $(threadid()) "
    counter = 0
    @time "worker $id" while counter < c
        Libc.systemsleep(s)
        counter += 1
        yield()
    end
end

begin
    # phase 1
    @info "phase 1: control run - no main thread work"

    # yields 10 times (total time ~10 seconds)
    task1 = @spawn libcworker(1.02, 10, 1) # 1.02s instead of 1s so the two @time printouts don't overlap
    # yields 10 times (total time ~10 seconds)
    task11 = @spawn sleeper(1, 10, 1)

    wait(task1)
    wait(task11)

    # we waited for task1 and task11 to finish

    # phase 2
    println(repeat("*", 40))
    @info "phase 2: main thread work without yield"

    # yields 10 times (total time ~10 seconds)
    task2 = @spawn libcworker(1, 10, 2)

    # yields 10 times (total time ~10 seconds)
    task22 = @spawn sleeper(1, 10, 2)

    # we wait 5 seconds before starting to spin
    # the main thread
    sleep(5)

    # now we start spinning 
    # while task2 and task22 are already running
    mainspin(1, 7)


    # main thread work does not impact the real worker
    # total time ~10 seconds
    wait(task2)

    # main thread work heavily impacting the sleeper
    # total time >15 seconds
    wait(task22)

    # phase 3
    println(repeat("*", 40))
    @info "phase 3: main thread work with yield"

    # yields 10 times (total time ~10 seconds)
    task3 = @spawn libcworker(1, 10, 3)

    # yields 10 times (total time ~10 seconds)
    task33 = @spawn sleeper(1, 10, 3)

    # we wait 5 seconds before starting to spin
    # the main thread
    sleep(5)

    # now we start spinning with yield
    # while task3 and task33 are running
    mainspin(1, 7, withyield=true)


    # main thread work does not impact 
    # the real worker - total time ~10 seconds
    wait(task3)

    # main thread work (with yield) still impacting the sleeper
    # but we consistently get better times (total time ~13 seconds)
    wait(task33)
end
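
For reference, the thread counts in the output below imply a session started with both threadpools populated; assuming the MWE is saved as mwe.jl (a hypothetical filename), the launch command would have looked something like this (Julia 1.9+ syntax, counts inferred from the output):

julia -t 12,6 mwe.jl   # 12 default threads + 6 interactive threads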

Outcome with Libc.systemsleep:

[ Info: started with 12 threads
[ Info: interactive threadpool: 6
[ Info: default threadpool: 12
****************************************
[ Info: phase 1: control run - no main thread work
[ Info: worker 1 executing on thread 7 
[ Info: sleeper 1 executing on thread 8 
sleeper 1: 10.020645 seconds (55 allocations: 1.562 KiB)
worker 1: 10.203365 seconds (7.26 k allocations: 535.814 KiB, 0.38% compilation time)
****************************************
[ Info: phase 2: main thread work without yield
[ Info: sleeper 2 executing on thread 16 
[ Info: worker 2 executing on thread 9 
[ Info: mainspin executing on thread 1 
worker 2: 10.004138 seconds (12.28 k allocations: 864.277 KiB, 0.22% compilation time)
sleeper 2: 16.967642 seconds (12.34 k allocations: 867.816 KiB, 0.13% compilation time)
****************************************
[ Info: phase 3: main thread work with yield
[ Info: worker 3 executing on thread 7 
[ Info: sleeper 3 executing on thread 9 
[ Info: mainspin executing on thread 1 
worker 3: 10.003031 seconds (37.10 k allocations: 2.621 MiB, 0.07% compilation time)
sleeper 3: 14.002402 seconds (2.67 k allocations: 249.821 KiB, 0.05% compilation time)

@mkitti, I updated the MWE and improved the printing - to make it easier for you to evaluate what is happening.

Also, please check out the Libc.systemsleep MWE that I added as a reply to @Benny.

Note: please be aware that the realworker is just a control - it yields the same way the sleeper does - to make it easier to argue that the sleep function's behavior depends on main-thread activity. But it seems that sleep is just a symptom: the underlying root cause also affects HTTP.jl and who knows how many other packages.

Some might say this is not such a big deal - because people do not run non-trivial computation on the main thread (or those who do are not spawning tasks)… Well, unless they do. Even a 1-second computation on the main thread can cause a compounded delay across many other tasks running on different threads (the more resources you have, the larger the potential impact).
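To make that concrete, here is a minimal sketch of the effect (my own illustration, not one of the experiments above; it assumes Julia was started with multiple threads, and the timing note reflects what affected versions produce):

using Base.Threads

# a task that should sleep for ~1 second...
t = @spawn @time "spawned sleep(1)" sleep(1)
# ...while the main thread blocks for 2 seconds without yielding,
# so the libuv event loop (which fires the wakeup timer) is not serviced
Libc.systemsleep(2)
wait(t)   # on affected versions this reports ~2 seconds, not ~1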

If you follow the discussion and look at the other experiments I created, you'll see that even if the main thread is free, you can still run into problems when spawning many tasks (as HTTP.jl does when dealing with concurrent connections).
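The load pattern for that "many tasks" scenario looks roughly like this (a hedged sketch of the shape of the problem, not an experiment from this thread; how much the sleeps drift depends on Julia version and thread configuration):

using Base.Threads

# many concurrent tasks, each mixing non-yielding blocking work with a
# timer-based sleep(1), while the main thread just sits in wait()
tasks = map(1:100) do i
    @spawn begin
        Libc.systemsleep(0.1)             # blocking work, never yields
        @time "task $i sleep(1)" sleep(1) # should report ~1 second each
    end
end
foreach(wait, tasks)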

Here it is:

[FIXED / merged] And while building the MWEs for the current issue, I did enough digging to find that the @time macro is also misbehaving:


I am keeping this thread as a journal while the issue is still open.

The current (public) status: the issue is labeled multithreading, and there seems to be some awareness of the root cause:

Julia uses a libuv based event-loop under the hood. Processing certain things like Timers/IO depend on the event-loop being run regularly.
(sleep uses a libuv timer under the hood.)

Looking at jl_process_events it seems like the event loop is only run from tid == 0 or when _threadedregion is set.

jl_enter_threaded_region is only called from threading_run (which is the base function for @threads).

I am unsure why we still have this mechanism instead of allowing any thread to run the libuv event loop.
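
For context on the timer part: conceptually, sleep is just a wait on a libuv-backed Timer. A simplified sketch of the idea - not the verbatim Base code:

# sleep(s) boils down to waiting on a Timer; the wakeup only happens
# when some thread actually services the libuv event loop
function sleep_sketch(sec::Real)
    sec >= 0 || throw(ArgumentError("cannot sleep for $sec seconds"))
    wait(Timer(sec))
    return nothing
end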

Not sure if there are internal/private discussions concerning this or how these issues are prioritized. I just hope this doesn't get buried under thousands of open issues.

I am not saying this is one of the most urgent/important issues, but it is making a dent in the performance of any Julia package/program that makes serious use of the async/multithreading model (wherever Timer/IO is involved).


Update

The following PR may turn out to be the answer to the bug report generated by this topic.

This is slightly speculative, but it fixes #50643 which is a start. But it hasn't been tested under heavy contention of the uv lock.
It also currently uses the normal mutex but we probably need to switch to a lock that has some form of fairness (maybe #50878).

Remove in threaded region and add a thread that runs the UV loop by gbaraldi Ā· Pull Request #50880 Ā· JuliaLang/julia (github.com)

I didn't have time to test if the above PR changes things - but I'll do that soon and leave an update here.

Also, here is a (new) related topic that touches on the issue raised here.


Update - July 2024

I revisited this today (after encountering the issue again in a real-world context).

Julia 1.10.4 still has the issue.
