Julia seems an order of magnitude slower than Python when printing to the terminal, because of an issue with "sleep"

I can agree with this statement. But a sleep function that works for low delays (a few ms) with reasonable accuracy is useful in many contexts, in my case particularly in the fields of simulation and control. If I want to toggle a pin on a Raspberry Pi for 3 ms, I can easily do this with Python, but not with Julia. I find this an annoying and unnecessary limitation of Julia. Luckily there are work-arounds: you can always call C functions directly from Julia, but this is not beginner friendly.

3 Likes

There is Libc.systemsleep.

Well, it is blocking… But otherwise not too bad:

julia> @btime sleep(0.0005)
  1.260 ms (5 allocations: 144 bytes)

julia> @btime Libc.systemsleep(0.0005)
  508.423 μs (0 allocations: 0 bytes)
0

But why can’t the sleep function call systemsleep for the part of the argument that is below 1ms automatically?

Not user friendly.
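Concretely, the hybrid being asked for might look something like this (just a sketch; `hybrid_sleep` is my own name, and the sub-millisecond part inherits the blocking behavior of systemsleep):

```julia
# Sketch of the suggested hybrid (hypothetical name `hybrid_sleep`):
# cooperative `sleep` for the whole-millisecond part,
# `Libc.systemsleep` for the sub-millisecond remainder.
function hybrid_sleep(t::Real)
    ms_part = floor(t * 1000) / 1000       # whole milliseconds
    ms_part > 0 && sleep(ms_part)          # yields to other Julia tasks
    remainder = t - ms_part
    remainder > 0 && Libc.systemsleep(remainder)  # blocks the OS thread
end
```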

Because systemsleep is doing a fundamentally different thing — unlike sleep, it doesn’t yield control to other Julia tasks on the same thread.

1 Like

But who cares? I mean, who is using Julia's green threads in the first place? Not many people.

Asynchronous I/O is pretty common.

4 Likes

That is super easy & reliable to do. As I mentioned, julia has millisecond resolution on its Timer, provided by libuv:

julia> t = Timer(0, interval=0.003) do _
           @show "Trigger every 3ms"
       end

This will accumulate timer skew, but will trigger roughly every 3ms. It’s equivalent to just waiting x milliseconds at the end of a loop. You can go as low as every 1ms, but with those kinds of times you really want to keep your callback extremely short and not do much more than push some message in e.g. a Channel to have some other background task deal with it and then start a new timer with a correctly calculated “new” offset.
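The pattern from the previous paragraph (keep the callback minimal, push a message into a Channel, let a background task do the real work) can be sketched like this; all names here are mine:

```julia
# Sketch of the Timer-plus-Channel pattern described above (names are mine):
# the callback only enqueues a tick; a background task does the real work.
ticks = Channel{Float64}(32)
received = Ref(0)

worker = @async for t in ticks
    received[] += 1          # stand-in for the real work
end

timer = Timer(0.0; interval = 0.003) do _
    put!(ticks, time())      # keep the callback as short as possible
end

sleep(0.02)                  # let a few ticks arrive
close(timer)
sleep(0.01)                  # allow any in-flight callback to finish
close(ticks)
wait(worker)
```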

In fact, earlier today I ran a little analysis to check out what sort of resolution python3 even offers, since it doesn’t seem to be documented anywhere (in contrast to julia!), or at least I couldn’t find it in the time.sleep or asyncio.sleep docs. Additionally, the docs of both only talk about “seconds”, with no concern given to higher precision (other than time.sleep mentioning “you can pass in floats for higher precision”, without telling you how far you can go).

With the following script:

time.py
import asyncio
import timeit
import sys
import time

async def func(arg):
    await asyncio.sleep(arg)

f = float(sys.argv[1])

def main():
    asyncio.run(func(f))

def main_sleep():
    time.sleep(f)

runs = 1000
print("Target: ", sys.argv[1])
resmain = (timeit.timeit(main, number=runs) / runs)
ressleep = (timeit.timeit(main_sleep, number=runs) / runs)
print("Sleep async: %.10f" % resmain)
print("Sleep time: %.10f" % ressleep)
abssync = abs(f - resmain)
abstime = abs(f - ressleep)
print("Err abs async: %.10f" % abssync)
print("Err abs time: %.10f" % abstime)
print("Relative error (closer to 1.0 is better)")
print("async: x%.2f" % (resmain / f))
print("time: x%.2f" % (ressleep / f))

we can investigate just how much timer skew & relative error crops up and when it happens. I tested both time.sleep as well as asyncio.sleep, since I felt that testing a single-threaded, single core, non-green thread system (time.sleep) against a green thread system (julia) wasn’t really an apples to apples comparison. The results? Well check them yourself:

Python timing results
$ python3 time.py 0.01
Target:  0.01
Sleep async: 0.0117151672
Sleep time: 0.0104052220
Err abs async: 0.0017151672
Err abs time: 0.0004052220
Relative error (closer to 1.0 is better)
async: x1.17 # 10% error so far, seems good
time: x1.04 # 4% error seems negligible & due to non-realtime guarantees of my kernel/hardware

$ python3 time.py 0.001
Target:  0.001
Sleep async: 0.0017855162
Sleep time: 0.0012482233
Err abs async: 0.0007855162
Err abs time: 0.0002482233
Relative error (closer to 1.0 is better)
async: x1.79  # what's happening here? 80% error?!
time: x1.25 # uh-oh - 25% error at the resolution julia already guarantees?

$ python3 time.py 0.0001
Target:  0.0001
Sleep async: 0.0020051429
Sleep time: 0.0001590123
Err abs async: 0.0019051429
Err abs time: 0.0000590123
Relative error (closer to 1.0 is better)
async: x20.05 # well that isn't good..
time: x1.59  # slowly accumulating more error

$ python3 time.py 0.00001
Target:  0.00001
Sleep async: 0.0007320734
Sleep time: 0.0000666075
Err abs async: 0.0007220734
Err abs time: 0.0000566075
Relative error (closer to 1.0 is better)
async: x73.21 # asyncio just seems to give up
time: x6.66 # woah! why are we suddenly 6 times slower! I thought we'd keep our precision in 10µs requests

$ python3 time.py 0.000001
Target:  0.000001
Sleep async: 0.0001497367
Sleep time: 0.0000538176
Err abs async: 0.0001487367
Err abs time: 0.0000528176
Relative error (closer to 1.0 is better)
async: x149.74 
time: x53.82 # yeah, this isn't realtime either

And yes, my machine does say that I have a nanosecond precision clock available for querying. Personally, I prefer a system with documented guarantees & failure modes to one that just does something in the hopes of being close to right. At least in julia you can very easily do Libc.systemsleep or, with minimal structs to define, call nanosleep yourself:

julia> struct TimeSpec
           tv_sec::Clong   # C's time_t (a long on 64-bit Linux); Cint here would misalign the struct
           tv_nsec::Clong
       end

julia> @ccall nanosleep(Ref(TimeSpec(0, 5))::Ptr{TimeSpec}, C_NULL::Ptr{TimeSpec})::Cint

If you actually want to use this when not at the REPL, you will have to interact with the task system to make sure there is only your specific task running on your specific OS-thread, so you’ll have to spawn your task sticky on a specific thread. The only cost you’re still eating is FFI, which in properly compiled julia code shouldn’t be any more than in any other C program that links this dynamically.

What’s with the hyperbole? Just because you don’t use it doesn’t mean no one does. The python community itself didn’t have proper multithreading and exclusively used green threads for the longest time.

If I recall correctly, even the threaded interface from Base.Threads uses green-thread Tasks. They’re just pinned & stickied to an OS thread to prevent migration & data shuffling. So as a matter of fact, I’d wager almost everyone doing multithreading in julia is using the green-threads feature, even if they don’t notice it.

You’re really not making a good case for your argument here.

2 Likes

The Julia REPL uses two Tasks — one for the frontend and one for evaluating user code. So pretty much everyone is using these, whether they know it or not.

@kdheepak by the way, here’s an example of what happens when Julia’s IO layer is bypassed:

A normal read from stdin can be cleanly interrupted with ^C:

julia> read!(stdin, zeros(UInt8, 100))
^C

ERROR: InterruptException:
...

But a direct read() from the stdin file descriptor can only be interrupted unsafely with multiple presses of ^C:

julia> @ccall read(0::Cint, zeros(UInt8,100)::Ptr{Cvoid}, 100::Csize_t)::Cssize_t
^C^C^C^C^C^CWARNING: Force throwing a SIGINT
ERROR: InterruptException:
...
2 Likes

I hopped through a few more forums because it started to really bug me why a system would have multiple clocks with different tick rates. It made more sense to me for everything to run on one clock with a reasonably high rate for good time resolution. And I still don’t really know why, but I have a few more thoughts now:

  1. On a hardware level, a higher tick rate means the processor has to do work (a timer interrupt) more frequently. I think the hardware clocks are already doing this and we can’t do much about it, but tying the software to a higher-rate clock will add even more overhead. I don’t know exactly how much more, but at some point you’d rather the computer spend more of its time and energy on your program than on keeping time. This is somewhat reminiscent of latency-throughput tradeoffs in the topic of garbage collectors.

  2. Something specific that bugged me was that Python’s time module has two monotonic, system-wide (counts sleep) timer functions: time.monotonic and time.perf_counter, the only apparent difference being that the latter has a higher resolution (depending on the hardware). The name perf_counter suggests it’s intended for measuring code performance, which would need high resolution over small time periods. PEP418, which introduces these timers, suggests that higher resolution clocks drift more, drift meaning an accumulating deviation from true time:

Different clocks have different characteristics; for example, a clock with nanosecond precision may start to drift after a few minutes, while a less precise clock remained accurate for days.

  3. Back to Julia: the resolution of performance-measuring tools like @time or @btime indicates that higher time resolution is possible. But the asynchronous Base.sleep doesn’t seem like a good place for it. The number you pass in isn’t a hard guarantee, because a task that is done sleeping cannot interrupt running tasks; it has to wait its turn in the task queue. If you want an accurate sleep period, you probably don’t want something that actively tries to keep the processor busy with other tasks. As an aside, async-await/coroutines recently arose in multiple languages as user-level cooperative multitasking, where you write turn-taking at specific points of your code. For true interrupts, you need preemptive multitasking, which almost all current operating systems opt for. A scheduler automatically decides the turn-taking, so ordering is much less predictable and your control is much more indirect. You also have to start worrying about your subroutines being reentrant, or as my inexperience interpreted it: “can your code be interrupted at any point, outside your control, by your other code and still work as expected?”
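The “wait its turn” behavior is easy to demonstrate (a sketch; timings are approximate): a task whose sleep has expired cannot preempt a task on the same thread that never yields.

```julia
# Demonstrates that a task whose sleep has expired must wait its turn:
# the sleeper is scheduled on the current thread via @async, but the main
# task busy-waits without yielding, so the sleeper only wakes afterwards.
woke_at = Ref(0.0)
start = time()
sleeper = @async (sleep(0.001); woke_at[] = time())
while time() - start < 0.05
    # busy-wait; never yields to the scheduler
end
wait(sleeper)
elapsed = woke_at[] - start   # ~0.05 s, not the requested 0.001 s
```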
3 Likes

I haven’t checked python’s source, but I’d wager these map to CLOCK_MONOTONIC and CLOCK_REALTIME on Linux respectively. There are a number of different clocks exposed; e.g. these exist on my machine in /usr/include/linux/time.h:

/*
 * The IDs of the various system clocks (for POSIX.1b interval timers):
 */
#define CLOCK_REALTIME          0
#define CLOCK_MONOTONIC         1
#define CLOCK_PROCESS_CPUTIME_ID    2
#define CLOCK_THREAD_CPUTIME_ID     3
#define CLOCK_MONOTONIC_RAW     4
#define CLOCK_REALTIME_COARSE       5
#define CLOCK_MONOTONIC_COARSE      6
#define CLOCK_BOOTTIME          7
#define CLOCK_REALTIME_ALARM        8
#define CLOCK_BOOTTIME_ALARM        9

each with different meanings & intended guarantees, though most of them seem to just return nanosecond resolution when queried with clock_getres.

As long as the sleep function of Julia does not get improved, at least its docstring should contain a reference to Base.Libc.systemsleep with a hint to use it if a resolution higher than 1 ms is needed.

8 Likes

With all the hardware variability, I think someone who understands all this stuff should make a whole package dedicated to timing and (synchronous) sleeping. Looking at libc.jl, Base.Libc.systemsleep is conditionally defined to ccall either usleep for Unix systems or Sleep for Windows, and there’s no quick function for giving details on the timer used. It’d be cool if we could get finer details on the clocks on our own OS and computer, and the more portable functions would pick the best available (and we can find out which).
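As a starting point, on Linux you can already query a clock’s resolution yourself with a small @ccall (a sketch; the struct mirrors C’s struct timespec, and the clock ID matches the linux/time.h listing quoted above):

```julia
# Sketch (Linux-only): query the resolution of CLOCK_MONOTONIC via clock_getres.
# CTimeSpec mirrors C's `struct timespec` on 64-bit Linux.
struct CTimeSpec
    tv_sec::Clong
    tv_nsec::Clong
end

const CLOCK_MONOTONIC = Cint(1)   # ID from linux/time.h

res = Ref(CTimeSpec(0, 0))
ret = @ccall clock_getres(CLOCK_MONOTONIC::Cint, res::Ptr{CTimeSpec})::Cint
ret == 0 && println("CLOCK_MONOTONIC resolution: ", res[].tv_nsec, " ns")
```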

Also, I keep reading that usleep is deprecated somewhere? Way too many unfamiliar things for me to really understand.

1 Like

Sounds like an easy PR for anyone with a GitHub account and an interest in the matter. The docstring is at https://github.com/JuliaLang/julia/blob/master/base/asyncevent.jl#L224 and can be edited directly in the web GUI, which creates a PR with minimal effort. Click the pencil icon at the upper right corner of the file to get started editing.

6 Likes

Sorry to revive this, but I found this thread on google and wanted to share my workaround that is both (1) non-blocking and (2) accurate:

function nonblocking_systemsleep(t)
    task = Threads.@spawn Libc.systemsleep($t)
    yield()
    fetch(task)
end

This is non-blocking because of the yield and since it uses Libc.systemsleep it’s more accurate than the built-in sleep:

julia> @btime nonblocking_systemsleep(1e-4);
  119.042 μs (5 allocations: 496 bytes)

julia> @btime sleep(1e-4)
  1.154 ms (4 allocations: 112 bytes)

Unfortunately it has some allocations and still isn’t 100% accurate (and doesn’t immediately regain control when the sleep ends). But for me the 1 ms sleep in the head worker was slowing things down, so this workaround was very much worth it.

6 Likes

This is great thanks!

1 Like

Just curious, does anybody know if there are any potential issues with this nonblocking_systemsleep? I am about to switch my package to it.

Specific questions:

  1. If --threads=1, would this end up blocking anything?
  2. Would @async be better than Threads.@spawn?

I actually got an overall 30% improvement (!) in speed from making this change. It just means the head node can sleep at shorter intervals between checking the workers. And evidently it turns out that sleeping for 1 millisecond rather than my requested 1 microsecond was bottlenecking things.

I think Julia Base may want to merge something like this, maybe as a keyword like sleep(1e-4, system=true). Or just switch to it as the default.

Yes, since systemsleep puts the OS-thread to sleep, i.e. no other code can run on the thread that executes this task while that systemsleep is blocking execution. This is also mentioned in the docstring:

help?> Libc.systemsleep
  systemsleep(s::Real)

  Suspends execution for s seconds. This function does not yield to Julia's scheduler and
  therefore blocks the Julia thread that it is running on for the duration of the sleep
  time.

No, because @async can cause tasks to be pinned to the current thread, possibly making contention of that thread much worse & subsequently limiting parallelism.

I’ve had some trouble with sleep being inaccurate too, but I nevertheless wouldn’t want the behavior of systemsleep to be the default. I’d much rather have the scheduler be more flexible & accurate.

Note also that the minimum time for sleep is 1e-3, i.e. 0.001 seconds (again, see the docstring of sleep), while you requested 1e-4. So it shouldn’t come as a surprise that sleep didn’t wake up before that. If you have tighter requirements than that, there’s nothing wrong with a small busy-loop like this:

function busy_sleep(duration::Real)
    duration <= 0.001 || throw(ArgumentError("Duration must be at most 0.001 - use `sleep` for longer durations to allow other tasks to execute while sleeping!"))

    t = time()
    while (time() - t) < duration
        # do nothing
    end
end

julia> @time busy_sleep(1e-4)
  0.000101 seconds

Of course, this comes at the cost of not allowing other things to run on the thread executing busy_sleep either, just like systemsleep.

2 Likes

you can also use the function sleep_ms() of my package GitHub - ufechner7/Timers.jl: Timers for Julia

1 Like

Just to check, this behavior doesn’t occur if it’s inside a Threads.@spawn though, right? (Which is the reason for nonblocking_systemsleep defined above). I guess the thing I am asking is whether you actually need a second thread or not for this to be nonblocking (my guess is: yes).

This is precisely what I want to avoid. Libc.systemsleep is nice since it can get down to ~10 microseconds and, if put into @spawn, it doesn’t block the main thread! Even if it blocks its own thread, it doesn’t use the CPU (unlike a busy loop).

Maybe what I can do is check Threads.nthreads() and if it’s greater than 1, I use the Threads.@spawn Libc.systemsleep trick; otherwise, I use the regular sleep. Like:

const USE_SYSTEMSLEEP = Threads.nthreads() > 1

function systemsleep(dt::Number)
    if USE_SYSTEMSLEEP
        task = Threads.@spawn Libc.systemsleep(dt)
        yield()
        fetch(task)
    else
        sleep(dt)
    end
end

So users running with threads = 1 would still have the sleep bottleneck, but users with threads > 1 can take advantage of the lightweight Libc version.

1 Like

That depends on whether the task created by @spawn is scheduled on a different thread or not. No matter what, the thread actually executing the task will be blocked & doesn’t participate in scheduling activity. It won’t be available for running other tasks.

That depends on what exactly you mean with “nonblocking”. Inherently, systemsleep is blocking, since it prevents the thread currently executing from doing other more useful work.

I’m not sure why exactly you want to avoid a busy loop here - in terms of available computational power for your actual work, the two are equivalent. Whether the OS is available to schedule other programs (as would happen with systemsleep, taking up that “unused” CPU time) or you are not relinquishing CPU & busy waiting, the effect is the same - your productive computation doesn’t run on that thread at all in either scenario.

I’m not sure that’s going to work - with thread adoption, Threads.nthreads() is not a constant, so storing it globally won’t really have the desired effect.
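Under that constraint, a minimal adjustment to the function quoted earlier would be to check at call time instead (a sketch; `adaptive_systemsleep` is my own name):

```julia
# Sketch: decide at call time rather than via a module-load-time constant,
# since Threads.nthreads() can change once foreign threads are adopted.
function adaptive_systemsleep(dt::Real)
    if Threads.nthreads() > 1
        task = Threads.@spawn Libc.systemsleep($dt)
        yield()
        fetch(task)
    else
        sleep(dt)   # cooperative fallback on a single thread
    end
end
```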

The Libc version isn’t necessarily lightweight either. On my machine (Linux), this ends up calling usleep, which also doesn’t guarantee any upper bounds on execution time:

DESCRIPTION
       The  usleep() function suspends execution of the calling thread for (at least) usec
       microseconds.  The sleep may be lengthened slightly by any system  activity  or  by
       the time spent processing the call or by the granularity of system timers.

I haven’t looked at the implementation on my machine, but I’d be very surprised if it didn’t busy-wait for small enough durations.

On Windows, Libc.systemsleep ends up calling the Sleep function from Syncapi.h, which also takes at least 1ms, just like our regular sleep. The big downside of systemsleep compared to sleep on windows is of course that the former doesn’t allow other julia tasks to run, while the latter does.


Under the hood, pretty much all OS/Julia/Programming language level implementations of sleep rely on some form of scheduling & time slicing for their “sleeping”. The minimum time available for sleeping is dictated by how small that time slice can be - on Linux, this is distribution dependent, but usually somewhere in the microsecond range IIRC. On windows, this is usually a millisecond.

As a consequence, if you want to go below that you generally have to roll your own, ending up with busy waiting at the smallest level.

2 Likes