Julia seems an order of magnitude slower than Python when printing to the terminal, because of an issue with "sleep"

Hi all,

I’m running into a couple of issues and a bottleneck when printing characters to the screen in Julia. I hit this while trying to add some new features to TerminalUserInterfaces.jl: https://github.com/kdheepak/TerminalUserInterfaces.jl

I’m trying to recreate the following Python code in Julia:

import os
import time


def main():
    delay = 1.0 / 1200
    while True:
        byte = os.read(0, 1)
        if not byte:
            break
        os.write(1, byte)
        time.sleep(delay)


if __name__ == "__main__":
    main()

This is code from a PyCon keynote talk by Brandon Rhodes: Keynote by Brandon Rhodes - YouTube

The way the speaker runs this code is by using the following:

python animation.py < animation.txt

Here’s the animation.txt file if someone is interested in reproducing the animation:
https://gist.github.com/kdheepak/ff6b124a8511537f4266142c0a10bb52#file-animation-txt

Firstly, I’m not able to write the code in Julia the way the speaker did. When I run julia animation.jl < animation.txt, it appears that bytesavailable(stdin) is never greater than zero (?). I’m not sure what I’m doing wrong here. I’ve even tried some stdlib functions, but those error for some reason:

# import REPL
# term = REPL.Terminals.TTYTerminal("xterm", stdin, stdout, stderr)
# REPL.Terminals.raw!(term, true) # This line and the following line throws an error
# Base.start_reading(stdin)  

function main()
    delay = 1.0 / 1200
    while bytesavailable(stdin) > 0
        b = UInt8[0]
        readbytes!(stdin, b)
        write(stdout, b[1])
        sleep(delay)
    end
end

main()

The first few lines are based on this stack overflow answer:

https://stackoverflow.com/a/60956065/5451769

These are the errors I get when I uncomment the first 3 lines or the 4th line:

ERROR: LoadError: MethodError: no method matching check_open(::IOStream)
Closest candidates are:
  check_open(::Union{Base.LibuvServer, Base.LibuvStream}) at ~/Applications/Julia-nightly.app/Contents/Resources/julia/share/julia/base/stream.jl:386
  check_open(::Base.Filesystem.File) at ~/Applications/Julia-nightly.app/Contents/Resources/julia/share/julia/base/filesystem.jl:114
Stacktrace:
 [1] raw!(t::REPL.Terminals.TTYTerminal, raw::Bool)
   @ REPL.Terminals ~/Applications/Julia-nightly.app/Contents/Resources/julia/share/julia/stdlib/v1.9/REPL/src/Terminals.jl:138
 [2] top-level scope
ERROR: LoadError: MethodError: no method matching start_reading(::IOStream)
Closest candidates are:
  start_reading(::Base.BufferStream) at ~/Applications/Julia-nightly.app/Contents/Resources/julia/share/julia/base/stream.jl:1541
  start_reading(::Base.LibuvStream) at ~/Applications/Julia-nightly.app/Contents/Resources/julia/share/julia/base/stream.jl:832
Stacktrace:
 [1] top-level scope
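I suspect the underlying issue is that when stdin is redirected from a file, it’s a regular file (an IOStream in Julia), not a terminal stream, so terminal-only operations like raw mode don’t apply to it. Here’s a Python sketch of the analogous situation (my own illustration, assuming a POSIX system; termios is Unix-only):

```python
import os
import termios
import tempfile

# A redirected stdin behaves like a plain file, not a terminal, so
# terminal-specific calls (raw mode etc.) are unavailable on it.
with tempfile.TemporaryFile() as f:
    fd = f.fileno()
    is_tty = os.isatty(fd)      # False: a regular file is not a TTY
    try:
        termios.tcgetattr(fd)   # raw-mode style call; fails on non-TTYs
        raw_mode_ok = True
    except termios.error:
        raw_mode_ok = False

print(is_tty, raw_mode_ok)
```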

I’m able to get around this issue by writing the code this way:

function main()
    delay = 1.0 / 1200
    data = readavailable(stdin)
    for b in data
        write(stdout, b[1])
        sleep(delay)
    end
end

main()

This reads all the data from stdin up front and then writes it out to stdout.

The thing is that Julia is several orders of magnitude slower than Python to print out to the terminal. Here’s the timing result when commenting out the sleep functions in both Python and Julia.

python animation.py < animation.txt  0.04s user 0.03s system 80% cpu 0.082 total
julia animation.jl < animation.txt  0.70s user 0.27s system 85% cpu 1.136 total

Python finishes in 0.04 seconds, but Julia takes 0.70 seconds. This makes certain kinds of animations / widgets in the terminal untenable in Julia when I’m implementing them in TerminalUserInterfaces.jl.

I could be wrong about this, but I think this has to do with buffered vs unbuffered writes to stdout, and Julia defaulting to blocking and waiting for the write function to finish outputting text to the terminal.

I guess my question is how to get Julia to print and “return to execution” as fast as possible, i.e. without waiting for the character or byte to actually be written out to the terminal? I need to print something like this sequentially for TerminalUserInterfaces.jl so I don’t think using async, Threads, or multiprocessing would work. Any suggestions?
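To illustrate what I mean by buffered vs. unbuffered (this is a Python sketch of the general mechanism, not a claim about Julia’s actual I/O internals): a buffered writer coalesces many one-byte writes into a single underlying write call, while an unbuffered one issues a call per byte.

```python
import io

# A raw "file" that just counts how many OS-level writes it receives.
class CountingRaw(io.RawIOBase):
    def __init__(self):
        self.calls = 0
    def writable(self):
        return True
    def write(self, b):
        self.calls += 1
        return len(b)

unbuffered = CountingRaw()
for _ in range(1000):
    unbuffered.write(b"x")      # one underlying write per byte

buffered_raw = CountingRaw()
buffered = io.BufferedWriter(buffered_raw, buffer_size=8192)
for _ in range(1000):
    buffered.write(b"x")        # accumulates in the buffer, no raw write yet
buffered.flush()                # a single raw write flushes all 1000 bytes

print(unbuffered.calls, buffered_raw.calls)  # 1000 vs 1
```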

I mean, if you run a script in this way you include:

  • the startup time of Julia
  • the time Julia needs to compile your code

On my computer the startup time is:

ufechner@tuxedi:~$ time julia -e ''

real	0m0,191s
user	0m0,139s
sys	    0m0,084s

The compilation time is a bit more tricky to measure because of your I/O functions, but you could launch Julia like this:

julia --compile=min 

to find out if the compilation time is relevant at all.

Some references: JuliaLang/julia#36639, JuliaLang/julia#43176, Why is printing to a terminal slow (Discourse).

Thanks for the references. So apart from writing to a temporary IOBuffer, it appears that there’s no way around this? Maybe there’s a way I can use ccall to bypass Julia’s calls?

@ufechner7 this isn’t a precompilation issue btw. Even if I open a REPL and include the file, and call main multiple times, I get the same slow performance.

I’m surprised this doesn’t work when I run julia animation.jl < animation.txt:

function main()
    delay = 1.0 / 1200
    data = readavailable(stdin)
    for b in data
        @ccall write(0::Cint, UInt8[b[1]]::Ptr{UInt8}, 1::Cint)::Cint
        sleep(delay)
    end
end

main()

It just hangs with no output in the terminal. But this works in the REPL:

A is the character being printed and 1 is the return value.

On my system, write(2) has a different signature than the one you’re using. Also, stdout has fd=1; I’m not sure how fd=0 could ever work, as that’s stdin?

The following works fine for me

function main()
    delay = 1.0 / 1200
    data = readavailable(stdin)
    for b in data
        @ccall write(1::Cint, UInt8[b[1]]::Ptr{Cvoid}, 1::Csize_t)::Cssize_t
        sleep(delay)
    end
end

main()
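For comparison, the corrected write(2) signature, ssize_t write(int fd, const void *buf, size_t count), can be exercised from Python via ctypes as well (a sketch assuming a POSIX libc; writing to a pipe rather than fd 1 just makes the output easy to check):

```python
import ctypes
import os

# Load the C library of the current process (POSIX: on Linux/macOS,
# ctypes.CDLL(None) exposes libc symbols such as write).
libc = ctypes.CDLL(None, use_errno=True)

# ssize_t write(int fd, const void *buf, size_t count)
libc.write.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_size_t]
libc.write.restype = ctypes.c_ssize_t

r, w = os.pipe()            # a pipe stands in for the terminal's fd 1
n = libc.write(w, b"A", 1)  # same call shape as the @ccall above
os.close(w)
received = os.read(r, 10)
os.close(r)
print(n, received)
```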

btw. 0.04s vs 0.7s is not “several orders of magnitude”

it is 1 order of magnitude

“several” would be 70s or 700s or more


Thanks for the responses, that does appear to resolve my printing issue. I must have used 1 in the REPL but 0 in the script that I copy-pasted here, or typo’d while editing, or something like that.

Does it display the animation for you? For me it just prints the characters to the terminal:

Yes it works. But I needed to replace e[ from the pasted file with the escape sequence \e[

Thanks! That runs. However, that is still slower than Python and I don’t understand why. I thought using @ccall write would bypass Julia’s write and the buffering issues.

Since I always like a good deep dive into things, I’ve modified the code with 3 different versions and added some different timing code to see where the time is actually spent. I’m printing to stderr instead of stdout because it makes it easier to get the output of the timing code without some extra redirect_std* shenanigans.

The first version is just your @ccall version, the second version is the naive “just print to the stream one by one” version (except I reduced the redundant b[1] indexing - iterating over the byte array already gives you single bytes) and the third version is a @ccall version that doesn’t allocate a new UInt8[] each iteration.

Code
function main()
    delay = 1.0 / 1200
    data = readavailable(stdin)
    for b in data
        @ccall write(2::Cint, UInt8[b[1]]::Ptr{Cvoid}, 1::Csize_t)::Cssize_t
        #sleep(delay)
    end
end

function main2()
    delay = 1.0 / 1200
    data = readavailable(stdin)
    for b in data
        write(stderr, b)
        #sleep(delay)
    end
end

function main3()
    delay = 1.0 / 1200
    data = readavailable(stdin)
    slot = [0x0]
    for b in data
        slot[] = b
        @ccall write(2::Cint, slot::Ptr{Cvoid}, 1::Csize_t)::Cssize_t
        #sleep(delay)
    end
end

I invoke this file (test.jl) like the following:

time julia -L test.jl -e '@time main()' < animation.txt 2>/dev/null
time julia -L test.jl -e '@time main2()' < animation.txt 2>/dev/null
time julia -L test.jl -e '@time main3()' < animation.txt 2>/dev/null

This does nothing more than load the file into julia (-L test.jl) and run a timing expression (-e '@time main()'). I redirect stderr to /dev/null to avoid mangling the output, but you can leave that out if you want to see it happen. I get these results:

$ time julia -L test.jl -e '@time main()' < animation.txt 2>/dev/null
  0.047315 seconds (76.72 k allocations: 3.360 MiB, 74.32% compilation time)

real	0m0.351s
user	0m0.356s
sys	0m0.277s
$ time julia -L test.jl -e '@time main2()' < animation.txt 2>/dev/null
  0.028545 seconds (56.13 k allocations: 2.104 MiB, 93.53% compilation time)

real	0m0.331s
user	0m0.364s
sys	0m0.242s
$ time julia -L test.jl -e '@time main3()' < animation.txt 2>/dev/null
  0.035286 seconds (56.12 k allocations: 2.088 MiB, 74.99% compilation time)

real	0m0.350s
user	0m0.406s
sys	0m0.226s

So it seems like we have our usual ~0.3s julia startup time, followed by about ~0.03s total execution time, of which ~75%-94% is compilation of the requested function, meaning about 0.0075s time spent reading from stdin and printing it back out.

EDIT:

This is on a somewhat recent commit of the repo:

julia> versioninfo()
Julia Version 1.9.0-DEV.171
Commit 842c27fad3 (2022-03-11 06:44 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores

This timing thing makes me want to have a “% allocations in compilation” blurb from @time though… As the code is right now, we’d have to compile the function ahead of time to figure out how much is allocated while running/reading from stdin directly (or modify the code to not read from stdin directly).


Thanks for looking into this @Sukera, I always enjoy your deep dives into the performance and benchmarking of code (especially from AoC).

The same code in Python runs faster than the numbers you shared, and I want to figure out how to make the Julia code run as fast:

$ python animation.py < animation.txt 2> /dev/null  
0.04s user 0.03s system 79% cpu 0.084 total

I think like @fredrikekre posted, this has to do with buffered and unbuffered outputs / number of sys calls being made when write is being called.

Edit: I didn’t read your numbers correctly. I think when you redirect to /dev/null, it ends up being a lot faster? Maybe not printing to stdout avoids this issue.

Yes it certainly should bypass Julia’s IO layer. There will be some unwanted consequences from invoking write() directly — I think the task waiting on the write() will block but won’t yield correctly.

It doesn’t have anything to do with buffering. The timings I’ve shown above clearly show that the ~0.3s startup time and the compilation time make up the difference, and are why it’s slower at all. If you take just the time spent running the function that actually does the work (i.e., if you time properly), you get 0.0075s at worst on my machine.

Please note the extra output above - the 0.04 figure is the result of the @time call, timing the function itself. The difference between that and the user time is the julia startup time, which is independent of the code you run.

Yes it does, but the I/O layer is not that slow here. It’s not a bottleneck.

No, that’s not an issue, at all. If I just print to stdout like you did I get only slightly slower timings, but still faster than python for printing itself:

$ time julia -L test.jl -e '@time main()' < animation.txt
[...] # animation output
  0.071099 seconds (76.72 k allocations: 3.360 MiB, 54.37% compilation time)

real	0m0.370s
user	0m0.395s
sys	0m0.257s

$ time julia -L test.jl -e '@time main2()' < animation.txt
[...] # animation output
  0.116960 seconds (102.97 k allocations: 3.230 MiB, 23.82% compilation time)

real	0m0.417s
user	0m0.431s
sys	0m0.265s

$ time julia -L test.jl -e '@time main3()' < animation.txt
[...] # animation output
  0.053860 seconds (56.12 k allocations: 2.088 MiB, 50.68% compilation time)

real	0m0.368s
user	0m0.374s
sys	0m0.274s

The biggest bottleneck is (as usual for these microbenchmarks of shell redirection) julia startup time, NOT execution/syscall time.

The only problem I can per se see is the spike in latency/allocations in main2(), i.e. the “julia native” version, which should already be tracked by this issue:

which is non-trivial to fix.


@Sukera

This is what I’m seeing:

The top right is Julia and the bottom right is Python. You can see that for me, Python proceeds through the animation waayy faster than Julia.

Are you saying that for you Julia is printing the animation faster than Python is printing the animation?

Is this an operating system / terminal issue maybe?

No, that’s not what I’m saying. I’m saying that the printing itself is not slower. It’s the 0.3s julia startup time that make up the difference, which is a long known & well improved problem.

No, it isn’t. As I said above, it’s due to startup time, code loading & compilation. Python is extremely well optimized for that specific use case of terminal-based scripting: it’s more than 20 years old by now and has been used for terminal scripting for a very long time. It’s also interpreted, so there isn’t even any compilation going on.

Wow, you are right. Julia is faster than Python for me too when I remove the sleep. But when I add the sleep function back in, Julia is slower. I guess my previous understanding was all wrong but I don’t understand why adding the same sleep in Julia and in Python would result in slower Julia code?

This is just a guess (meaning I haven’t tested it), but it’s very possible that the difference with sleep is due to regular sleep yielding to the julia task scheduler internally. My guess is that time.sleep in python doesn’t do/have anything similar to that. It’s also a little tricky to just swap in a @ccall version since the sleep function in C takes at least a second and nanosleep has a different API.

Another difference may be that julia’s sleep has a minimum sleep time of 0.001 seconds i.e. a millisecond - the python docs don’t say anything about a minimum sleep time for time.sleep (or I couldn’t find it). Your 1.0 / 1200 = 0.0008333... is less than that, so it gets rounded up.
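That granularity difference is easy to probe from Python; the sketch below (my own, treat the numbers as indicative since per-sleep overhead depends on the OS scheduler) times time.sleep at the animation’s 1/1200 s delay. Note that rounding 0.833 ms up to 1 ms would by itself slow the animation by roughly 20% per byte.

```python
import time

delay = 1.0 / 1200  # ~0.833 ms per byte, below Julia's 1 ms sleep floor

n = 100
t0 = time.perf_counter()
for _ in range(n):
    time.sleep(delay)
per_sleep = (time.perf_counter() - t0) / n

# time.sleep aims for at least the requested duration; any excess
# is scheduler/timer overhead on top of the nominal delay.
print(f"nominal {delay * 1e3:.3f} ms, actual {per_sleep * 1e3:.3f} ms")
```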

Yeah, I’m just pointing out that there’s some subtle and problematic downsides to bypassing Julia’s IO layer. This isn’t a problem for small test scripts. But it could be quite a problem for a larger piece of code if there’s other tasks going on concurrently.
