Standard streams much slower in Windows than in Linux

While testing Gaston in Windows, I discovered that it was running much slower than in Linux. A simple plot in Windows could take a few seconds, while in Linux it is basically instantaneous. Some tests seemed to point to Julia <-> gnuplot communication, which runs over the standard streams stdin, stdout, and stderr.

I’d like to ask for help figuring out if this is inherent to Windows, if I’m doing something wrong, or if this something that can be improved in Julia.

To test, I wrote a simple “server” that reads one character from its stdin, and outputs responses through its stdout and stderr. I also wrote a “client” that writes to the server and reads the response, 1000 times. Then I used BenchmarkTools to compare the runtime in Windows and in Linux.

The server code is here and the client is here – the code is a bit longer than I feel comfortable inlining in this post, and I don’t see how to make it shorter.

To measure the code’s performance in Linux, I run it like this:

julia> using BenchmarkTools
julia> include("client.jl");
julia> pin, pout, perr = init();
julia> @btime runtest($pin, $pout, $perr)
  126.452 ms (86250 allocations: 5.97 MiB)

When I run the exact same code on Windows, I get:

julia> @btime runtest($pin, $pout, $perr)
  6.003 s (96354 allocations: 7.47 MiB)

or roughly 50x slower.

I’m using Arch Linux in a virtual machine:

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202* (2019-01-21 21:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

I’m also using Windows 10, which hosts the Linux VM:

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

One more data point: I modified the client code to keep track of how many times the server’s streams are read, or how many times this loop is executed.

The client sends 1000 requests to the server, which in turn generates 2000 replies. Each reply ends with \n. In Linux, the loop is executed exactly 2000 times. In windows, it is executed around 2900 times.

While this might not completely explain the time difference, it indicates that, in Windows, multiple calls to readavailable are sometimes needed in order to fully read a stream that contains around 30 characters.

In case anybody is following along… I streamlined the test a little bit and profiled it. You need a file server.jl with these contents:

function server()
    while true
        s = String(readavailable(stdin))
        s[1] == 'q' && break
        println(stdout,"[Server via stdout]: received $(s[1])")
        println(stderr,"[Server via stderr]: received $(s[1])")

Then, run this code:

using Profile

function async_reader(io::IO)::Channel
    ch = Channel(1)
    task = @async begin
        buf = ""
        while isempty(buf) || buf[end] != '\n'
            data = String(readavailable(io))
            buf *= data
    bind(ch, task)
    return ch

function init()
    pin = Pipe()
    pout = Pipe()
    perr = Pipe()
    proc = run(pipeline(`julia -L server.jl`,
                        stdin = pin, stdout = pout, stderr = perr),
               wait = false)
    close.([,, pin.out])
    return pin, pout, perr

function runtest(times, pin, pout, perr)
    for k=1:times
        ch_out = async_reader(pout)
        ch_err = async_reader(perr)
        out = take!(ch_out)
        err = take!(ch_err)

p = init();
p = init();
@profile runtest(1000,p...)

In Linux, the resulting profile is

20  ./task.jl:259; (::getfield(REPL, Symbol("##26#27")){RE...
 20 ...lib/v1.1/REPL/src/REPL.jl:117; macro expansion
  20 ...lib/v1.1/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
   20 ./boot.jl:328; eval(::Module, ::Any)
    5 ./REPL[4]:3; runtest(::Int64, ::Pipe, ::Pipe, ::Pipe)
     5 ./strings/io.jl:165; write
      5 ./gcutils.jl:87; macro expansion
       5 ./io.jl:225; unsafe_write
        5 ./stream.jl:901; unsafe_write(::Base.PipeEndpoint, ::...
         5 ./stream.jl:845; uv_write(::Base.PipeEndpoint, ::Ptr...
          5 ./stream.jl:877; uv_write_async(::Base.PipeEndpoint...
    1 ./REPL[4]:5; runtest(::Int64, ::Pipe, ::Pipe, ::Pipe)
     1 ./REPL[2]:3; async_reader(::Pipe)
      1 ./task.jl:262; macro expansion
       1 ./boot.jl:384; Type
        1 ./boot.jl:384; Type
    1 ./REPL[4]:6; runtest(::Int64, ::Pipe, ::Pipe, ::Pipe)
     1 ./REPL[2]:3; async_reader(::Pipe)
      1 ./task.jl:262; macro expansion
       1 ./boot.jl:384; Type
        1 ./boot.jl:384; Type
    8 ./REPL[4]:7; runtest(::Int64, ::Pipe, ::Pipe, ::Pipe)
     8 ./channels.jl:315; take!
      8 ./channels.jl:362; take_buffered(::Channel{Any})
       8 ./event.jl:46; wait(::Condition)
        8 ./event.jl:256; wait()
         8 ./libuv.jl:100; process_events
    2 ./REPL[4]:8; runtest(::Int64, ::Pipe, ::Pipe, ::Pipe)
     2 ./channels.jl:315; take!
      2 ./channels.jl:362; take_buffered(::Channel{Any})
       2 ./event.jl:46; wait(::Condition)
        2 ./event.jl:246; wait()
         2 ./libuv.jl:98; process_events
561 ./task.jl:259; (::getfield(Main, Symbol("##5#6")){Pipe...
 561 ./REPL[2]:6; macro expansion
  561 ./io.jl:243; readavailable
   561 ./stream.jl:829; readavailable
    561 ./stream.jl:368; wait_readnb(::Base.PipeEndpoint, ::In...
     561 ./event.jl:46; wait(::Condition)
      538 ./event.jl:246; wait()
       538 ./libuv.jl:98; process_events
        1 ./stream.jl:523; uv_alloc_buf(::Ptr{Nothing}, ::UInt6...
         1 ./stream.jl:507; alloc_buf_hook(::Base.PipeEndpoint,...
          1 ./stream.jl:489; alloc_request(::Base.GenericIOBuff...
           1 ./iobuffer.jl:312; ensureroom
            1 ./iobuffer.jl:320; ensureroom
             1 ./array.jl:812; _growend!
      23  ./event.jl:256; wait()
       23 ./libuv.jl:100; process_events

In Windows, the profile is much longer. I’ll copy only the lines that correspond to ./task.jl:259:

2261 .\task.jl:259; (::getfield(Main, Symbol("##3#4")){Pipe,Channel{Any}})()
 2259 .\REPL[2]:6; macro expansion
  2259 .\io.jl:243; readavailable
   2259 .\stream.jl:829; readavailable
    1    .\stream.jl:367; wait_readnb(::Base.PipeEndpoint, ::Int64)
     1 .\stream.jl:715; start_reading
    401  .\stream.jl:368; wait_readnb(::Base.PipeEndpoint, ::Int64)
     400 .\event.jl:46; wait(::Condition)
      1   .\event.jl:245; wait()
       1 .\abstractarray.jl:863; isempty
        1 .\event.jl:266; ==
      389 .\event.jl:246; wait()
       389 .\libuv.jl:98; process_events
        3 .\stream.jl:516; uv_alloc_buf(::Ptr{Nothing}, ::UInt64, ::Ptr{Nothing})
        2 .\stream.jl:519; uv_alloc_buf(::Ptr{Nothing}, ::UInt64, ::Ptr{Nothing})
         1 .\sysimg.jl:18; getproperty
        3 .\stream.jl:523; uv_alloc_buf(::Ptr{Nothing}, ::UInt64, ::Ptr{Nothing})
        3 .\stream.jl:581; uv_readcb(::Ptr{Nothing}, ::Int64, ::Ptr{Nothing})
         3 .\stream.jl:564; (::getfield(Base, Symbol("#readcb_specialized#489")))(::Base.PipeEndpoint, :...
          3 .\event.jl:62; notify
           3 .\event.jl:62; notify
            3 .\event.jl:62; #notify#442
             3 .\event.jl:68; notify(::Condition, ::Nothing, ::Bool, ::Bool)
              3 .\event.jl:129; schedule
               3 .\event.jl:132; #schedule#443(::Bool, ::Function, ::Task, ::Nothing)
                3 .\sysimg.jl:19; setproperty!(::Task, ::Symbol, ::Nothing)
      2   .\event.jl:255; wait()
       1 .\event.jl:187; try_yieldto(::typeof(Base.ensure_rescheduled), ::Base.RefValue{Task})
       1 .\event.jl:199; try_yieldto(::typeof(Base.ensure_rescheduled), ::Base.RefValue{Task})
        1 .\sysimg.jl:19; setproperty!(::Task, ::Symbol, ::Nothing)
      8   .\event.jl:256; wait()
       8 .\libuv.jl:100; process_events
        2 .\stream.jl:523; uv_alloc_buf(::Ptr{Nothing}, ::UInt64, ::Ptr{Nothing})
         1 .\stream.jl:507; alloc_buf_hook(::Base.PipeEndpoint, ::UInt64)
          1 .\stream.jl:489; alloc_request(::Base.GenericIOBuffer{Array{UInt8,1}}, ::UInt64)
           1 .\iobuffer.jl:312; ensureroom
            1 .\iobuffer.jl:314; ensureroom
    1855 .\stream.jl:372; wait_readnb(::Base.PipeEndpoint, ::Int64)
     1855 .\stream.jl:737; stop_reading
    1    .\stream.jl:374; wait_readnb(::Base.PipeEndpoint, ::Int64)
     1 .\sysimg.jl:18; getproperty
    1    .\stream.jl:377; wait_readnb(::Base.PipeEndpoint, ::Int64)
     1 .\libuv.jl:57; unpreserve_handle(::Base.PipeEndpoint)
      1 .\abstractdict.jl:598; getindex
       1 .\abstractdict.jl:594; get
        1 .\sysimg.jl:18; getproperty
 1    .\REPL[2]:7; macro expansion
  1 .\strings\basic.jl:229; *
   1 .\strings\substring.jl:180; string(::String, ::String)
    1 .\strings\string.jl:60; _string_n
 1    .\int.jl:49; macro expansion
5    .\task.jl:294; task_done_hook(::Task)
 2 .\abstractarray.jl:1866; foreach(::getfield(Base, Symbol("##453#454")){Task}, ::Array{Any,1})
  2 .\task.jl:294; (::getfield(Base, Symbol("##453#454")){Task})(::getfield(Base, Symbol("##599#600"...
   2 .\channels.jl:197; #599
    2 .\channels.jl:230; close_chnl_on_taskdone(::Task, ::WeakRef)
 1 .\abstractdict.jl:598; getindex(::IdDict{Any,Any}, ::Any)
  1 .\abstractdict.jl:594; get
2    .\task.jl:295; task_done_hook(::Task)
 1 .\abstractdict.jl:605; delete!(::IdDict{Any,Any}, ::Any)

So, there is a lot more going on when running under Windows. I don’t understand the internals well enought to know if this points at a solution, though.


That’s a lot of waiting…

I don’t know anything about this code, but see the following note in the code:

@tshort Thanks for the link for that comment. It seems to me that the stream slowness is likely to be caused by Windows (or at least Windows + libuv) and not Julia. That’s bad news, since it means a fix is out of our hands.

I’d still like to ask if a developer who knows this part of Julia could confirm there’s no way to wring a little bit more performance under Windows.

1 Like