Deadlock in reading stdout from cmd

Hi,

When reading the stdout of an external program, we sometimes run into what looks like a deadlock. We can reproduce it with a very stripped-down example: cat a large file and read the resulting stream:

using FileIO

function load(s::Pipe)
    @show s
    a = read(s)
    @show summary(a)
end

open(load, `cat libLAS_1.2.laz`, "r")

Run this multiple times:

for i in `seq 1 1000`; do
    echo $i; julia test_deadlock.jl; echo $i
done
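
The same stress loop can also be driven from Julia itself; a minimal sketch, assuming the snippet above is saved as test_deadlock.jl:

# Spawn a fresh Julia process per iteration, mirroring the bash loop above.
for i in 1:1000
    println(i)
    run(`julia test_deadlock.jl`)   # blocks until that run finishes (or hangs with it)
    println(i)
end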

When running this (with the bash loop above), the Julia process eventually hangs without showing any output at all. Looking at the call stacks of both the child process and Julia itself (sampled on a Mac here):

Julia Call graph:
    2698 Thread_5295470   DispatchQueue_1: com.apple.main-thread  (serial)
    + 2698 start_task  (in libjulia.0.5.2.dylib) + 356  [0x1001e3a54]  task.c:254
    +   2698 jl_apply_generic  (in libjulia.0.5.2.dylib) + 1000  [0x1001c5608]  .julia_internal.h:211
    +     2698 ???  (in <unknown binary>)  [0x314039390]
    +       2698 ???  (in <unknown binary>)  [0x31403937a]
    +         2698 jl_apply_generic  (in libjulia.0.5.2.dylib) + 1000  [0x1001c5608]  .julia_internal.h:211
    +           2698 jlcall_wait_20332  (in sys.dylib) + 12  [0x101bf9f7c]
    +             2698 julia_wait_20332  (in sys.dylib) + 264  [0x101bf9ed8]  .event.jl:27
    +               2698 julia_wait_20334  (in sys.dylib) + 394  [0x101bf963a]  .event.jl:147
    +                 2698 julia_process_events_20337  (in sys.dylib) + 161  [0x101bf91b1]  .libuv.jl:82
    +                   2698 uv_run  (in libjulia.0.5.2.dylib) + 223  [0x1002a064f]  core.c:354
    +                     2698 uv__io_poll  (in libjulia.0.5.2.dylib) + 1815  [0x1002be5b7]  kqueue.c:160
    +                       2698 kevent  (in libsystem_kernel.dylib) + 10  [0x7fffbfdb8d96]

and

Cat Call graph:
    2762 Thread_5295491   DispatchQueue_1: com.apple.main-thread  (serial)
      2762 start  (in libdyld.dylib) + 1  [0x7fffbfc89235]
        2762 ???  (in cat)  load address 0x10e560000 + 0x16a0  [0x10e5616a0]
          2762 ???  (in cat)  load address 0x10e560000 + 0x19ca  [0x10e5619ca]
            2762 write  (in libsystem_kernel.dylib) + 10  [0x7fffbfdb97e6]

It seems Julia hangs waiting to read from the pipe (in kevent/kqueue here), while cat is blocked in write. This happens quickly on both Julia 0.5 and 0.6, on both Linux and Mac. On Windows I can’t reproduce it using cat, but I have run into it with other programs writing to stdout.

Could you help us figure out whether this is a bug in Julia (or libuv?), or whether we’re using open incorrectly? Any feedback is appreciated.

Cheers


Some further investigation and a slightly simpler testcase.

I used a 1 MiB testfile created like this:

write("data.txt", randstring(2^20 - 1) * "\n")

And instead of cat I now used dd to be able to control the block size.

With the default block size of 512 bytes it hangs quite often:

for i in 1:1000
    readstring(`dd if=data.txt bs=512`)
end

However, with a block size of 1 byte, I haven’t managed to make it hang once:

for i in 1:1000
    readstring(`dd if=data.txt bs=1`)
end

So the block size might be related. I also tried the dd output flags direct, nocache and sync, but it still hangs with those enabled. A block size of 1M (so that the whole file fits in a single block) also hangs.
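
For completeness, the variants mentioned above would look roughly like this (a sketch; the exact oflag spellings assume GNU dd):

readstring(`dd if=data.txt bs=512 oflag=direct`)   # still hangs
readstring(`dd if=data.txt bs=512 oflag=nocache`)  # still hangs
readstring(`dd if=data.txt bs=512 oflag=sync`)     # still hangs
readstring(`dd if=data.txt bs=1M`)                 # whole file in one block; also hangs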

Interestingly it always seems to hang after 65536 (2^16) bytes. With bs=512 I get this dd output for a successful run:

2048+0 records in
2048+0 records out
1048576 bytes (1,0 MB, 1,0 MiB) copied, 0,00160257 s, 654 MB/s

And this output for a hang:

129+0 records in
128+0 records out
65536 bytes (66 kB, 64 KiB) copied, 9,47607 s, 6,9 kB/s

Well, using a pipeline with dd and a small, explicitly chosen block size seems to hide the issue:

open(load, pipeline(`cat libLAS_1.2.laz`,`dd bs=8`), "r")

This works for me over a thousand runs, although @visr said he experienced one hang with bs=4.
The larger the block size, the more often I see deadlocks; a block size of 64 instantly hangs for me.

The fact that it instantly hangs around the 64 KiB mark is because of the maximum kernel pipe buffer size. The kernel just can’t buffer more than that, and Julia seems to keep waiting to receive more data (without taking that limit into account?). There is an old issue for this, but it’s closed now: https://github.com/JuliaLang/julia/issues/8789. Maybe this should just be documented.
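
The numbers from the hung dd run above line up with that limit:

128 * 512   # records dd actually got out × block size = 65536 bytes
64 * 1024   # = 65536 bytes, the default pipe capacity (64 KiB) on Linux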

I can’t say where exactly things go wrong. On GitHub we have:
Issue of writing to pipe on Mac: https://github.com/JuliaLang/julia/issues/20812
Issue of readandwrite deadlock: https://github.com/JuliaLang/julia/issues/18840
Issue of intermittent failing deadlock in tests: https://github.com/JuliaLang/julia/issues/14747

@Keno Could you shed some light on this?

At the moment it seems that pipes (and thus interaction with external programs) can’t be used reliably for anything that reads or writes more than a few kilobytes.

I can reproduce this if I run something like

while true
    readstring(`dd if=/dev/zero bs=1048576 count=2`)
end

on all cores.
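
(i.e. several of those loops running concurrently; a rough sketch of one way to do that from a single session, starting one loop per core:)

# One dd-reading loop per core, all hammering the same libuv event loop.
# This never returns on its own; it just runs until something deadlocks.
@sync for _ in 1:Sys.CPU_CORES
    @async while true
        readstring(`dd if=/dev/zero bs=1048576 count=2`)
    end
end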

Can you file an issue? It seems like something is dropping an event notification.

OK, thanks for checking, Jameson. I filed an issue:
https://github.com/JuliaLang/julia/issues/22832