IOError: could not spawn <...>: argument list too long

I have a script that looks like this:

Threads.@threads for x in ...
    ...
    out = Pipe()
    err = Pipe()
    cmd = Cmd(`<executable_file>`, ignorestatus=true, dir=tmp_dir)
    process = run(pipeline(cmd, stdin=joinpath(tmp_dir, "commands"), stdout=out, stderr=err), wait=true)
    ...
end

Generally it works fine, but from time to time (after several hours) I get this exception:

ERROR: LoadError: TaskFailedException:
IOError: could not spawn setenv(`/home/aplavin/.../<executable_file>`; dir="/tmp/<temp_dir_name>"): argument list too long (E2BIG)
 [1] _spawn_primitive(::String, ::Cmd, ::Array{Any,1}) at ./process.jl:99
 [2] #550 at ./process.jl:112 [inlined]
 [3] setup_stdios(::Base.var"#550#551"{Cmd}, ::Array{Any,1}) at ./process.jl:196
 [4] _spawn at ./process.jl:111 [inlined]
 [5] _spawn(::Base.CmdRedirect, ::Array{Any,1}) at ./process.jl:139 (repeats 3 times)
 [6] run(::Base.CmdRedirect; wait::Bool) at ./process.jl:439
 [7] <line with run(pipeline(...)) from the code above>

Note that there are no command-line arguments in this run call, and the executable name is short. Moreover, when I run my script again, it skips already processed items in the for loop (by design), and successfully completes the iteration for the same x it failed before. Then after some time I get the same exception, and have to restart the script again.

Unfortunately, I could not make an MWE for this, and not even sure whether it is related to threads - the same loop without threads runs for longer without errors, but maybe it is just a matter of waiting more.

Any idea why this may happen, and how to fix/debug this error?

Turns out I can reliably reproduce a state when Julia cannot execute any external command, i.e. even run(`ls`) raises the aforementioned error IOError: could not spawn `ls`: argument list too long (E2BIG). The reproduction basically consists of running the script described in the first post for some time.
I still cannot provide an MWE, but will gladly try to debug the issue is anyone provides some pointers on what to try when run(`ls`) fails.

This happens for latest Julia (1.5.2) on linux.

This will probably take a while and use a good bit of disk space, but you could run this under strace, wait for it to fail and terminate, and then check the strace log for the failing syscall. strace will show you the exact command and arguments used.

1 Like

For now I tried attaching strace to the Julia process when it’s in the “ls-failing” state. So I attach strace, execute run(`ls`), but don’t see anything like “spawn” in the log. What exactly to look for?

I found vfork in the strace log. The seemingly-relevant part around:

write(16, "\1\0\0\0\0\0\0\0\207@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 44) = 44
epoll_ctl(3, EPOLL_CTL_MOD, 16, {EPOLLIN, {u32=16, u64=16}}) = 0
epoll_pwait(3, [], 1024, 0, NULL, 8)    = 0
epoll_pwait(3, [], 1024, 0, NULL, 8)    = 0
epoll_pwait(3, [{EPOLLIN, {u32=16, u64=16}}], 1024, -1, NULL, 8) = 1
read(16, "\1\0\0\0\0\0\0\0\210@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 67369) = 1239
epoll_pwait(3, [], 1024, 0, NULL, 8)    = 0
rt_sigprocmask(SIG_BLOCK, NULL, [INT QUIT ABRT USR1 TERM], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [INT QUIT ABRT USR1 TERM], 8) = 0
vfork()                                 = 16526
rt_sigprocmask(SIG_SETMASK, [INT QUIT ABRT USR1 TERM], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [INT QUIT ABRT USR1 TERM], 8) = 0
rt_sigprocmask(SIG_SETMASK, [INT QUIT ABRT USR1 TERM], NULL, 8) = 0
mincore(0x7f0fb9c84000, 4096, [1])      = 0
read(17, "\7", 1)                       = 1
write(18, "2", 1)                       = 1
mincore(0x7f0fb9c85000, 4096, [1])      = 0
read(17, "2", 1)                        = 1
write(18, "\0", 1)                      = 1

This corresponds to the failing run(ls). As I’m not familiar with stracing, can you please help me make some sense out of it?

Ok good, so what you’ll need to do is to run strace with either the -f or -ff flags to see what the forked process is doing.

That’s the strace log of a forked process:

fcntl(0, F_GETFD)                       = 0
fcntl(0, F_GETFL)                       = 0 (flags O_RDONLY)
fcntl(1, F_GETFD)                       = 0
fcntl(1, F_GETFL)                       = 0x1 (flags O_WRONLY)
fcntl(2, F_GETFD)                       = 0
fcntl(2, F_GETFL)                       = 0x1 (flags O_WRONLY)
rt_sigaction(SIGHUP, {sa_handler=SIG_DFL, sa_mask=[HUP], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=0x7fc7d6b84ff0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fc7d6a4a140}, 8) = 0
rt_sigaction(SIGQUIT, {sa_handler=SIG_DFL, sa_mask=[QUIT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGILL, {sa_handler=SIG_DFL, sa_mask=[ILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=0x7fc7d6b851d0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fc7d6a4a140}, 8) = 0
rt_sigaction(SIGTRAP, {sa_handler=SIG_DFL, sa_mask=[TRAP], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_IGN, sa_mask=[TRAP], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, 8) = 0
rt_sigaction(SIGABRT, {sa_handler=SIG_DFL, sa_mask=[ABRT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=0x7fc7d6b851d0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fc7d6a4a140}, 8) = 0
rt_sigaction(SIGBUS, {sa_handler=SIG_DFL, sa_mask=[BUS], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=0x7fc7d6b857f0, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_SIGINFO, sa_restorer=0x7fc7d6a4a140}, 8) = 0
rt_sigaction(SIGFPE, {sa_handler=SIG_DFL, sa_mask=[FPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=0x7fc7d6b850f0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fc7d6a4a140}, 8) = 0
rt_sigaction(SIGUSR1, {sa_handler=SIG_DFL, sa_mask=[USR1], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=0x7fc7d6b851d0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fc7d6a4a140}, 8) = 0
rt_sigaction(SIGSEGV, {sa_handler=SIG_DFL, sa_mask=[SEGV], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=0x7fc7d6b857f0, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_SIGINFO, sa_restorer=0x7fc7d6a4a140}, 8) = 0
rt_sigaction(SIGUSR2, {sa_handler=SIG_DFL, sa_mask=[USR2], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=0x7fc7d6b85460, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fc7d6a4a140}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_DFL, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, 8) = 0
rt_sigaction(SIGALRM, {sa_handler=SIG_DFL, sa_mask=[ALRM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGSTKFLT, {sa_handler=SIG_DFL, sa_mask=[STKFLT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[CHLD], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=0x7fc7d6c4d222, sa_mask=~[KILL STOP RTMIN RT_1], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d6a4a140}, 8) = 0
rt_sigaction(SIGCONT, {sa_handler=SIG_DFL, sa_mask=[CONT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTSTP, {sa_handler=SIG_DFL, sa_mask=[TSTP], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTTIN, {sa_handler=SIG_DFL, sa_mask=[TTIN], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTTOU, {sa_handler=SIG_DFL, sa_mask=[TTOU], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGURG, {sa_handler=SIG_DFL, sa_mask=[URG], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGXCPU, {sa_handler=SIG_DFL, sa_mask=[XCPU], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGXFSZ, {sa_handler=SIG_DFL, sa_mask=[XFSZ], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGVTALRM, {sa_handler=SIG_DFL, sa_mask=[VTALRM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGPROF, {sa_handler=SIG_DFL, sa_mask=[PROF], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGWINCH, {sa_handler=SIG_DFL, sa_mask=[WINCH], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGIO, {sa_handler=SIG_DFL, sa_mask=[IO], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGPWR, {sa_handler=SIG_DFL, sa_mask=[PWR], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGSYS, {sa_handler=SIG_DFL, sa_mask=[SYS], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc7d68ace30}, {sa_handler=0x7fc7d6b851d0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fc7d6a4a140}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
execve("/home/aplavin/.julia/artifacts/595a9c11691838de1f6ed43bfe902f7885cb366a/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/home/aplavin/.julia/artifacts/4b3f429bc2c468e7cb44b1670a262b13be4528f6/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)

<... same lines repeating many times ...>

execve("/home/aplavin/.julia/artifacts/4b3f429bc2c468e7cb44b1670a262b13be4528f6/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/home/aplavin/.julia/artifacts/595a9c11691838de1f6ed43bfe902f7885cb366a/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/home/aplavin/.julia/artifacts/4b3f429bc2c468e7cb44b1670a262b13be4528f6/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/home/aplavin/.julia/artifacts/595a9c11691838de1f6ed43bfe902f7885cb366a/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/home/aplavin/.julia/artifacts/4b3f429bc2c468e7cb44b1670a262b13be4528f6/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/home/aplavin/anaconda3/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/sbin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/sbin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/sbin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 ENOENT (No such file or directory)
execve("/bin/ls", ["ls"], 0x23ab620 /* 40 vars */) = -1 E2BIG (Argument list too long)
exit_group(127)                         = ?
+++ exited with 127 +++

Are you doing any manipulations of the env. vars during runtime? I suspect that you might have too many env. vars set (and their combined length is too long) leading to E2BIG.

Don’t think there are any env modifications at all. I never even access ENV in the code.

Now I checked ENV, and turns out that ENV["PATH"] is very long: it contains many repetitions of two directories, the ones in the strace log above. One of them points to a julia artifact used in the script, and the other points to a completely unrelated artifact. But I don’t append anything to PATH myself, so where could these repetitions come from?

They’re likely automatically set by Julia’s artifact system to ensure binaries and libraries provided by artifacts are accessible to external commands. About how long are these paths (in terms of bytes)?

Each (of the two) path is short, e.g. “/home/aplavin/.julia/artifacts/4b3f429bc2c468e7cb44b1670a262b13be4528f6/bin/”. However, the same two paths are repeated many many times in ENV["PATH"], so length(ENV["PATH"]) > 17000.

Ok that’s definitely the issue then. I’m not familiar with how this system works, but maybe @giordano has some ideas?

Not sure what we are talking about. Unless I missed something, you aren’t running an ExecutableProduct in a JLL package? If so, I have no idea what’s setting your PATH.

I do run an executable from a jll package many times in my script (see 1st post for description), and this leads to a long ENV["PATH"]. I would guess that running a jll package somehow appends to PATH and doesn’t always revert the changes in a multithreaded/async scenario.

No mention to JLLs as far as I can see :slightly_smiling_face:

Can you please provide a MWE? Hard to understand what’s happening without having a simple code to look at

Indeed, I didn’t think that jlls were relevant when writing the 1st post. Anyway, for now I solved the issue by specifying adjust_PATH=false and adjust_LIBPATH=false when running a jll executable. This way no changes are done to ENV and ENV["PATH"] doesn’t grow.

Thanks for helping to debug this! I wouldn’t ever guess that the error has something to do with ENV values, and that those values may grow unexpectedly.

I think that in Julia v1.6 the thread-safe addenv function is used, while in previous versions withenv is used, which is known to be thread-unsafe: https://github.com/JuliaLang/julia/issues/34726

Oh, interesting - this looks like the underlying cause. So JLL packages will work without this issue in the next julia version? Or some manual modifications will be needed?