Stdin/stdout 10x slower than file open


#1

I came across a curious problem writing some code converting files. It turned out to be slower than a previously used scripting language, and not approaching (expected) C speed at all. I could get it to approximate C speed by using open instead of stdin/stdout. I don’t see a good reason for stdin/stdout being this much slower, and would like to get the same speed for stdin/stout (in C stdin/stdout seems generally even a tiny bit faster).
Any ideas? Below test code:

To create a large enough test file:

f=open("csvfile.csv","w")
for i = 1:10000000
	write(f,"short,\"with,comma\",\"a bit\tlonger\",\"very\nlong indeed\"\n")
end
close(f)

stdin/stdout version (csv2tsv_stdout.jl)

function csv2tsv()
	f=open("tmp/csvfile.csv")
	o=open("tmp/csvfile.tsv","w")
	prev = '\n'
	state = 'c'
	while(!eof(f)) 
		line = readline(f)
		for c in line
			if (c == '"')
				if (state == 'c')
					state = 'o'
				else
					state = 'c'
				end
				if (prev == '"') write(o,'"') end
			elseif (c == ',')
				if (state == 'c')
					write(o,'\t')
				else
					write(o,',')
				end
			elseif (c == '\t')
				write(o,'\\')
				write(o,'t')
			else
				write(o,c)
			end
			prev = c
		end
		if (state == 'c')
			write(o,'\n')
		else
			write(o,'\\')
			write(o,'n')
		end
		prev = '\n'
	end
	close(o)
	close(f)
end
csv2tsv()

file version (csv2tsv_file.jl)

function csv2tsv()
	f=open("tmp/csvfile.csv")
	o=open("tmp/csvfile.tsv","w")
	prev = '\n'
	state = 'c'
	while(!eof(f)) 
		line = readline(f)
		for c in line
			if (c == '"')
				if (state == 'c')
					state = 'o'
				else
					state = 'c'
				end
				if (prev == '"') write(o,'"') end
			elseif (c == ',')
				if (state == 'c')
					write(o,'\t')
				else
					write(o,',')
				end
			elseif (c == '\t')
				write(o,'\\')
				write(o,'t')
			else
				write(o,c)
			end
			prev = c
		end
		if (state == 'c')
			write(o,'\n')
		else
			write(o,'\\')
			write(o,'n')
		end
		prev = '\n'
	end
	close(o)
	close(f)
end
csv2tsv()

And the times

time julia julia/csv2tsv_stdout.jl < csvfile.csv > csvfile.tsv
real    0m57.462s
user    0m57.105s
sys     0m0.580s

time julia csv2tsv_file.jl
real    0m7.463s
user    0m4.917s
sys     0m0.673s

# C version
time csv2tsv  < csvfile.csv > csvfile.tsv
real    0m6.118s
user    0m3.577s
sys     0m0.469s

#2

Both versions seem to be the same code?
I’ve suggested using IOBuffers() for this sort of thing before, and it did shave about half off the long versions run (untested 57 sec) -> 27 22.7 sec in memory


#3

Files are buffered, whereas stdin/stdout are TTY devices that have no buffering AFAIK: print('x') in a TTY sends the output to the terminal immediately.

(As @y4lu mentioned, you can use an IOBuffer to get buffered stdout output if you want.)


#4

I wonder if it might be calling julia from a command-line, start-up times are pretty good but would still add a moment if we’re timing it…? Or is that mostly negligable for the non-interactive mode?


#5

Oh, I didn’t notice that portion of your post; normally it’s not that meaningful to time Julia from the command line because it includes so much one-time overhead, and launching julia over and over is not an effective way to use the language.

Time Julia from within Julia. If you need to run Julia code on a whole bunch of files or something, don’t use the shell to launch one julia process per file; pass the whole list of files to Julia (or do all of your shell-like coding from within Julia) and push all the loops into a single julia process.


#6

This will probably be of use, argparse.jl


#7

Thanks for the suggestions so far (even though they do not solve the problem entirely yet). To explain further, I want the stdin/stdout functionality because I want to write programs that can be used in a unix pipe. This means it has to be run from the commandline. In a pipe there are no filenames to begin with, so argparse cannot help. I made the data set large enough that startup and compilation overhead is negligable, and both versions are called from the commandline anyway.

The buffering seems a useful direction: Coming from a unix and C background, I supposed that opened files and stdin/stdout are basically the same thing, but as you suggest in Julia they seem not to be: IOStream() vs Base.TTY. Adding buffering (IOBuffers) does not help enough to solve the problem. Are there other implementation differences? If directed to a terminal, it will be slow obviously, but when redirected to and from a file, or in a pipe, this is not actually a TTY and it should have the same speed as normal filing operations …


#8

Does referencing stdin and stdout instead of the temp files work?

  • ie readline(stdin), read and write direct to them, rather than try to open

#9

Can you

$ strace -o ./trace julia foo.jl <infile >outfile

to see how this buffers?

Because when julia is started with shell redirection, then this should be buffered. Otoh, the overhead of calling into C (which handles the buffering) might be killing you here, but let’s check first that it is not syscalls (which would be so much worse).


#10

I found the cause of the problem (and a partial solution). Looking at the strace (yes it was buffered) gave me a hunch to check the type of stdin/stdout when redirecting. Its type is indeed different (IOStream) from interactive use (Base.TTY). This in turn made me check for type (in)stability, and that seems to be the culprit. Using

f = stdin::IOStream
o = stdout::IOStream

solves the problem when redirecting. However, it gives an error when piping, stdout having yet another type in this case (Base.PipeEndpoint)

Allowing the function to specialize by giving stdin and stdout as parameters also seems to work (same speed as files):

function csv2tsv(f,o)
...
end
csv2tsv(stdin,stdout)

and without the error when used in a pipe.
However, although use in a pipe works, it becomes unbearably slow (~60 times slower than redirect to file): strace shows that the Base.PipeEndpoint stdout type does not buffer, and sprinkles 2 epoll_wait calls between each write (of one character)!
Does anybody know why this type shifting of stdin/stdout is happening? Redirecting and piping is quite essential (and extremely useful) on Unix systems, and this shifting type seems to be interfering badly with it.
It seems to me the standard streams should always simply be streams, regardless of the connection.


#11

This is not nice at all!

Simple reproduction:

julia> pp=open(`head -n 1`, "r+")
Process(`head -n 1`, ProcessRunning)

julia> write(pp, 1);
julia> write(pp, 2);
julia> write(pp, 3);
julia> write(pp, "\n");
julia> write(pp, "\n");
ERROR: IOError: write: broken pipe (EPIPE)

Expectation: Only send the data and error once the pipe buffer is full or closed or manually flushed (buffer might mean /proc/sys/fs/pipe-max-size bytes, which is 1 MB for me).

Alternatively, create and document options for buffering (options: no buffering, until newline, until full with option to shrink below system max). I would have expected a discussion of all available options in https://docs.julialang.org/en/v1/devdocs/stdio/ and/or https://docs.julialang.org/en/v1/manual/running-external-programs/ and/or https://docs.julialang.org/en/v1/manual/networking-and-streams/.

If these options exist, it would be your job to check whether stdin/stdout/stderr are pipes and enable proper buffering.It would be preferable if open of commands had proper buffering options; alternatively, one should set them after the fact (first open, then set buffering options).

I don’t understand whether libuv or the sendbuf field of PipeEndpoint is supposed to be responsible for buffering. In view of all the multithreading issues of lock-or-no-lock I suspect that libuv must do the buffering and sendbuf is dead code?

Also, I have committed the same sin of passing pipes to opened commands around like candy; but if writes are unbuffered, then this is hell. Thank you for making me aware of this problem!

I am tempted to open an issue for this, but will let it wait for another day on discourse in case this is a failure-to-RTFM problem on my side. Pinging @jameson due to commit history of streams.jl.

Also, I don’t really get libuv. Very short poking to see whether there is an obvious bad option set leaves me at a loss.

julia> nn=Ref(Cint(0));
julia> ccall(:uv_recv_buffer_size, Cint, (Ptr{Cvoid},Ptr{Cint}), pp.in.in.handle, nn)
-88
julia> ccall(:uv_send_buffer_size, Cint, (Ptr{Cvoid},Ptr{Cint}), pp.in.in.handle, nn)
-88
julia> ccall(:uv_is_writable, Csize_t, (Ptr{Cvoid},), pp.in.in.handle)
0x0000000000000001
julia> ccall(:uv_stream_get_write_queue_size, Csize_t, (Ptr{Cvoid},), pp.in.in.handle)
ERROR: ccall: could not find function uv_stream_get_write_queue_size

I presume that the negative value actually means 2GB, since “misreported by factor of two on linux” which wraps negative (or this is supposed to be an unsigned int, or size_t or the libuv documentation sucks). And I found no API for checking whether a pipe is set to be blocking.

PS. Opened an issue.


#12

There was an odd possibly related error that you may be able to make sense of. I haven’t seen it for a while -it was running remotely on a linux in julia 0.6, but it was along the lines of > Write access denied, when reading a file using csvread, but was fine if read as a string into a buffer then csvread(). Possibly a seek() error(?) as it shouldn’t have been writing anything


#13

No idea. I don’t see how such an error should be stemming from a lack of buffering.


#14

I think I put it down to not having the seekable=true or optional writable=true filestream properties that you get with the default iobuffers and most files normally, maybe just a weird file or something

If there’s no buffer then you can probably seek forward, but not back?