Bad performance of eachline() on STDIN

Hi
I am trying to use Julia for general-purpose programming. Specifically, I am trying to create a faster version of a Perl script that produces a report from a large logfile piped into STDIN. The logfile does not fit into RAM, so the data needs to be read from STDIN line by line as a stream.

My problem is that Julia (1.5.3) is twice as slow as Perl.

I was able to identify one of the main problems, which is the code to read line-by-line from the input stream.

I use the following test code only to illustrate the issue.

Perl:

#!/bin/perl -w
use strict;
use warnings;
my $counter = 0;
while(<>) {
    $counter++;
}
print "Perl:  Number of lines: $counter\n";

Julia:

#!/bin/julia
counter = 0
for line = eachline()
    global counter += 1
end
println("Julia: Number of lines: $counter")

Performance Perl:

# time (cat access.log | ./testloopspeed.pl)
Perl:  Number of lines: 19567100

real    0m15.251s
user    0m10.475s
sys     0m8.209s

Performance Julia:

# time (cat access.log | ./testloopspeed.jl)
Julia: Number of lines: 19557160

real    0m36.974s
user    0m27.348s
sys     0m6.368s

While running the above test code, I noticed that Perl uses much less RAM:

Language  Resident memory  Virtual memory
Perl      2.4 MB           127 MB
Julia     185 MB           686 MB

What am I doing wrong?

Please note that there is another topic on a very similar subject here:

Regards
Toni


First thing: never write code directly in global scope. Put it inside a function (even a main taking no arguments and returning nothing) and call that function. Avoid global variables where possible (use variables local to the function instead). For code that runs only once, where compilation time may be the culprit, try calling julia with the --compile=min flag.


Hello @Henrique_Becker
Thanks for your comments.
Do you think your general recommendations would make the Julia code run twice as fast?
Another question: how would I access STDIN from within a function? stdin is a global handle; how would I pass a reference to it as an argument to a function?

Thanks, Toni


in this case, --compile=min only makes things slower:

~ » time (cat blah.log | julia16 --startup-file=no blah.jl)
Julia: Number of lines: 100000
( cat blah.log | julia16 --startup-file=no blah.jl; )  0.25s user 0.44s system 355% cpu 0.195 total
------------------------------------------------------------------------------------------------------------------------
~ » time (cat blah.log | julia16 --compile=min --startup-file=no blah.jl) 
Julia: Number of lines: 100000
( cat blah.log | julia16 --compile=min --startup-file=no blah.jl; )  4.30s user 0.45s system 111% cpu 4.267 total

Perl:

0.02s user 0.00s system 101% cpu 0.020 total

and putting things into a function doesn’t help because you’re just running it once

The objective of putting things inside a function was to remove variables from global scope, so I do not understand your comment.

It doesn’t matter in this case; you will see once you’ve tried it:

function f()
    counter = 0
    for line = eachline()
        counter += 1
    end
    counter
end
println("Julia: Number of lines: $(f())")

~ » time (cat blah.log | julia16 --startup-file=no blah.jl) 
Julia: Number of lines: 100000
( cat blah.log | julia16 --startup-file=no blah.jl; )  0.25s user 0.44s system 360% cpu 0.191 total

I think in this case counter being a global variable only added negligible overhead compared to the really slow eachline()


I don’t think Julia can perform better with such a small test. There is barely any logic being executed, and 99% of the Perl code is probably C, i.e. not interpreted. With Julia you have the compile time getting in the way, so you need a file and logic big enough to dwarf that time.
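That compile-time overhead is easy to observe directly: the first call to a function pays one-time JIT compilation, while later calls do not. A minimal sketch, timing two identical calls:

```julia
function count_lines(io::IO)
    n = 0
    for _ in eachline(io)
        n += 1
    end
    return n
end

# The first call includes JIT compilation of count_lines;
# the second call shows the steady-state cost.
@time count_lines(IOBuffer("a\nb\nc\n"))
@time count_lines(IOBuffer("a\nb\nc\n"))
```

On a tiny input like this, essentially all of the first `@time` is compilation.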

I used this Julia file:

using Dates

function from_stdin()
    start = now()
    counter = 0
    for line = eachline()
        counter += 1
    end
    finish = now()
    println("STDIN read time ($counter lines): $(finish - start)");
end

from_stdin()

With 10 million lines perl is about 1/3 faster:

$ time ./ptest < 10_000_000.txt 
Perl:  Number of lines: 10000000

real	0m0.937s
user	0m0.893s
sys	0m0.043s

$ time julia read.jl < 10_000_000.txt 
STDIN read time (10000000 lines): 1214 milliseconds

real	0m1.430s
user	0m1.350s
sys	0m0.569s

When I go up to 1,000,000,000 lines the differences get narrower:

$ time julia read.jl < 1_000_000_000.txt 
STDIN read time (1000000000 lines): 113656 milliseconds

real	1m53.871s
user	1m44.570s
sys	0m9.574s

$ time ./ptest < 1_000_000_000.txt 
Perl:  Number of lines: 1000000000

real	1m36.142s
user	1m25.340s
sys	0m10.625s

My test files are not that big, the 10,000,000 line file is only 123MiB while the 1,000,000,000 line file is about 14GiB, so both can fit into memory.

Just for grins I moved the 14GiB file into /tmp which is a tmpfs file system (RAM) to remove any disk access times and I get:

$ time ./ptest < /tmp/1_000_000_000.txt 
Perl:  Number of lines: 1000000000

real	1m31.380s
user	1m28.799s
sys	0m2.390s

$ time julia read.jl < /tmp/1_000_000_000.txt 
STDIN read time (1000000000 lines): 108426 milliseconds

real	1m50.210s
user	1m45.692s
sys	0m3.708s

No. However, I do not see why not to use them either. It is easier to just use them from the start than to benchmark the difference they make.

You can just pass stdin to any function as a parameter; in my suggestion (creating a main function) you could pass it to main in the call. However, this will probably make little difference, as pointed out by @jling. I was interested in making counter a local variable; I was not even thinking about stdin.
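A minimal sketch of that pattern (the name `main` is just a placeholder): take the stream as a parameter, so `counter` is local and the function works on any `IO`, not only `stdin`:

```julia
# Wrap the loop in a function and take the stream as a parameter,
# so `counter` is a local variable rather than a global.
function main(io::IO)
    counter = 0
    for line in eachline(io)
        counter += 1
    end
    return counter
end

# At the bottom of the script, pass the global stream explicitly:
println("Julia: Number of lines: $(main(stdin))")
```

The same function can then be tested against an `IOBuffer` without touching `stdin` at all.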

eachline will collect each line into a string, which can be expensive if you only want the number of lines. Try this instead:

let counter = 0
    while !eof(stdin)
        counter += read(stdin, Char) == '\n'
    end
    println("Julia: Number of lines: $counter")
end

No, I actually really need to perform some logic and use regex to extract substrings.
I tried running the full program with all logic in Julia, and it was twice as slow as Perl.
Then I tried to find the reason, and I suspect the reason for Julia being so slow is in eachline() itself.


apparently “some logic” is too light to make a difference.

Regex is not something Julia can shine at: we just call PCRE (http://www.pcre.org/), so again Perl is probably just running already-optimized C code here. This is similar to benchmarking linear algebra code, where everyone is just calling OpenBLAS/MKL.
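For the real script, one Julia-side detail that does matter is constructing the `Regex` only once: an `r"..."` literal is compiled a single time, while `Regex(str)` built inside the loop would recompile on every iteration. A hypothetical sketch (the log format and the status-code field are assumptions, not from the original script):

```julia
# Count HTTP status codes from common-log-format lines on an IO stream.
# The r"..." literal is compiled once by PCRE, not once per line.
function count_statuses(io::IO)
    pat = r"\" (\d{3}) "   # status code right after the quoted request field
    counts = Dict{String,Int}()
    for line in eachline(io)
        m = match(pat, line)
        m === nothing && continue
        status = String(m[1])
        counts[status] = get(counts, status, 0) + 1
    end
    return counts
end
```

The `Dict` accumulation mirrors the "store them in Dictionaries, report at the end" structure described later in the thread.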

I initially ran the Julia program with all the logic in it, so that it would extract substrings, store them in dictionaries, and do the reporting at the end.
As a result, Julia was twice as slow as Perl. I investigated by removing more and more logic, and came to the conclusion that it must be eachline() that’s so slow.

14GB won’t fit into my memory. But why would anyone even try to do this when it can be done more cheaply by stream processing? We did that 20 years ago with awk.


Are you implicitly saying that Julia is not such a good fit for general purpose programming? I struggle to believe that.

I would rather suspect that eachline() just isn’t as optimized as it should be. I don’t see a reason why JIT compiled code would run significantly slower than statically compiled code.

In my case, the difference between Perl and Julia is 20s. (Same logic, same structure) That difference cannot be explained by the initial pre-delay of JIT compilation.


I wouldn’t; I’d use wc -l < log.txt. But you were the one complaining about eachline() being “slow”, so I’m confused now.

I’m not. I’m saying “run a regex on each line of a big text file” is not exactly a high-level, complex use case calling for a general-purpose programming language; you can use awk, grep, or some other GNU/POSIX utility for that.

Julia can do all of this just fine, of course, and there’s certainly room for improvement in I/O.


I am trying to use Julia’s eachline() in a way one would use awk. I would not expect Julia to be significantly slower than any other language, including compiled C.

I’m not sure I’d call it significant. I suspect Perl is doing some trickery since you are not using the line being read. Try these two programs; they both report their speed in seconds, so hopefully we’re comparing apples to apples:

Perl:

#!/bin/perl -w
use strict;
use warnings;
use utf8;
use Time::HiRes qw(time);

my $start   = time();
my $counter = 0;
while(<>) {
    $counter = $counter + length($_);
}
my $total = time() - $start;
print "STDIN read time ($counter chars): $total\n";

Julia:

using Dates

function from_stdin()
    start = now()
    counter = 0
    for line = eachline(;keep=true)
        counter += length(line)
    end
    total = Dates.value(now() - start)/1000
    println("STDIN read time ($counter chars): $total");
end
from_stdin()

For me on a 100 million line file I get:

$ ./ptest < 100_000_000.txt 
STDIN read time (1388888898 chars): 11.8912749290466

$ julia test.jl < 100_000_000.txt 
STDIN read time (1388888898 chars): 12.91

So a difference of 1 second over a runtime of 12 seconds. Perl still runs faster on this small test; no clue where the overhead is. I suspect Julia might be copying bytes out of the buffer when it converts them to a string, and Perl might not be, but that’s just a wild guess.
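One way to probe that copying theory is to bypass strings entirely and count newline bytes in a single reusable buffer. This is a sketch for line counting only, not a drop-in replacement for logic that needs the line contents:

```julia
# Count '\n' bytes from a stream without allocating a String per line:
# read chunks into one reusable byte buffer and count newlines in place.
function count_newlines(io::IO; bufsize::Int = 1 << 16)
    buf = Vector{UInt8}(undef, bufsize)
    total = 0
    while !eof(io)
        n = readbytes!(io, buf)                       # bytes actually read
        total += count(==(UInt8('\n')), view(buf, 1:n))
    end
    return total
end
```

For the original test, this would be called as `count_newlines(stdin)`; any difference from the `eachline()` timing then approximates the per-line String cost.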


@pixel27 Thanks for taking the time.

I used these exact code snippets you suggested, just added a shebang to the Julia code at the beginning.
I can still see a significant difference:

$ cat access.log | ./testloopspeed2.pl
STDIN read time (3013196840 chars): 20.3576831817627

$ cat access.log | ./testloopspeed2.jl
STDIN read time (3011271087 chars): 39.258

I am using a real logfile in this test.
My version of Julia is 1.5.3.

So what is your operating system, RAM, CPU, hard drive, Perl version? Did you try Julia 1.6-rc1?

Why does it matter? They tested against Perl on the same system.