Hi
I am trying to use Julia for general-purpose programming.
I am trying to use Julia to create a faster version of a Perl script that generates a report on a large logfile piped into STDIN. The logfile does not fit into RAM, so the data needs to be read from STDIN line by line.
My problem is that Julia (1.5.3) is twice as slow as Perl.
I was able to identify one of the main problems: the code that reads line by line from the input stream.
The following test code serves only to illustrate the issue.
Perl:
#!/bin/perl -w
use strict;
use warnings;
my $counter = 0;
while (<>) {
    $counter++;
}
print "Perl: Number of lines: $counter\n";
Julia:
#!/bin/julia
counter = 0
for line in eachline()
    global counter += 1
end
println("Julia: Number of lines: $counter")
Performance Perl:
# time (cat access.log | ./testloopspeed.pl)
Perl: Number of lines: 19567100
real 0m15.251s
user 0m10.475s
sys 0m8.209s
Performance Julia:
# time (cat access.log | ./testloopspeed.jl)
Julia: Number of lines: 19557160
real 0m36.974s
user 0m27.348s
sys 0m6.368s
While the test code above was running, I noticed that Perl uses much less RAM:
Language   Resident memory   Virtual memory
Perl       2.4 MB            127 MB
Julia      185 MB            686 MB
What am I doing wrong?
Please note that there is another topic on a very similar subject here:
First thing: never write code directly in the global scope; put it inside a function (even a main that takes no arguments and returns nothing) and call that function. Do not use global variables if possible (use variables local to that function instead). For things that will run a single time, where compilation time may be the culprit, try calling julia with the --compile=min flag.
Hello @Henrique_Becker
Thanks for your comments.
Do you think your general recommendations would make the Julia code run twice as fast?
Another question from me: how would I access STDIN from within a function? STDIN is a global handle; how would I pass a reference to it as an argument to a function?
in this case, --compile=min only makes things slower:
~ » time (cat blah.log | julia16 --startup-file=no blah.jl)
Julia: Number of lines: 100000
( cat blah.log | julia16 --startup-file=no blah.jl; ) 0.25s user 0.44s system 355% cpu 0.195 total
------------------------------------------------------------------------------------------------------------------------
~ » time (cat blah.log | julia16 --compile=min --startup-file=no blah.jl)
Julia: Number of lines: 100000
( cat blah.log | julia16 --compile=min --startup-file=no blah.jl; ) 4.30s user 0.45s system 111% cpu 4.267 total
Perl:
0.02s user 0.00s system 101% cpu 0.020 total
and putting things into a function doesn’t help because you’re just running it once
it doesn’t matter in this case; you will know once you’ve tried it:
function f()
    counter = 0
    for line in eachline()
        counter += 1
    end
    return counter
end
println("Julia: Number of lines: $(f())")
~ » time (cat blah.log | julia16 --startup-file=no blah.jl)
Julia: Number of lines: 100000
( cat blah.log | julia16 --startup-file=no blah.jl; ) 0.25s user 0.44s system 360% cpu 0.191 total
I think in this case counter being a global variable only added negligible overhead compared to the really slow eachline()
I don’t think Julia can perform better with such a small test. There is barely any logic being executed. 99% of the Perl code is probably C, i.e. not interpreted. With Julia you have the compile time getting in the way, so you need a file and enough logic to dwarf that time.
I used this Julia file:
using Dates
function from_stdin()
    start = now()
    counter = 0
    for line in eachline()
        counter += 1
    end
    finish = now()
    println("STDIN read time ($counter lines): $(finish - start)")
end
from_stdin()
With 10 million lines, Perl is about 1/3 faster:
$ time ./ptest < 10_000_000.txt
Perl: Number of lines: 10000000
real 0m0.937s
user 0m0.893s
sys 0m0.043s
$ time julia read.jl < 10_000_000.txt
STDIN read time (10000000 lines): 1214 milliseconds
real 0m1.430s
user 0m1.350s
sys 0m0.569s
When I go up to 1,000,000,000 lines, the difference gets narrower:
$ time julia read.jl < 1_000_000_000.txt
STDIN read time (1000000000 lines): 113656 milliseconds
real 1m53.871s
user 1m44.570s
sys 0m9.574s
$ time ./ptest < 1_000_000_000.txt
Perl: Number of lines: 1000000000
real 1m36.142s
user 1m25.340s
sys 0m10.625s
My test files are not that big: the 10,000,000-line file is only 123 MiB, while the 1,000,000,000-line file is about 14 GiB, so both fit into memory.
Just for grins, I moved the 14 GiB file into /tmp, which is a tmpfs file system (RAM), to remove any disk access time, and I get:
$ time ./ptest < /tmp/1_000_000_000.txt
Perl: Number of lines: 1000000000
real 1m31.380s
user 1m28.799s
sys 0m2.390s
$ time julia read.jl < /tmp/1_000_000_000.txt
STDIN read time (1000000000 lines): 108426 milliseconds
real 1m50.210s
user 1m45.692s
sys 0m3.708s
No. However, I do not see why not to use them either. It is easier to just use them from the start than to benchmark the difference they make.
You can just pass stdin to any function as a parameter; in my suggestion (creating a main function) you could pass it to main in the call. However, this will probably make little difference, as pointed out by @jling. I was interested in making counter a local variable; I was not even thinking about stdin.
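A minimal sketch of what that looks like, using the counting loop from above (count_lines is just an illustrative name):
function count_lines(io::IO)
    counter = 0
    for line in eachline(io)
        counter += 1
    end
    return counter
end
# stdin is an ordinary IO value, so it can be passed like any argument;
# the same function also works with an open file handle.
println("Julia: Number of lines: $(count_lines(stdin))")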
No, I actually really need to perform some logic and use regex to extract substrings.
I tried running the full program with all logic in Julia, and it was twice as slow as Perl.
Then I tried to find the reason, and I suspect the reason Julia is so slow lies in eachline() itself.
apparently “some logic” is too light to make a difference.
regex is not something Julia can shine at; we just call http://www.pcre.org/, so… again, Perl is probably just running C code here which is already optimized. This is similar to benchmarking linear algebra code: everyone is just calling OpenBLAS/MKL
I initially ran the Julia program with all the logic in it, so that it would extract substrings, store them in dictionaries, and do the reporting at the end.
As a result, Julia was twice as slow as Perl. I investigated by removing more and more logic and came to the conclusion that it must be eachline() that is so slow.
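For reference, a hypothetical sketch of the shape of that program; the regex and the extracted field are invented for illustration, and the real logfile format may differ:
function report(io::IO)
    counts = Dict{String,Int}()
    for line in eachline(io)
        # e.g. extract the first whitespace-delimited field (say, the client IP)
        m = match(r"^(\S+)\s", line)
        m === nothing && continue
        # copy out of the line so the Dict doesn't keep whole lines alive
        key = String(m.captures[1])
        counts[key] = get(counts, key, 0) + 1
    end
    # reporting at the end, most frequent entries first
    for (key, n) in sort!(collect(counts); by = last, rev = true)
        println("$key\t$n")
    end
end
report(stdin)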
14 GB won’t fit into my memory. But why would anyone even try this, when it can be done more cheaply by stream processing? We did that 20 years ago with awk.
Are you implicitly saying that Julia is not such a good fit for general-purpose programming? I struggle to believe that.
I would rather suspect that eachline() just isn’t as optimized as it should be. I don’t see a reason why JIT-compiled code should run significantly slower than statically compiled code.
In my case, the difference between Perl and Julia is 20 s (same logic, same structure). That difference cannot be explained by the initial delay of JIT compilation.
I’m not; I’m saying “run a regex on each line of a big text file” is not exactly some high-level, complex use case calling for a general-purpose programming language; you can use awk, grep, or some other GNU/POSIX utility to do that.
Julia can do all of these just fine of course, and there’s certainly room for improvement in I/O.
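If the per-line String allocation turns out to be the bottleneck, one possible workaround (just a sketch, assuming you only need a line count and not the text itself) is to read fixed-size chunks and count newlines by hand:
function count_newlines(io::IO; bufsize::Int = 1 << 16)
    buf = Vector{UInt8}(undef, bufsize)
    counter = 0
    while !eof(io)
        n = readbytes!(io, buf)  # fill the buffer with up to bufsize bytes
        counter += count(==(UInt8('\n')), view(buf, 1:n))
    end
    return counter
end
println("Julia: Number of lines: $(count_newlines(stdin))")
This avoids materializing a String per line, at the cost of giving up the convenient line-by-line interface.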
I am trying to use Julia’s eachline() in a way one would use awk. I would not expect Julia to be significantly slower than any other language, including compiled C.
I’m not sure I’d say significant. I suspect Perl is doing some trickery since you are not using the line being read. Try these two programs; they both report their speed in seconds, so hopefully we’re comparing apples to apples:
Perl:
#!/bin/perl -w
use strict;
use warnings;
use utf8;
use Time::HiRes qw(time);
my $start = time();
my $counter = 0;
while (<>) {
    $counter = $counter + length($_);
}
my $total = time() - $start;
print "STDIN read time ($counter chars): $total\n";
Julia:
using Dates
function from_stdin()
    start = now()
    counter = 0
    for line in eachline(; keep = true)
        counter += length(line)
    end
    total = Dates.value(now() - start) / 1000
    println("STDIN read time ($counter chars): $total")
end
from_stdin()
For me on a 100 million line file I get:
$ ./ptest < 100_000_000.txt
STDIN read time (1388888898 chars): 11.8912749290466
$ julia test.jl < 100_000_000.txt
STDIN read time (1388888898 chars): 12.91
So a difference of 1 second over a runtime of 12 seconds. Perl still runs faster on this small test; no clue where the overhead is. I suspect Julia might be copying bytes out of the buffer when it converts them to a string, and Perl might not be, but that’s just a wild guess.
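One way to probe that guess (a sketch only; "access.log" stands in for any test file) is to look at the allocation count that @time reports, since each eachline iteration materializes a fresh String:
function count_lines(io::IO)
    n = 0
    for _ in eachline(io)
        n += 1
    end
    return n
end
open("access.log") do io
    count_lines(io)         # first call: warm-up, so compilation is excluded
end
open("access.log") do io
    @time count_lines(io)   # @time also reports number and size of allocations
end
If the allocation count scales with the number of lines, that would be consistent with one String being allocated per line.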