I’m not sure I’d say significant. I suspect Perl is doing some trickery since you are not using the line being read. Try these two programs; they both report their speed in seconds, so hopefully we’re comparing apples to apples:
Perl:
#!/usr/bin/perl -w
use strict;
use warnings;
use utf8;
use Time::HiRes qw(time);

my $start = time();
my $counter = 0;
while (<>) {
    $counter = $counter + length($_);
}
my $total = time() - $start;
print "STDIN read time ($counter chars): $total\n";
Julia:
using Dates

function from_stdin()
    start = now()
    counter = 0
    for line in eachline(; keep=true)
        counter += length(line)
    end
    total = Dates.value(now() - start) / 1000
    println("STDIN read time ($counter chars): $total")
end

from_stdin()
For me on a 100 million line file I get:
$ ./ptest < 100_000_000.txt
STDIN read time (1388888898 chars): 11.8912749290466
$ julia test.jl < 100_000_000.txt
STDIN read time (1388888898 chars): 12.91
So a difference of 1 second over a runtime of 12 seconds. Perl still runs faster in this small test; I have no clue where the overhead is. I suspect Julia might be copying bytes out of the buffer when it converts them to a string, and Perl might not be, but that’s just a wild guess.
It does matter, because pixel27 reports that Julia is 10% slower than Perl, while you report that it needs twice the time. So there must be something special either with your system, your Perl, or your data. We cannot close the performance gap if we do not understand the exact reason.
This should be counter += sizeof(line) to do the same work as Perl’s length (in Julia, length(line) counts the number of codepoints, which is more work than counting bytes).
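For illustration, the difference between the two is easy to see on a non-ASCII string:

```julia
# length(::String) counts Unicode codepoints; sizeof counts raw UTF-8 bytes.
s = "héllo"              # 'é' occupies two bytes in UTF-8
println(length(s))       # 5 codepoints
println(sizeof(s))       # 6 bytes
```

For pure ASCII input the two agree, which is why the benchmark totals above still match; the difference is in how much work the loop does per line.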
After this fix, using a file 100m with 100 million lines (generated by listing all files with find /, several times), I get the following times on my system (Linux, i5-7200U CPU @ 2.50GHz):
$ ./a.perl < 100m
STDIN read time (7626414140 chars): 11.7074089050293
# With Julia 1.5.3:
$ julia a.jl < 100m
STDIN read time (7626414140 chars): 14.781
# With Julia 1.6rc1:
$ julia a.jl < 100m
STDIN read time (7626414140 chars): 19.342
Looks like there’s a performance regression in 1.6rc1.
I wondered if the reported difference might be due to different memory-management strategies used by Perl and Julia. Perl uses a reference-counting collector, whereas Julia IIRC uses a non-compacting generational mark-sweep collector. That could mean that Perl might sometimes free allocated memory a bit more quickly (e.g., in the same loop iteration), and thereby use a smaller memory footprint, leading to higher CPU cache utilization.
I tried to run some quick experiments to confirm this, but I can’t actually reproduce the reported problem with Julia 1.6.0-rc1 and Perl 5.30.0 on my 2012-vintage Intel Core i7-3770 Ubuntu 20.04 desktop with 16 GB RAM.
My best guess is that Julia needs to copy the bytes from the input buffer to the string. If eachline is written in Julia, then when it generates strings, the bytes need to be copied from the input buffer into the string object. If the string just referenced a “view” into the original buffer, then you could violate the immutable-string rule.
In Perl, where the implementers (probably) have the code in non-Perl, they could play tricks with the strings and have them reference the original buffers, so no extra byte copy.
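As a data point on the copying question: Julia can avoid a copy when it is allowed to steal the buffer outright. My understanding of the documented behavior is that String(v::Vector{UInt8}) takes ownership of the vector and truncates it to zero length, precisely so that no mutable reference to the string’s bytes survives. eachline cannot use that trick on a read buffer it wants to reuse, so it has to copy. A minimal sketch:

```julia
# String(::Vector{UInt8}) takes ownership of the byte vector instead of
# copying, and truncates the vector to zero length afterwards.
buf = [0x68, 0x69]        # the bytes of "hi"
s = String(buf)           # buffer is consumed, not copied
println(s)                # hi
println(length(buf))      # 0 -- the vector was truncated
```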
When I re-ran my scripts on a file with random lines from 128 to 1024 characters, I saw Perl running twice as fast as Julia. At this point I don’t really have the inclination to dig into the internal Perl or Julia code to see if I’m right, so it’s just a guess.
I’m kind of curious what the performance of C++ would be, because I think it would be under the same constraints as Julia and show similar performance. I may find the motivation to test that today.
Well, if eachline() is bad, C++'s getline is horrible.
The Julia (jtest):
#!/usr/bin/julia
using Dates

function from_stdin()
    start = now()
    counter = 0
    for line in eachline(; keep=true)
        counter += length(line)
    end
    total = Dates.value(now() - start) / 1000
    println("STDIN read time ($counter chars): $total")
end

from_stdin()
I think simeonschaub was right: if you just want to count the number of lines, you should just count \n bytes. If you want to apply some fast regex logic without allocations, you should use something like Automa.jl. I do not know how Perl works, but I guess that each time the Julia iterator produces the next chunk of data in eachline, it implicitly allocates with String(iochunk), so there is an overhead.
I mean, you can do fast calculations using raw bytes; these utilities are not so bad in Julia:
function myreadline(io)
    buf = Vector{UInt8}(undef, 1000)
    counter = 0
    while !eof(io)
        nb = readbytes!(io, buf)
        for i in 1:nb
            if buf[i] == UInt8('\n')
                counter += 1
            end
            # Here you could do something like `String(buf[prev_i:i])` and do regex magic
        end
    end
    println("Julia: Number of lines: $counter")
end

myreadline(stdin)
On my laptop timings are the following
sh> cat 1m.txt | time ./test.pl
Perl: Number of lines: 100000000
./test.pl 7.84s user 0.34s system 99% cpu 8.240 total
sh> cat 1m.txt | time julia --startup=no test.jl
Julia: Number of lines: 100000000
julia --startup=no test.jl 1.65s user 0.64s system 129% cpu 1.770 total
sh> cat 1m.txt | time wc -l
100000000
wc -l 0.93s user 0.21s system 99% cpu 1.137 total
Here 0.6s is the startup time of Julia itself, so the overhead is minimal compared to wc -l.
So, it’s a tradeoff between convenience of work and performance.
Also, Unicode utilities are rather slow, so if you can, it’s better to work with Vector{Char} (or maybe Vector{UInt32}).
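For byte-level work there is also codeunits, which gives a zero-copy byte view of a String, so you can scan its bytes without allocating an intermediate vector. A small sketch:

```julia
# codeunits returns a zero-copy, array-like view of a String's UTF-8 bytes.
s = "a,b,c\n"
cu = codeunits(s)
n = count(==(UInt8(',')), cu)   # count commas without allocating a copy
println(n)                      # 2
```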
Julia is slightly faster than Perl when the file is passed by redirection to STDIN.
Julia is significantly slower than Perl when the file is passed through a pipe to STDIN.
Edit:
I would like to add that piping to STDIN still has good use cases.
Example:
Whilst trying to search for why the pipe case would be much slower, I found this discourse topic, which led to this GH issue and ultimately this (failed) PR. Perhaps it’s worth rebooting, given stdin/stdout buffering has come up a number of times in the past year.
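While that buffering issue is open, one workaround sketch (my own assumption, not from the linked PR): slurp the pipe in one bulk read and iterate over an IOBuffer in memory, which sidesteps per-read locking and small reads on the pipe, at the cost of holding the whole input in RAM:

```julia
# Workaround sketch: read the entire stream at once, then iterate lines
# from memory. Only viable when the input fits in RAM.
function count_chars(io)
    buf = IOBuffer(read(io))          # one bulk read of the whole stream
    counter = 0
    for line in eachline(buf; keep=true)
        counter += sizeof(line)
    end
    return counter
end

println(count_chars(IOBuffer("ab\ncd\n")))   # 6
```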
import Base: iterate

struct Eachline{T <: IO}
    io::T
    buf::Vector{UInt8}
end

Eachline(io::IO, n::Integer) = Eachline(io, Vector{UInt8}(undef, n))
Eachline(io::IO) = Eachline(io, 10000)

function iterate(iter::Eachline, state = (1, 0))
    iprev, nb = state
    (eof(iter.io) & (iprev > nb)) && return nothing
    s = ""
    while true
        if iprev > nb
            nb = readbytes!(iter.io, iter.buf)
            iprev = 1
        end
        for i in iprev:nb
            if iter.buf[i] == UInt8('\n')
                if (iprev == 1) & !isempty(s)
                    s *= String(iter.buf[1:i])
                else
                    s = String(iter.buf[iprev:i])
                end
                return (s, (i + 1, nb))
            end
        end
        s *= String(iter.buf[iprev:nb])
        eof(iter.io) && return (s, (1, 0))
        iprev = nb + 1
    end
end
Then with the following char counting implementation
# iterators.jl
function main(io)
    counter = 0
    for s in Eachline(io)
        counter += sizeof(s)
    end
    println("Julia: Number of chars: $counter")
end

main(stdin)
I get the following times
sh> cat 1m.txt | time julia --startup=no iterators.jl
Julia: Number of chars: 1400000000
julia --startup=no iterators.jl 7.53s user 1.05s system 103% cpu 8.314 total
sh> cat 1m.txt | time ./test2.pl
STDIN read time (1400000000 chars): 9.1769061088562
./test2.pl 8.86s user 0.32s system 99% cpu 9.201 total
(Here test2.pl is the same as pixel27’s snippet.)
Can anyone compare it on other datasets? It looks like with this implementation Julia is faster than Perl, despite the enormous number of string allocations. Of course, an Automa.jl implementation could allocate less and as a result be much, much faster.
If you like to live dangerously, you can get a performance boost by turning off locking:
julia> function iterlines(io)
           counter = 0
           for line in eachline(io)
               counter += 1
           end
           return counter
       end
iterlines (generic function with 1 method)
julia> precompile(iterlines, (IOStream,))
true
julia> open("/tmp/textfile.txt") do io
           @time iterlines(io)
       end
0.195130 seconds (1.01 M allocations: 107.697 MiB, 12.18% gc time)
1000000
julia> open("/tmp/textfile.txt") do io
           io._dolock = false # turn off locking
           @time iterlines(io)
       end
0.086630 seconds (1.01 M allocations: 107.697 MiB, 26.46% gc time)
1000000
If you turn off locking, don’t even think about threading.
Hello @Skoffer,
Just curious: did you try to write a more efficient version of eachline() with your code?
I used your code in the following tests (testloopspeed3.jl):
$ time (cat test.log | ./testloopspeed2.pl)
STDIN read time (4437005389 chars): 30.2816960811615
real 0m30.326s
user 0m19.187s
sys 0m16.329s
$ time (cat test.log | julia --startup=no testloopspeed3.jl)
Julia: Number of chars: 4437005389
real 0m40.995s
user 0m23.752s
sys 0m17.695s
I think your version is significantly faster than eachline(), but still 25% slower than Perl in my test.
The --startup=no flag does not give a significant advantage.