Just watched David Anthoff’s awesome Introduction to the Queryverse video.
David emphasises that the Queryverse is designed to work in a “streaming” fashion, and that data is supposed to be “pulled” through the stages in a data pipeline.
With that in mind, I went and wrote:
load(f) |> @take(rows) |> collect
where f is the path to a 2.3 GB CSV file and rows == 3. This takes 93 seconds, which suggests that load is actually reading the whole file. I came up with the test script below, followed by its output:
#!julia
# ARGS=["nycTaxiTripData2013/data/trip_data_1.csv"]
f = ARGS[1]
run(`ls -lath $f`)
using Queryverse
using CSV
rows = 3
for _ in 1:2 # do it twice, as the first time will have compilation overhead
    d1, d2, d3 = nothing, nothing, nothing
    GC.gc()
    @time d1 = open(`head -n$(rows+1) $f`) do io
        load(Stream{format"CSV"}(io)) |> collect
    end;
    @time d2 = load(f) |> @take(rows) |> collect;
    @time d3 = CSV.File(f, limit=rows);
end
-rwxr-xr-x 1 derek derek 2.3G May 13 2014 trip_data_1.csv
9.201231 seconds (19.78 M allocations: 1.188 GiB, 3.02% gc time, 6.52% compilation time)
93.247378 seconds (539.58 M allocations: 69.246 GiB, 50.05% gc time, 0.08% compilation time)
12.923268 seconds (27.43 M allocations: 1.424 GiB, 20.18% gc time, 94.82% compilation time)
0.004426 seconds (3.03 k allocations: 369.062 KiB)
152.283909 seconds (526.68 M allocations: 68.517 GiB, 71.67% gc time)
0.000310 seconds (412 allocations: 31.977 KiB)
So, on the second run, CSV.File takes under 1 ms and head in the OS plus Queryverse takes 4.4 ms, but pure Queryverse (load |> @take) takes a whopping 93 s the first time and 152 s the second (which is totally unfathomable to me!).
Have I misunderstood the streaming nature of the Queryverse? Am I missing something else?
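
To make concrete what I mean by "pulled", here is a minimal sketch using only Base Julia. It is not Queryverse code and does no CSV parsing; first_rows is a hypothetical helper name, and the demo file is a small temporary stand-in for the real data. The point is only that a lazy line iterator combined with Iterators.take stops reading after the header plus the requested rows, which is the behaviour I was expecting from the pipeline.

```julia
# Sketch of pull-based reading with Base only (no Queryverse, no CSV parsing).
# `first_rows` is a hypothetical helper, not part of any package.
function first_rows(path::AbstractString, n::Integer)
    open(path) do io
        # eachline is lazy; take stops pulling after the header plus n rows,
        # so the rest of the file is never read.
        collect(Iterators.take(eachline(io), n + 1))
    end
end

# Demo on a small temporary file standing in for the real CSV.
path, io = mktemp()
for i in 0:1000
    println(io, i == 0 ? "a,b" : "$i,$(2i)")
end
close(io)

lines = first_rows(path, 3)
println(lines)
```

If load worked this way end to end, I would expect its cost to depend on rows rather than on the size of the file.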