Queryverse `load(f) |> @take(3)` appears to read the whole file

Just watched David Anthoff’s awesome Introduction to the Queryverse video.

David emphasises that the Queryverse is designed to work in a “streaming” fashion, and that data is supposed to be “pulled” through the stages in a data pipeline.

With that in mind, I went and wrote:

load(f) |> @take(rows) |> collect

where f is the path to a 2.3 GB CSV file and rows == 3. This takes 93 seconds, which strongly suggests that load is actually reading the whole file. I came up with the test script below, followed by its output:

#!/usr/bin/env julia
# ARGS=["nycTaxiTripData2013/data/trip_data_1.csv"]
f = ARGS[1]
run(`ls -lath $f`)
using Queryverse 
using CSV

rows = 3
for _ in 1:2 # do it twice, as the first time will have compilation overhead
    d1, d2, d3 = nothing, nothing, nothing
    GC.gc()
    @time d1 = open(`head -n$(rows+1) $f`) do io
        load(Stream{format"CSV"}(io)) |> collect
    end;
    @time d2 = load(f) |> @take(rows) |> collect;
    @time d3 = CSV.File(f, limit=rows);
end
-rwxr-xr-x 1 derek derek 2.3G May 13  2014 trip_data_1.csv
  9.201231 seconds (19.78 M allocations: 1.188 GiB, 3.02% gc time, 6.52% compilation time)
 93.247378 seconds (539.58 M allocations: 69.246 GiB, 50.05% gc time, 0.08% compilation time)
 12.923268 seconds (27.43 M allocations: 1.424 GiB, 20.18% gc time, 94.82% compilation time)

  0.004426 seconds (3.03 k allocations: 369.062 KiB)
152.283909 seconds (526.68 M allocations: 68.517 GiB, 71.67% gc time)
  0.000310 seconds (412 allocations: 31.977 KiB)

So, on the second run, I can see that CSV.File takes <1ms, using head in the OS plus Queryverse takes 4.4ms, but pure Queryverse (load |> @take) takes a whopping 93s the first time and 152s the second (which is totally unfathomable to me!).

Have I misunderstood the streaming nature of the Queryverse? Am I missing something else?

Sorry for the slow/lack of responses here. Your understanding is correct, though. https://github.com/queryverse/CSVFiles.jl doesn’t support streaming rows as far as I understand (maybe @davidanthoff can confirm). Using CSV.File is very fast, as you’ve noted, but is also parsing the entire file before rows are sent to the next data processing operation.

In the CSV.jl package, you can use CSV.Rows to get a truly streaming solution. Except for the case where the input is a compressed file (I’m working on a fix for that soon: https://github.com/JuliaData/CSV.jl/issues/733), it parses the file row by row, and if you pass CSV.Rows(file; reusebuffer=true), it allocates only a single buffer that values are parsed into; i.e. each row is only valid while it is the currently iterated element.

To hook into Queryverse data operations, you can just do CSV.Rows(file) |> Tables.datavaluerows |> @take(rows). The Tables.datavaluerows function ensures that the input CSV rows are converted to NamedTuples with DataValue used instead of missing for missing values, as Queryverse requires.
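Putting that together, a minimal sketch (reusing the file path and row count from your post; adjust as needed):

using CSV, Tables, Queryverse

f = "nycTaxiTripData2013/data/trip_data_1.csv"  # path from the original post
rows = 3

# CSV.Rows iterates the file lazily, so only the first `rows` rows get parsed.
# Tables.datavaluerows converts each row to a NamedTuple that uses DataValue
# for missing values, which is what Query.jl operators expect.
d = CSV.Rows(f) |> Tables.datavaluerows |> @take(rows) |> collect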

Hope that helps!


Yep, that is correct: CSVFiles.jl always reads the entire file into memory first. @quinnj’s solution is the way to go if you want to stream!

@quinnj I’m actually wondering whether we could even get rid of the need for the manual Tables.datavaluerows call when one streams rows with CSV.Rows? I think if we just added

IteratorInterfaceExtensions.getiterator(x::Rows) = Tables.datavaluerows(x)
IteratorInterfaceExtensions.isiterable(::Rows) = true
TableTraits.isiterabletable(::Rows) = true

to CSV.jl (i.e. pretty much identical to what we do in DataFrames.jl at the moment), it should just work without the need for that extra call? And I don’t think that would pull in any new dependency, right?
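For concreteness, with those three definitions in place, the streaming pipeline above should collapse to something like this (hypothetical, assuming the methods are actually added to CSV.jl):

# Hypothetical: assumes the three method definitions above have been added to
# CSV.jl, so getiterator applies Tables.datavaluerows automatically.
using CSV, Queryverse

d = CSV.Rows("trip_data_1.csv") |> @take(3) |> collect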