I think it has to do with the width of the file: there are 238 columns. But I found that if I first did df = CSV.read(filename, DataFrame; limit=1), which took a couple of seconds, then df = CSV.read(filename, DataFrame; limit=100) was more or less instantaneous. So maybe you just need a dry run to compile all of the individual parsing operations.
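Concretely, the dry-run pattern looks like this (filename holds the path to the household CSV):

using CSV, DataFrames
@time df = CSV.read(filename, DataFrame; limit=1)    # pays the compile cost for the 238-column schema
@time df = CSV.read(filename, DataFrame; limit=100)  # reuses the already-compiled parsing code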
Actually, I restarted julia and tried @time df = CSV.read(filename, DataFrame; limit=100), and it took just over 17 s, so maybe that wasn’t your problem to begin with.
Ok, so here’s the “real” issue. What I’d like to do is read through this household file and pull out a sample of only the households that meet some criteria. Let’s say a 20% random sample of households with 2 or more adults and at least 1 child…
I don’t really want to read 2M rows and then throw away say 90% of them…
I had imagined that if I had an iterator over the rows, I could use filter on it and then build the dataset that way. But if it’s crazy slow… that’s not good either.
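Roughly, a sketch of the kind of thing I had pictured (untested; NP >= 3 is just standing in for the real adult/child conditions, and since CSV.Rows returns strings by default the value needs a parse):

using CSV, DataFrames
rows = CSV.Rows(filename)                 # lazy row iterator; values come back as strings
wanted = Iterators.filter(r -> parse(Int, r.NP) >= 3 && rand() < 0.2, rows)
df = DataFrame(wanted)                    # collect only the surviving rows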
Not sure if this is what you’re after, but just in case.
The following code takes <1 min to read the whole CSV file, extracting ~20% of the rows that meet the condition NP >= 3 (NP is in column 10):
using DelimitedFiles
file = raw"C:\...\psam_husa.csv"
M = Vector{Vector{Any}}(undef, 0)        # one vector of fields per kept row
open(file, "r") do io
    readline(io)                         # skip the header line
    while !eof(io)
        r = split(readline(io), ',')
        # keep roughly 1 in 5 of the rows with NP (column 10) >= 3
        if (parse(Int, r[10]) >= 3) && (rand(1:5) == 1)
            push!(M, r)
        end
    end
end
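If you then want a DataFrame instead of a vector of split rows, something along these lines should work (untested sketch; it just reuses the header line for the column names and keeps all values as strings):

using DataFrames
header = split(readline(file), ',')       # column names from the file’s first line
df = DataFrame([getindex.(M, j) for j in eachindex(header)], Symbol.(header))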
Yes, I could certainly reimplement CSV parsing myself, but that’s not the idea. The question is how to use the existing CSV-reading libraries to stream the file through a filtering operation into a DataFrame, where the filter operates at the parsed-data level (so that I can, for example, use more complex conditions based on the column names, etc.).
It seems CSV.Rows is designed exactly for this, and also the Query / Queryverse libraries have similar things, but they all seem to bork on this file.
I’m just letting it run… we’ll see. It’s 12:17 pm right now; if it hasn’t finished by 12:27 I’ll kill it, otherwise I’ll report how long it took.
EDIT: it took 153 seconds and produced 136k rows, so I guess whatever was causing my earlier problem has subsided… perhaps it was some issue mitigated by my computer sleeping and waking back up, or something like that. I’ll see how it does if I give it data types for the columns.
I think you can also try CSV.File instead of CSV.read, which uses Mmap to memory-map the file instead of trying to load it all into memory at once. Perhaps that has something to do with it: maybe you’re just low on memory and swapping kills all performance?
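For example, something like this (untested sketch, borrowing the NP >= 3 condition from the snippet above) keeps the same filter-then-collect idea but parses via CSV.File:

using CSV, DataFrames
f = CSV.File("psam_husa.csv")                                   # parses from a memory-mapped buffer
df = Iterators.filter(r -> r.NP >= 3 && rand() < 0.2, f) |> DataFrame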
Interesting thought. One thing I didn’t mention is that this file lives in my home directory, which is NFSv4-mounted from a NAS. I don’t know how that interacts with mmap, but the CSV.Rows interface shouldn’t have to read the whole file into memory, right?
Ah, I’m not familiar with CSV.Rows; it seems to be new-ish. Inferring from the docstring, it should be comparable to, if not faster than, CSV.File. Perhaps there’s a lot of network traffic happening because the file is on NFS, and reading row by row requires many more requests over the network than expected?
If all else fails, I guess @quinnj will take a look at it - it should be debuggable thanks to the file you provided, but as suspected above, the wide format may be a part of the problem…
using CSV, DataFrames
df1 = CSV.read("psam_husa.csv", DataFrame; limit=1)     # one row, just to infer the column types
thetypes = [typeof(df1[1, i]) for i in 1:size(df1, 2)]
df = Iterators.filter(x -> rand() < 0.2 && x[:NP] >= 3, CSV.Rows("psam_husa.csv"; types=thetypes)) |> DataFrame
That failed when it got to some rows that didn’t have correctly detected types. I’m manually setting a few of the types and continuing… will see what happens.
Ok this worked!
df1 = CSV.read("psam_husa.csv", DataFrame; limit=1)
# force the first two columns to String; take the remaining types from the first row
thetypes = vcat([String, String], [typeof(df1[1, i]) for i in 3:size(df1, 2)])
df = Iterators.filter(x -> rand() < 0.2 && x[:NP] >= 3, CSV.Rows("psam_husa.csv"; types=thetypes)) |> DataFrame
Took 118 seconds and produced 136k rows… so I guess that’s the solution.
Ah, I suspect this could be problematic: maybe piping into DataFrame forces it all to be read into memory, negating the benefit of having a streamable interface. Note the docs of CSV.Rows:
The returned CSV.Rows object supports the Tables.jl interface and can iterate rows.
So you may not necessarily need to use a proper DataFrame. That would also be consistent with CSV.Rows not taking a sink type, as is required for CSV.read.
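For instance, something like this (untested sketch, reusing the thetypes vector from your snippet) streams over the rows and never materializes a DataFrame at all:

using CSV
kept_np = Int[]
for r in CSV.Rows("psam_husa.csv"; types=thetypes)
    if r.NP >= 3 && rand() < 0.2          # ~20% sample of the larger households
        push!(kept_np, r.NP)              # keep only the values you actually need
    end
end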
I’m out of my depth here, though, as I haven’t kept up with the various goings-on, 1.0s, and releases of either CSV or DataFrames in the past couple of months; I only know that things have been moving.
118 seconds for something like this sounds way, way too high. Is it 118 seconds for all 3 lines of code to run, or just the filter + CSV.Rows + DataFrame (3rd line)?
Couple of thoughts:
- There might be some kind of performance bug in CSV.Rows right now? I haven’t checked its performance in a while, so it’s possible something has crept in to make it slow.
- There’s TableOperations.filter, which will do the row filtering lazily as the DataFrame is built; that might be more efficient than Iterators.filter (see the sketch after this list).
- I started working a while ago on the ability to filter rows while parsing; it sounds like this would be a really good use case for that. If so, feel free to comment on the PR and maybe I can find time soon to work on it again.
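As a rough sketch of the TableOperations.filter route (untested on this file; it reuses the NP >= 3 condition from earlier in the thread):

using CSV, DataFrames, TableOperations
flt = TableOperations.filter(r -> r.NP >= 3 && rand() < 0.2, CSV.File("psam_husa.csv"))
df = DataFrame(flt)        # only the matching rows get materialized into the DataFrame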
Just that. The file is rather large… millions of lines and hundreds of columns. Also, it’s stored on a GlusterFS server and accessed over an NFSv4 mount; those may affect speed.