Reading a few rows from a BIG CSV file

I’m reading a Census Bureau CSV file that’s 2.2M rows long.

I’d like to just read the first hundred rows to check my stuff works…

df = Iterators.take(CSV.Rows("filename.csv"),100) |> DataFrame

This doesn’t terminate in any reasonable amount of time (it runs for minutes). I would imagine it should take milliseconds to a second.

What am I doing wrong?

Can you share the file causing the issue? Or point me to where you’re getting the data so I can try to reproduce the issue?

Have you tried

df = CSV.read(filename, DataFrame; limit=100)

?


https://www2.census.gov/programs-surveys/acs/data/pums/2019/5-Year/csv_hus.zip

Unzip it and you will find various files; I’m reading the husa one.

I’ll give that a try!

I just tried this, I killed it after about 1 minute.

The full file name is psam_husa.csv.

head -100 psam_husa.csv is “instantaneous”, so it’s not some kind of weird filesystem issue.

I think it has to do with the width of it. There are 238 columns. But I found if I first did df = CSV.read(filename, DataFrame; limit=1), that took a couple of seconds, and then df = CSV.read(filename, DataFrame; limit=100) was more or less instantaneous. So maybe you just need a dry run to compile all of the individual parsing operations.
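Concretely, the pattern I mean is just this (a sketch; timings will vary, and it assumes the file is in the working directory):

using CSV, DataFrames

# warm-up run: triggers compilation of the parsing machinery for all the columns
@time df1 = CSV.read("psam_husa.csv", DataFrame; limit=1)

# should now be fast, if compilation was indeed the bottleneck
@time df = CSV.read("psam_husa.csv", DataFrame; limit=100)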

Actually, I restarted Julia and tried @time df = CSV.read(filename, DataFrame; limit=100), and it took just over 17 s, so maybe that wasn’t your problem to begin with.


Why would compilation take longer with 100 lines than with 1 line?

If I have it do 1 line and then 100, it’s also fast for me. I’ll restart and see what’s up.

Yes, when I restart, doing 1 line and then 100 lines is “fast” (i.e. like the 10-15 seconds you mention).

And now, when I restart, doing 100 lines by itself is also similarly fast… no longer several minutes, now it’s 10-15 seconds.

WTH?

I’m going to try the original version as well now.

The Iterators.take version is definitely taking more than a minute.

Even doing 10,000 lines now with the limit= version executes in 4 seconds from a fresh start. So yeah, I don’t understand what was wrong.


Curious to see if anyone knows more; I sure don’t. Very interesting.

Ok, so here’s the “real” issue. What I’d like to do is read through this household file and pull out a sample of only the households that meet some criteria. Let’s say a 20% random sample of households with 2 or more adults and at least 1 child…

I don’t really want to read 2M rows and then throw away, say, 90% of them…

I had imagined that if I had an iterator over the rows, I could use filter on it and then build the dataset that way. But if it’s crazy slow… that’s not good either.

Not sure if this is what was requested, but just in case.
The following code takes <1 min to read the whole CSV file, extracting ~20% of the rows meeting the condition NP >= 3 (NP is column 10):

file = raw"C:\...\psam_husa.csv"
M = Vector{Vector{Any}}(undef, 0)       # holds the kept rows (each row as a vector of field strings)
open(file, "r") do io
   readline(io)                         # skip the header line
   while !eof(io)
      r = split(readline(io), ',')
      # keep ~20% of the rows with NP >= 3 (NP is the 10th column)
      if (parse(Int, r[10]) >= 3) && (rand(1:5) == 1)
         push!(M, r)
      end
   end
end

Yes, I can definitely reimplement reading CSV files, but that’s not the idea. The question is how to use existing CSV-reading libraries to stream the file through a filtering operation into a DataFrame, where the filter operates at the parsed-data level (so that I can, for example, use more complex conditions based on the column names, etc.).

It seems CSV.Rows is designed exactly for this, and also the Query / Queryverse libraries have similar things, but they all seem to bork on this file.

I’m trying the following:

df = Iterators.filter(x -> rand() < 0.2 && parse(Int, x[:NP]) >= 3, CSV.Rows("psam_husa.csv")) |> DataFrame

I’m just letting it run… we’ll see. It’s 12:17 pm right now; if it hasn’t finished by 12:27 I’ll kill it, otherwise I’ll see how long it took.

EDIT: it took 153 seconds, and produced 136k rows, so I guess whatever was causing my earlier problem has subsided… perhaps it was some issue mitigated by my computer sleeping and waking back up or something like that. I’ll see how it does if I give it data types for the columns.


I think you can also try CSV.File instead of CSV.read, which uses Mmap to memory-map the file instead of trying to load it all into memory. Perhaps that has something to do with it; maybe you’re just low on memory and swapping kills all performance?
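For example (just a sketch, assuming the same file name as above):

using CSV, DataFrames

f = CSV.File("psam_husa.csv"; limit=100)   # parses only the first 100 rows
df = DataFrame(f)                          # materialize into a DataFrame only if you need one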

Interesting thought. One thing I didn’t mention is that this file is in my home directory, which is NFSv4-mounted from a NAS. I don’t know how that’ll interact with mmap, but the CSV.Rows interface shouldn’t have to read the whole file into memory, right?

Ah, I’m not familiar with CSV.Rows - it seems that’s new-ish. Inferring from the docstring, it should be as fast as, if not faster than, CSV.File :thinking: Perhaps there’s a lot of network traffic happening due to the file being on NFS, with “reading by row” requiring a lot more queries over your network than expected?

If all else fails, I guess @quinnj will take a look at it - it should be debuggable thanks to the file you provided, but as suspected above, the wide format may be a part of the problem…


Ok, I’m trying this:


df1 = CSV.read("psam_husa.csv", DataFrame, limit=1)
thetypes = [typeof(df1[1, i]) for i in 1:size(df1, 2)]

df = Iterators.filter(x -> rand() < 0.2 && x[:NP] >= 3, CSV.Rows("psam_husa.csv"; types=thetypes)) |> DataFrame

That failed when it got to some rows that didn’t have correctly detected types. I’m manually setting a few of the types and continuing… I’ll see what happens.
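(Concretely, something like patching the guessed types by hand; exactly which columns need it is specific to this file, and for me it turned out to be the first two:)

# override the columns whose type was mis-detected from the first row
# (their early values apparently mislead the guess)
thetypes[1] = String
thetypes[2] = String
# types can also be passed as a Dict of column name => type, if naming the columns is more convenient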

Ok this worked!


df1 = CSV.read("psam_husa.csv", DataFrame, limit=1)                              # probe the column types from the first data row
thetypes = vcat([String, String], [typeof(df1[1, i]) for i in 3:size(df1, 2)])   # but force the first two columns to String

df = Iterators.filter(x -> rand() < 0.2 && x[:NP] >= 3, CSV.Rows("psam_husa.csv"; types=thetypes)) |> DataFrame

Took 118 seconds and produced 136k rows… so I guess that’s the solution.


Ah, I suspect this could be problematic - maybe piping into DataFrame forces it all to be read into memory, negating the benefit of having a streamable interface :thinking: Note the docs of CSV.Rows:

The returned CSV.Rows object supports the Tables.jl interface and can iterate rows.

So you may not necessarily need to use a proper DataFrame. That would also be consistent with CSV.Rows not requiring a sink type, as CSV.read does.
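E.g., something along these lines (just a sketch, reusing thetypes from your snippet and assuming you only actually need a few columns, say SERIALNO and NP; the column names here are only for illustration):

using CSV

# stream over the rows and keep only the column values you need,
# without ever materializing the full table
serials = String[]
nps = Int[]
for row in CSV.Rows("psam_husa.csv"; types=thetypes)
    if rand() < 0.2 && row.NP >= 3
        push!(serials, row.SERIALNO)
        push!(nps, row.NP)
    end
end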

I’m out of my depth here though, as I haven’t kept up with the various goings-on, 1.0 releases, and changes of either CSV or DataFrames in the past couple of months; I only know that things have been moving :sweat_smile:


Did anyone try CSV.Chunks? It seems similar to CSV.Rows, but instead of going row by row you can decide how much is read at a time.
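I’d guess the usage is roughly like this (a sketch, not tested on this particular file; ntasks controls how many chunks the file is split into):

using CSV, DataFrames

parts = DataFrame[]
for chunk in CSV.Chunks("psam_husa.csv"; ntasks=20)    # each chunk behaves like a small CSV.File
    part = DataFrame(chunk)
    filter!(r -> rand() < 0.2 && r.NP >= 3, part)      # keep only the sampled qualifying rows
    push!(parts, part)
end
df = reduce(vcat, parts)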


118 seconds for something like this sounds way way high. Is it 118 seconds for all 3 lines of code to run? Or just the filter + CSV.Rows + DataFrame (3rd line)?

Couple of thoughts:

  • There might be some kind of performance bug in CSV.Rows right now? I haven’t checked its performance in a while, so it’s possible something has crept in to make it slow
  • There’s TableOperations.filter that will do row filtering lazily as the DataFrame is built; might be more efficient than Iterators.filter (see the sketch after this list)
  • I started working a while ago on the ability to filter rows while parsing; it sounds like this would be a really good use-case for that. If so, feel free to comment on the PR and maybe I can find time soon to work on it again.
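For the TableOperations.filter route, I’d expect something along these lines to work (a sketch, not benchmarked here):

using CSV, DataFrames, TableOperations

# rows are filtered lazily as the DataFrame sink consumes them,
# so rejected rows are never collected
df = TableOperations.filter(r -> rand() < 0.2 && r.NP >= 3, CSV.File("psam_husa.csv")) |> DataFrame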

Just that. The file is rather large… millions of lines and hundreds of columns. Also, it’s stored on a GlusterFS server and accessed via an NFSv4 mount. Those may affect speed.