Reading a few rows from a BIG CSV file

I’m reading a Census Bureau CSV file that’s 2.2M rows long.

I’d like to just read the first hundred rows to check my stuff works…

df = Iterators.take(CSV.Rows("filename.csv"),100) |> DataFrame

This doesn’t terminate in any reasonable amount of time (it runs for minutes). I would imagine it should take milliseconds to a second.

What am I doing wrong?

Can you share the file causing the issue? Or point me to where you’re getting the data so I can try to reproduce the issue?

Have you tried

df = CSV.read(filename, DataFrame; limit=100)

?


https://www2.census.gov/programs-surveys/acs/data/pums/2019/5-Year/csv_hus.zip

Unzip it and you will find various files; I’m reading the husa one.

I’ll give that a try!

I just tried this, I killed it after about 1 minute.

The full file name is psam_husa.csv.

head -100 psam_husa.csv is “instantaneous”, so it’s not some kind of weird filesystem issue.

I think it has to do with the width of it. There are 238 columns. But I found if I first did df = CSV.read(filename, DataFrame; limit=1), that took a couple of seconds, and then df = CSV.read(filename, DataFrame; limit=100) was more or less instantaneous. So maybe you just need a dry run to compile all of the individual parsing operations.
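Concretely, the pattern I mean is just this (a sketch; timings will vary, and it assumes the file is in the working directory):

using CSV, DataFrames

# warm-up run: triggers compilation of the parsing machinery for all the columns
@time df1 = CSV.read("psam_husa.csv", DataFrame; limit=1)

# should now be fast, if compilation was indeed the bottleneck
@time df = CSV.read("psam_husa.csv", DataFrame; limit=100)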

Actually, I restarted Julia and tried @time df = CSV.read(filename, DataFrame; limit=100), and it took just over 17 s, so maybe that wasn’t your problem to begin with.


Why would compilation take longer with 100 lines than with 1 line?

If I have it do 1 line and then 100, it’s also fast for me. I’ll restart and see what’s up.

Yes, when I restart, doing 1 line and then 100 lines is “fast” (i.e. like the 10-15 seconds you mention).

And now, when I restart, doing 100 lines by itself is also similarly fast… no longer several minutes, now it’s 10-15 seconds.

WTH?

I’m going to try the original version as well now.

The Iterators.take version is definitely taking more than a minute.

Even doing 10,000 lines now with the limit= version executes in 4 seconds from a fresh start. So yeah, I don’t understand what was wrong.


Curious to see if anyone knows more; I sure don’t. Very interesting.

Ok, so here’s the “real” issue. What I’d like to do is read through this household file and pull out a sample of only the households that meet some criteria. Let’s say a 20% random sample of households with 2 or more adults and at least 1 child…

I don’t really want to read 2M rows and then throw away, say, 90% of them…

I had imagined that if I had an iterator over the rows, I could use filter on it and then build the dataset that way. But if it’s crazy slow… that’s not good either.

Not sure if this is what was requested, but just in case.
The following code takes <1 min to read the whole CSV file, extracting ~20% of the rows meeting the condition NP >= 3 (NP is column 10):

file = raw"C:\...\psam_husa.csv"
M = Vector{Vector{Any}}(undef, 0)       # holds the kept rows (each row as a vector of field strings)
open(file, "r") do io
   readline(io)                         # skip the header line
   while !eof(io)
      r = split(readline(io), ',')
      # keep ~20% of the rows with NP >= 3 (NP is the 10th column)
      if (parse(Int, r[10]) >= 3) && (rand(1:5) == 1)
         push!(M, r)
      end
   end
end

Yes, I can definitely reimplement reading CSV files, but that’s not the idea. The question is how to use existing CSV-reading libraries to stream the file through a filtering operation into a DataFrame, where the filter operates at the parsed-data level (so that I can, for example, use more complex conditions based on the column names, etc.).

It seems CSV.Rows is designed exactly for this, and also the Query / Queryverse libraries have similar things, but they all seem to bork on this file.

I’m trying the following:

df = Iterators.filter(x -> rand() < 0.2 && parse(Int, x[:NP]) >= 3, CSV.Rows("psam_husa.csv")) |> DataFrame

I’m just letting it run… we’ll see. It’s 12:17 pm right now; if it hasn’t finished by 12:27 I’ll kill it, otherwise I’ll see how long it took.

EDIT: it took 153 seconds, and produced 136k rows, so I guess whatever was causing my earlier problem has subsided… perhaps it was some issue mitigated by my computer sleeping and waking back up or something like that. I’ll see how it does if I give it data types for the columns.


I think you can also try CSV.File instead of CSV.read, which uses Mmap to memory-map the file instead of trying to load it all into memory. Perhaps that has something to do with it; maybe you’re just low on memory and swapping kills all performance?
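For example (just a sketch, assuming the same file name as above):

using CSV, DataFrames

f = CSV.File("psam_husa.csv"; limit=100)   # parses only the first 100 rows
df = DataFrame(f)                          # materialize into a DataFrame only if you need one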

Interesting thought. One thing I didn’t mention is that this file is in my home directory, which is NFSv4-mounted from a NAS. I don’t know how that’ll interact with mmap, but the CSV.Rows interface shouldn’t have to read the whole file into memory, right?

Ah, I’m not familiar with CSV.Rows - it seems that’s new-ish. Inferring from the docstring, it should be as fast as, if not faster than, CSV.File :thinking: Perhaps there’s a lot of network traffic happening due to the file being on NFS, with “reading by row” requiring a lot more queries over your network than expected?

If all else fails, I guess @quinnj will take a look at it - it should be debuggable thanks to the file you provided, but as suspected above, the wide format may be a part of the problem…


Ok, I’m trying this:


df1 = CSV.read("psam_husa.csv", DataFrame, limit=1)
thetypes = [typeof(df1[1, i]) for i in 1:size(df1, 2)]

df = Iterators.filter(x -> rand() < 0.2 && x[:NP] >= 3, CSV.Rows("psam_husa.csv"; types=thetypes)) |> DataFrame

That failed when it got to some rows that didn’t have correctly detected types. I’m manually setting a few of the types and continuing… I’ll see what happens.
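(Concretely, something like patching the guessed types by hand; exactly which columns need it is specific to this file, and for me it turned out to be the first two:)

# override the columns whose type was mis-detected from the first row
# (their early values apparently mislead the guess)
thetypes[1] = String
thetypes[2] = String
# types can also be passed as a Dict of column name => type, if naming the columns is more convenient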

Ok this worked!


df1 = CSV.read("psam_husa.csv", DataFrame, limit=1)                              # probe the column types from the first data row
thetypes = vcat([String, String], [typeof(df1[1, i]) for i in 3:size(df1, 2)])   # but force the first two columns to String

df = Iterators.filter(x -> rand() < 0.2 && x[:NP] >= 3, CSV.Rows("psam_husa.csv"; types=thetypes)) |> DataFrame

Took 118 seconds and produced 136k rows… so I guess that’s the solution.


Ah, I suspect this could be problematic - maybe piping into DataFrame forces it all to be read into memory, negating the benefit of having a streamable interface :thinking: Note the docs of CSV.Rows:

The returned CSV.Rows object supports the Tables.jl interface and can iterate rows.

So you may not necessarily need to use a proper DataFrame. That would also be consistent with CSV.Rows not requiring a sink type, as CSV.read does.
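E.g., something along these lines (just a sketch, reusing thetypes from your snippet and assuming you only actually need a few columns, say SERIALNO and NP; the column names here are only for illustration):

using CSV

# stream over the rows and keep only the column values you need,
# without ever materializing the full table
serials = String[]
nps = Int[]
for row in CSV.Rows("psam_husa.csv"; types=thetypes)
    if rand() < 0.2 && row.NP >= 3
        push!(serials, row.SERIALNO)
        push!(nps, row.NP)
    end
end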

I’m out of my depth here though, as I haven’t kept up with the various goings-on, 1.0 releases, and changes of either CSV or DataFrames in the past couple of months; I only know that things have been moving :sweat_smile:


Did anyone try CSV.Chunks? It seems similar to CSV.Rows, but instead of going row by row you can decide how much is read at a time.
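I’d guess the usage is roughly like this (a sketch, not tested on this particular file; ntasks controls how many chunks the file is split into):

using CSV, DataFrames

parts = DataFrame[]
for chunk in CSV.Chunks("psam_husa.csv"; ntasks=20)    # each chunk behaves like a small CSV.File
    part = DataFrame(chunk)
    filter!(r -> rand() < 0.2 && r.NP >= 3, part)      # keep only the sampled qualifying rows
    push!(parts, part)
end
df = reduce(vcat, parts)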


118 seconds for something like this sounds way way high. Is it 118 seconds for all 3 lines of code to run? Or just the filter + CSV.Rows + DataFrame (3rd line)?

Couple of thoughts:

  • There might be some kind of performance bug in CSV.Rows right now? I haven’t checked its performance in a while, so it’s possible something has crept in to make it slow
  • There’s TableOperations.filter that will do row filtering lazily as the DataFrame is built; might be more efficient than Iterators.filter (see the sketch after this list)
  • I started working a while ago on the ability to filter rows while parsing; it sounds like this would be a really good use-case for that. If so, feel free to comment on the PR and maybe I can find time soon to work on it again.
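For the TableOperations.filter route, I’d expect something along these lines to work (a sketch, not benchmarked here):

using CSV, DataFrames, TableOperations

# rows are filtered lazily as the DataFrame sink consumes them,
# so rejected rows are never collected
df = TableOperations.filter(r -> rand() < 0.2 && r.NP >= 3, CSV.File("psam_husa.csv")) |> DataFrame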

Just that. The file is rather large… millions of lines and hundreds of columns. Also, it’s stored on a GlusterFS server and accessed via an NFSv4 mount. Those may affect speed.