Is this an efficient way to read a .csv file row by row?


#1

I have a .csv file and I’d like to read it row by row. The following is part of the code. Do you have better suggestions other than this? I’ve googled this, but the answers I found are quite old.

```
using CSV, DataFrames

df = CSV.read("the path to my .csv file", DataFrame)
ambulances = Vector{Ambulance}(undef, nrow(df))
for i in 1:nrow(df)
    # do something with each element of the i-th row;
    # use df[i, 1], df[i, 2], ..., df[i, n] to access them
end
```

#2

See the CSV.jl docs. You can create a CSV.File object, which you can iterate over row by row.
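A minimal sketch of that approach, assuming a file called data.csv with columns named id and location (both placeholders for your own file and column names):

```julia
using CSV

# CSV.File yields one row at a time; rows support property access
# by column name, so no DataFrame has to be materialized first.
for row in CSV.File("data.csv")
    # e.g. row.id, row.location — use your own column names here
    println(row)
end
```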


#3

@ExpandingMan, thank you for the quick reply. Is using a CSV.File object quicker?


#4

Well, probably yes, though in general it wouldn’t need to be.

When you create a CSV.File object and iterate over it you will be lazily iterating over the rows, so in other words you’ll only be reading them in as you iterate. You therefore will not have to allocate memory to first read badly formatted data into a DataFrame, then fix the format. Instead, you can fix the format of each row as you go along (or whatever it is you’re doing).


#5

@ExpandingMan, is there a way to get the total number of rows? Probably not, right? The rows are iterated over one by one.


#6

Right, that’s one of the things that’s so terrible about a CSV: there’s no way of knowing the number of rows until you read the whole thing. That said, reading through the whole file just to count the rows can be done more quickly than copying it all into memory. Of course, on the command line you could do wc -l… I’m not sure if CSV.File gives a nice way of doing this.
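One cheap way to do that count from Julia itself is Base’s countlines, which streams through the file without parsing it. Note the caveat: this assumes one record per line, so it over- or under-counts if quoted fields contain embedded newlines, and you subtract 1 for a header row. A sketch (data.csv is a placeholder):

```julia
# Count data rows without parsing or loading the CSV.
# Assumes a single header line and no embedded newlines in quoted fields.
nrows = countlines("data.csv") - 1
```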


#7

@ExpandingMan, are there better ways to avoid this? Other file-handling techniques, other file formats, etc.?


#8

@ExpandingMan, maybe the data (the .csv file) can be saved column-wise instead of row-wise?


#9

I’m a maintainer of Feather.jl. That format definitely has its own problems, but if you are just looking for fast easy storage of tabular data it nevertheless is a pretty good option. I use Feather quite frequently.

There is of course also Parquet.jl but Parquet is a more complicated format intended for really huge datasets.
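For a sense of the round trip, here is a hedged sketch with Feather.jl; the file name and toy columns are made up, and the exact reading function may vary between Feather.jl versions:

```julia
using DataFrames, Feather

# Write a small table to Feather's columnar binary format...
df = DataFrame(a = 1:3, b = ["x", "y", "z"])
Feather.write("data.feather", df)

# ...and read it back. Because the format is columnar and binary,
# this avoids the row-by-row text parsing a CSV requires.
df2 = Feather.read("data.feather")
```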


#10

I agree with @ExpandingMan that Feather.jl is a better overall file format for tabular data.

Do note that you can call f = CSV.File(file); length(f) to get the number of rows in a csv file. This is because the CSV.File constructor scans the entire file to determine the number of rows upfront. If you happen to know the number of rows before parsing, you can also pass CSV.File(file; limit=number_of_rows) and it will speed up the initial file scan a bit, since it knows exactly how many rows to expect. It’s also obviously useful for cases when you only want to read a specific set of rows from a file, in conjunction with the skipto argument, which is like an offset into the file you want to start reading from.
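Putting those keywords together, a sketch (data.csv is a placeholder; the arithmetic assumes the header is on line 1, so data row 101 sits on file line 102):

```julia
using CSV

f = CSV.File("data.csv")
n = length(f)          # row count, determined during the initial scan

# Read only data rows 101–200: skipto is the file line to start
# parsing at, limit caps how many rows are read.
subset = CSV.File("data.csv"; skipto = 102, limit = 100)
```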