Is this an efficient way to read a .csv file row by row?


#1

I have a .csv file and I’d like to read it row by row. The following is part of the code. Do you have better suggestions other than this? I’ve googled this, but the answers I found are quite old.

```
using CSV, DataFrames

df = CSV.read("the path to my .csv file", DataFrame)
ambulances = Vector{Ambulance}(undef, nrow(df))
for i in 1:nrow(df)
    # do something with each element of the i-th row;
    # use df[i, 1], df[i, 2], ..., df[i, n] to access them
end
```

#2

See the CSV.jl docs. You can create a CSV.File object, which you can iterate over row by row.
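A minimal sketch of that approach, assuming a file called data.csv with columns named id and location (both placeholders for your own file and column names):

```julia
using CSV

# CSV.File yields one row at a time; rows support property access
# by column name, so no DataFrame has to be materialized first.
for row in CSV.File("data.csv")
    # e.g. row.id, row.location — use your own column names here
    println(row)
end
```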


#3

@ExpandingMan, thank you for the quick reply. Is using a CSV.File object quicker?


#4

Well, probably yes, though in general it wouldn’t need to be.

When you create a CSV.File object and iterate over it you will be lazily iterating over the rows, so in other words you’ll only be reading them in as you iterate. You therefore will not have to allocate memory to first read badly formatted data into a DataFrame, then fix the format. Instead, you can fix the format of each row as you go along (or whatever it is you’re doing).


#5

@ExpandingMan, is there a way to get the total number of rows? Probably not, right? The rows are iterated over one by one.


#6

Right, that’s one of the things that’s so terrible about a CSV: there’s no way of knowing the number of rows until you read the whole thing. That said, reading through the whole file just to count the rows can be done more quickly than copying it all into memory. Of course, on the command line you could do wc -l… I’m not sure if CSV.File gives a nice way of doing this.
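One cheap way to do that count from Julia itself is Base’s countlines, which streams through the file without parsing it. Note the caveat: this assumes one record per line, so it over- or under-counts if quoted fields contain embedded newlines, and you subtract 1 for a header row. A sketch (data.csv is a placeholder):

```julia
# Count data rows without parsing or loading the CSV.
# Assumes a single header line and no embedded newlines in quoted fields.
nrows = countlines("data.csv") - 1
```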


#7

@ExpandingMan, are there better ways to avoid this? Other file-handling techniques, other file formats, etc.?


#8

@ExpandingMan, maybe the data (the .csv file) can be saved column-wise instead of row-wise?


#9

I’m a maintainer of Feather.jl. That format definitely has its own problems, but if you are just looking for fast easy storage of tabular data it nevertheless is a pretty good option. I use Feather quite frequently.

There is of course also Parquet.jl but Parquet is a more complicated format intended for really huge datasets.
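For a sense of the round trip, here is a hedged sketch with Feather.jl; the file name and toy columns are made up, and the exact reading function may vary between Feather.jl versions:

```julia
using DataFrames, Feather

# Write a small table to Feather's columnar binary format...
df = DataFrame(a = 1:3, b = ["x", "y", "z"])
Feather.write("data.feather", df)

# ...and read it back. Because the format is columnar and binary,
# this avoids the row-by-row text parsing a CSV requires.
df2 = Feather.read("data.feather")
```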


#10

I agree with @ExpandingMan that Feather.jl is a better overall file format for tabular data.

Do note that you can call f = CSV.File(file); length(f) to get the number of rows in a csv file. This is because the CSV.File constructor scans the entire file to determine the number of rows upfront. If you happen to know the number of rows before parsing, you can also pass CSV.File(file; limit=number_of_rows) and it will speed up the initial file scan a bit, since it knows exactly how many rows to expect. It’s also obviously useful for cases when you only want to read a specific set of rows from a file, in conjunction with the skipto argument, which is like an offset into the file you want to start reading from.
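Putting those keywords together, a sketch (data.csv is a placeholder; the arithmetic assumes the header is on line 1, so data row 101 sits on file line 102):

```julia
using CSV

f = CSV.File("data.csv")
n = length(f)          # row count, determined during the initial scan

# Read only data rows 101–200: skipto is the file line to start
# parsing at, limit caps how many rows are read.
subset = CSV.File("data.csv"; skipto = 102, limit = 100)
```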