Handle large csv file using `enumerate(CSV.File())` or `CSV.read()`?

I need to read a large .CSV file and parse each line into the attributes of a self-defined type. Before handling the CSV file, I was thinking of creating a vector of a fixed length, namely the number of rows in the CSV file. Based on past experience, I thought that fixing the vector length up front would be more efficient in terms of assigning vector elements. So, is there any way to find out the number of rows of the .CSV file without reading the whole file? Probably not, right? How do you design your code to avoid this?

I would just push! rows of a known type, probably a NamedTuple or a struct, to a Vector. The implementation of push! is very efficient (it over-allocates, so appends are amortized constant time) and is almost surely better than reading the file twice (which is the only way to establish the number of rows).
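For concreteness, here is a minimal sketch of that approach. It assumes a hypothetical file `data.csv` with columns `id`, `name`, and `score`, and `Record` is a stand-in for the self-defined type; adapt the field names and `types` to your data.

```julia
using CSV

# Hypothetical stand-in for the self-defined type.
struct Record
    id::Int
    name::String
    score::Float64
end

records = Record[]  # empty, concretely typed vector
# CSV.Rows streams the file row by row instead of materializing it all at once.
for row in CSV.Rows("data.csv"; types = [Int, String, Float64])
    push!(records, Record(row.id, row.name, row.score))  # amortized O(1) append
end
```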


Thank you. Alternatively, I could inspect the input files beforehand and note the number of rows there.

The CSV format relies on the newline character (preceded by a carriage return on Windows) to separate records, and every record may have a different length. So there is no way to find out how many rows are in the file without examining every single character in it.
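If you do want the row count up front anyway, Base's `countlines` does that single pass for you. A caveat: it counts physical lines, so subtract one for a header row, and a quoted field containing an embedded newline will inflate the count. The file name here is illustrative.

```julia
# One pass over the file, counting newlines; no field parsing.
nrows = countlines("data.csv") - 1  # subtract 1 for the header row
# Caveat: quoted fields with embedded newlines make this an overcount.
```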

If you have a very large file, you can parallelize the read/count across multiple workers to speed that up.
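As a rough illustration of that idea (using threads rather than Distributed workers, for simplicity), one could memory-map the file and split the newline count across threads. This is only a sketch, assuming the file fits in the address space; start Julia with `julia -t N` to get multiple threads.

```julia
using Mmap

function count_newlines(path::AbstractString)
    data = Mmap.mmap(path)        # maps the file contents as a Vector{UInt8}
    n = Threads.nthreads()
    counts = zeros(Int, n)
    chunk = cld(length(data), n)  # bytes per thread, rounded up
    Threads.@threads for t in 1:n
        lo = (t - 1) * chunk + 1
        hi = min(t * chunk, length(data))
        c = 0
        for i in lo:hi
            @inbounds c += data[i] == UInt8('\n')
        end
        counts[t] = c
    end
    return sum(counts)  # ≈ number of records (same header/quoting caveats as above)
end
```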

@Tamas_Papp had a good suggestion for using push! so you can avoid this work.
