I need to read a large .CSV file and parse each line into the attributes of a self-defined type. Before handling the CSV file, I was thinking of creating a vector of a fixed length, equal to the number of rows in the CSV file. Based on past experience, I thought that fixing the vector's length up front would make element assignment more efficient. So, is there any way to find the number of rows of the .CSV file without reading the whole file? Probably not, right? How do you design your code to avoid this?
I would just `push!` rows as a known type, probably a `NamedTuple` or a `struct`, to a `Vector`. The implementation of `push!` is very efficient (it grows the underlying buffer geometrically, so the amortized cost per element is O(1)) and is almost surely better than reading the file twice (which is the only way to establish the number of rows).
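A minimal sketch of this approach, assuming a hypothetical two-column file with a header row and columns `name` and `score` (the column names and the `read_rows` function are illustrative, not from any library):

```julia
# Each parsed row becomes a NamedTuple pushed onto a concretely-typed
# Vector; push! grows the buffer geometrically, so no row count is needed.
const Row = @NamedTuple{name::String, score::Float64}

function read_rows(path::AbstractString)
    rows = Row[]              # empty, concretely-typed vector
    open(path) do io
        readline(io)          # skip the header line
        for line in eachline(io)
            name, score = split(line, ',')
            push!(rows, (name = String(name), score = parse(Float64, score)))
        end
    end
    return rows
end
```

Because the element type is concrete, the resulting `Vector{Row}` is stored compactly and later loops over it are type-stable.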
Thank you. Alternatively, I could inspect the input files beforehand and note the number of rows there.
The CSV format relies on the newline character (plus a carriage return on Windows) to separate records, and every record may have a different length. So there is no way to find out how many rows are in the file without examining every single character in it.
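Concretely, even the built-in line counter has to scan every byte of the file looking for newlines; there is no shortcut. A one-line sketch, assuming the file has a single header row (`count_records` is an illustrative name, not a library function):

```julia
# countlines reads the whole file; subtract 1 for the header row.
count_records(path::AbstractString) = countlines(path) - 1
```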
If you have a very large file then you can parallelize the read/count across multiple workers to speed that up.
@Tamas_Papp had a good suggestion for using `push!` so you can avoid this work.
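If you do want the parallel count, here is a minimal sketch using threads rather than worker processes: each task counts `'\n'` bytes in its slice of a memory-mapped file. The names (`parallel_linecount`, `ntasks`) are illustrative, and it assumes the file fits comfortably as an mmap and that every record ends with a newline.

```julia
using Base.Threads, Mmap

function parallel_linecount(path::AbstractString; ntasks::Int = nthreads())
    bytes = open(io -> Mmap.mmap(io, Vector{UInt8}), path)
    n = length(bytes)
    counts = zeros(Int, ntasks)
    @threads for t in 1:ntasks
        # Byte range [lo, hi] handled by this task.
        lo = div((t - 1) * n, ntasks) + 1
        hi = div(t * n, ntasks)
        c = 0
        @inbounds for i in lo:hi
            c += bytes[i] == UInt8('\n')
        end
        counts[t] = c
    end
    return sum(counts)
end
```

Start Julia with `julia -t auto` (or set `JULIA_NUM_THREADS`) for the threads to actually run in parallel; with one thread it degrades gracefully to a sequential scan.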