Handle large csv file using `enumerate(CSV.File())` or `CSV.read()`?

I need to read a large .CSV file and parse each line into the attributes of a self-defined type. Before handling the CSV file, I was thinking of creating a vector of a fixed length, namely the number of rows in the CSV file. Based on past experience, I thought that fixing the vector length up front would be more efficient in terms of assigning vector elements. So, is there any way to find out the number of rows of the .CSV file without reading the whole file? Probably not, right? How do you design your code to avoid this?

I would just push! rows of a known type, probably a NamedTuple or a struct, to a Vector. The implementation of push! is very efficient (it over-allocates, so appends are amortized constant time) and is almost surely better than reading the file twice (which is the only way to establish the number of rows).
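For concreteness, here is a minimal sketch of that approach. It assumes a hypothetical file `data.csv` with columns `id`, `name`, and `score`, and `Record` is a stand-in for the self-defined type; adapt the field names and `types` to your data.

```julia
using CSV

# Hypothetical stand-in for the self-defined type.
struct Record
    id::Int
    name::String
    score::Float64
end

records = Record[]  # empty, concretely typed vector
# CSV.Rows streams the file row by row instead of materializing it all at once.
for row in CSV.Rows("data.csv"; types = [Int, String, Float64])
    push!(records, Record(row.id, row.name, row.score))  # amortized O(1) append
end
```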


Thank you. Alternatively, I could inspect the input files beforehand and note the number of rows there.

The CSV format relies on the newline character (preceded by a carriage return on Windows) to separate records, and every record may have a different length. So there is no way to find out how many rows are in the file without examining every single character in it.
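If you do want the row count up front anyway, Base's `countlines` does that single pass for you. A caveat: it counts physical lines, so subtract one for a header row, and a quoted field containing an embedded newline will inflate the count. The file name here is illustrative.

```julia
# One pass over the file, counting newlines; no field parsing.
nrows = countlines("data.csv") - 1  # subtract 1 for the header row
# Caveat: quoted fields with embedded newlines make this an overcount.
```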

If you have a very large file, you can parallelize the read/count across multiple workers to speed that up.
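As a rough illustration of that idea (using threads rather than Distributed workers, for simplicity), one could memory-map the file and split the newline count across threads. This is only a sketch, assuming the file fits in the address space; start Julia with `julia -t N` to get multiple threads.

```julia
using Mmap

function count_newlines(path::AbstractString)
    data = Mmap.mmap(path)        # maps the file contents as a Vector{UInt8}
    n = Threads.nthreads()
    counts = zeros(Int, n)
    chunk = cld(length(data), n)  # bytes per thread, rounded up
    Threads.@threads for t in 1:n
        lo = (t - 1) * chunk + 1
        hi = min(t * chunk, length(data))
        c = 0
        for i in lo:hi
            @inbounds c += data[i] == UInt8('\n')
        end
        counts[t] = c
    end
    return sum(counts)  # ≈ number of records (same header/quoting caveats as above)
end
```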

@Tamas_Papp had a good suggestion for using push! so you can avoid this work.
