I’m trying to use CSV.Row as well as ViewReader to read files line by line as efficiently as possible (following the ViewReader topic). Ultimately the goal is to process, on the fly, very large gzipped files that alternate between data blocks (atomic configurations) and headers (number of atoms, simulation box information, etc.), without loading the whole file into memory at once.
I’m benchmarking several strategies on a very simple file, based on eachline (Base), eachlineV (ViewReader), and CSV.Row (CSV); for the sake of comparison I’m also benchmarking CSV.File, CSV.read, and readdlm. The benchmarking code is self-contained:
mwe.jl (4.4 KB)
Each strategy fills two matrices stored in a struct, one of floats (4 columns) and one of ints (6 columns), from the data file. Conversion from a DataFrame to matrices, plus allocation, is extremely fast, and the timings account for it when the file is read all at once (CSV.File, CSV.read, and readdlm).
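To make the line-by-line strategy concrete, here is a minimal sketch of the eachline (Base) variant, assuming a whitespace-delimited file with 4 float columns followed by 6 int columns per line; the struct and function names (`Frame`, `read_frame`) are hypothetical, not the ones in mwe.jl:

```julia
# Hypothetical container: one float matrix (4 cols) and one int matrix (6 cols).
struct Frame
    floats::Matrix{Float64}
    ints::Matrix{Int}
end

# Parse the file line by line with Base.eachline, filling preallocated matrices.
function read_frame(path::AbstractString, nlines::Int)
    f = Frame(Matrix{Float64}(undef, nlines, 4), Matrix{Int}(undef, nlines, 6))
    for (i, line) in enumerate(eachline(path))
        fields = split(line)               # allocates substrings per line
        for j in 1:4
            f.floats[i, j] = parse(Float64, fields[j])
        end
        for j in 1:6
            f.ints[i, j] = parse(Int, fields[4 + j])
        end
    end
    return f
end
```

The per-line `split` is what ViewReader’s eachlineV avoids by returning views into a reused buffer instead of allocating substrings.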
For 2000 lines I get the following timings (julia 1.9.0-rc3, -O3) on an Apple M1 Max chip (ARM binary):
```
1.815 ms (14 allocations: 20.35 KiB)
1.235 ms (10026 allocations: 2.70 MiB)
4.358 ms (45452 allocations: 1.67 MiB)
896.625 μs (2180 allocations: 369.51 KiB)
882.958 μs (2133 allocations: 211.15 KiB)
3.062 ms (54882 allocations: 1.95 MiB)
```
and (for reference) on an Intel Xeon Gold 5220 (with an NVMe SSD) I get
```
2.277 ms (14 allocations: 20.35 KiB)
1.920 ms (10026 allocations: 2.70 MiB)
6.776 ms (45444 allocations: 1.67 MiB)
1.610 ms (2175 allocations: 369.43 KiB)
1.528 ms (2129 allocations: 211.09 KiB)
4.075 ms (54882 allocations: 1.95 MiB)
```
Obviously CSV.File and CSV.read are the fastest. The memory consumption of the ViewReader-based solution is negligible, but it is slower than the eachline-based one; maybe this PR (PR) is already integrated in rc3?
Why is the CSV.Row solution so slow compared to the others? Is there something I’m doing wrong, or is there room for improvement in CSV.Row?
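For reference, the row-iteration entry point in CSV.jl is CSV.Rows (which yields CSV.Row objects). My understanding is that passing explicit column types and `reusebuffer=true` avoids per-row type detection and buffer allocation; a sketch under that assumption (file name and options are illustrative, not taken from mwe.jl):

```julia
using CSV

# Lazy row iteration with fixed column types and a reused parsing buffer,
# which should cut per-row allocations compared to the defaults.
rows = CSV.Rows("data.txt";
                delim=' ', header=false, ignorerepeated=true,
                types=[fill(Float64, 4); fill(Int, 6)],
                reusebuffer=true)
for row in rows
    x = row[1]  # first (Float64) column of the current row
    # ... fill the matrices here ...
end
```

With `reusebuffer=true` the rows are only valid during iteration, so values must be copied out inside the loop, which matches the fill-matrices pattern above.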
Sorry for the rather complex post and code. I hope you’ll find it interesting.