I’m trying to use CSV.Row as well as ViewReader to read files line by line as efficiently as possible (following the ViewReader topic). Ultimately the goal is to process, on the fly, very large gzipped files that alternate between data blocks (atomic configurations) and headers (number of atoms, simulation box information, etc.), without loading the whole file into memory at once.
I’m benchmarking several strategies on a very simple file, based on eachline (Base), eachlineV (ViewReader), and CSV.Row (CSV); for the sake of comparison I’m also benchmarking CSV.File, CSV.read, and readdlm. The benchmarking code is self-contained:
mwe.jl (4.4 KB)
Each strategy fills two matrices stored in a struct, one of floats (4 columns) and one of ints (6 columns), from the data file. Conversion from a DataFrame to matrices, plus allocation, is extremely fast, and the timings account for it when the file is read all at once (CSV.File, CSV.read, and readdlm).
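To make the line-by-line strategy concrete, here is a minimal sketch of the eachline (Base) variant, assuming a whitespace-delimited file with 4 float columns followed by 6 int columns per line; the struct and function names (`Frame`, `read_frame`) are hypothetical, not the ones in mwe.jl:

```julia
# Hypothetical container: one float matrix (4 cols) and one int matrix (6 cols).
struct Frame
    floats::Matrix{Float64}
    ints::Matrix{Int}
end

# Parse the file line by line with Base.eachline, filling preallocated matrices.
function read_frame(path::AbstractString, nlines::Int)
    f = Frame(Matrix{Float64}(undef, nlines, 4), Matrix{Int}(undef, nlines, 6))
    for (i, line) in enumerate(eachline(path))
        fields = split(line)               # allocates substrings per line
        for j in 1:4
            f.floats[i, j] = parse(Float64, fields[j])
        end
        for j in 1:6
            f.ints[i, j] = parse(Int, fields[4 + j])
        end
    end
    return f
end
```

The per-line `split` is what ViewReader’s eachlineV avoids by returning views into a reused buffer instead of allocating substrings.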
For 2000 lines I get the following timings (julia 1.9.0-rc3, -O3) on an Apple M1 Max chip (ARM binary):
```
1.815 ms (14 allocations: 20.35 KiB)
1.235 ms (10026 allocations: 2.70 MiB)
4.358 ms (45452 allocations: 1.67 MiB)
896.625 μs (2180 allocations: 369.51 KiB)
882.958 μs (2133 allocations: 211.15 KiB)
3.062 ms (54882 allocations: 1.95 MiB)
```
and (for reference) on an Intel Xeon Gold 5220 (with an NVMe SSD) I get
```
2.277 ms (14 allocations: 20.35 KiB)
1.920 ms (10026 allocations: 2.70 MiB)
6.776 ms (45444 allocations: 1.67 MiB)
1.610 ms (2175 allocations: 369.43 KiB)
1.528 ms (2129 allocations: 211.09 KiB)
4.075 ms (54882 allocations: 1.95 MiB)
```
Obviously CSV.File and CSV.read are the fastest. The memory consumption of the ViewReader-based solution is negligible, but it is slower than the eachline-based one; maybe this PR (PR) is already integrated in rc3?
Why is the CSV.Row solution so slow compared to the others? Is there something I’m doing wrong, or is there room for improvement in CSV.Row?
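For reference, the row-iteration entry point in CSV.jl is CSV.Rows (which yields CSV.Row objects). My understanding is that passing explicit column types and `reusebuffer=true` avoids per-row type detection and buffer allocation; a sketch under that assumption (file name and options are illustrative, not taken from mwe.jl):

```julia
using CSV

# Lazy row iteration with fixed column types and a reused parsing buffer,
# which should cut per-row allocations compared to the defaults.
rows = CSV.Rows("data.txt";
                delim=' ', header=false, ignorerepeated=true,
                types=[fill(Float64, 4); fill(Int, 6)],
                reusebuffer=true)
for row in rows
    x = row[1]  # first (Float64) column of the current row
    # ... fill the matrices here ...
end
```

With `reusebuffer=true` the rows are only valid during iteration, so values must be copied out inside the loop, which matches the fill-matrices pattern above.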
Sorry for the rather complex post and code. I hope you’ll find it interesting.