Skipping a lot of lines in CSV.read() allocates too much memory


If you are just looking for something that works, the following does the job:

using CSV, DataFrames

rows = eachline("test.csv")  # stateful iterator over the file's lines
m = 2                        # number of lines per chunk
dfs = DataFrame[]

while !isempty(rows)
    # grab the next m lines and parse them as their own little CSV
    chunk = IOBuffer(join(Iterators.take(rows, m), "\n"))
    df = CSV.read(chunk, DataFrame; buffer_in_memory=true)
    push!(dfs, df)
end

It might be terribly slow, though, since CSV.jl's parallelism cannot be exploited on chunks this small.


PS: Do you have control over the input CSV files? It feels much easier to split each file once into several files instead of dealing with such non-standard CSV files, as in the sketch below.
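If splitting is an option, the one-time split itself can be very small. This is only a sketch under the same assumptions as above (a file called test.csv, m lines per part); the test_part_$i.csv output names are made up for illustration.

m = 2  # lines per output file
open("test.csv") do io
    for (i, part) in enumerate(Iterators.partition(eachline(io), m))
        # write each group of m lines to its own small CSV file
        write("test_part_$i.csv", join(part, "\n") * "\n")
    end
end

Each part can then be read on its own with an ordinary CSV.read("test_part_1.csv", DataFrame) call.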
