Skipping a lot of lines in CSV.read() allocates too much memory


If you are just looking for something that works, the following does the job:

using CSV, DataFrames

rows = eachline("test.csv")  # stateful iterator over the file's lines
m = 2                        # number of lines per chunk
dfs = DataFrame[]

while !isempty(rows)
    # grab the next m lines and parse them as their own little CSV
    chunk = IOBuffer(join(Iterators.take(rows, m), "\n"))
    df = CSV.read(chunk, DataFrame; buffer_in_memory=true)
    push!(dfs, df)
end

It might be terribly slow, though, since CSV.jl's parallelism cannot be exploited on chunks this small.


PS: Do you have control over the input CSV files? It feels much easier to split each file once into several files instead of dealing with such non-standard CSV files, as in the sketch below.
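If splitting is an option, the one-time split itself can be very small. This is only a sketch under the same assumptions as above (a file called test.csv, m lines per part); the test_part_$i.csv output names are made up for illustration.

m = 2  # lines per output file
open("test.csv") do io
    for (i, part) in enumerate(Iterators.partition(eachline(io), m))
        # write each group of m lines to its own small CSV file
        write("test_part_$i.csv", join(part, "\n") * "\n")
    end
end

Each part can then be read on its own with an ordinary CSV.read("test_part_1.csv", DataFrame) call.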
