Skipping a lot of lines in CSV.read allocates too much memory


Hello everyone,

Consider the following example:

using CSV, DataFrames

number_of_lines = 10^4
CSV.write("data.csv", DataFrame(rand(number_of_lines,10), :auto))

steps = 20
chunk = round(Int,number_of_lines/steps)

for i in 1:steps
    @time CSV.read("data.csv", DataFrame; skipto=(i - 1) * chunk + 2, limit=chunk)  # +2 accounts for the header row
end

Effectively, we just split the csv file into “chunks” of rows, and read them one by one.

It seems that when more lines are skipped, a lot more memory is allocated, so for reading very big files in chunks (e.g. the example above with 10^6 lines), getting towards the end of the file becomes painfully slow.

Is there a way to deal with this issue?
Would it be more appropriate to use another function to import the data perhaps?


I think CSV.read needs to open the file again each time and parse it until it has reached the desired spot, so you get quadratic runtime in the number of chunks.

Can you just read the whole file (or up to the point you need) into a big DataFrame and then split it into chunks?
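A minimal sketch of that approach, assuming the `data.csv` from the MRE above (the `CSV.write` line just recreates it so the snippet stands alone, and `steps` is illustrative):

```julia
using CSV, DataFrames

CSV.write("data.csv", DataFrame(rand(10_000, 10), :auto))  # recreate the MRE file

df = CSV.read("data.csv", DataFrame)   # parse the whole file once
steps = 20
chunk = cld(nrow(df), steps)           # rows per chunk, rounded up

# views into the parsed DataFrame avoid copying the column data
parts = [view(df, lo:min(lo + chunk - 1, nrow(df)), :) for lo in 1:chunk:nrow(df)]
```

Since the parts are views, this costs one parse plus essentially no extra memory beyond the loaded DataFrame itself.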

Alternatively, I found CSV.Chunks in the docs, which I think allows you to write something like this:

for chunk in CSV.Chunks("data.csv"; ntasks=steps)
    @time a = chunk |> DataFrame
end

But I think that still loads the entire file at once.


Thank you for the suggestion @abraemer.
CSV.Chunks is pretty fast indeed!
However, there’s some interesting behaviour:

using CSV, DataFrames

number_of_lines = 10^6
CSV.write("data.csv", DataFrame(rand(number_of_lines, 10), :auto))

steps = 10

@time for chunk in CSV.Chunks("data.csv"; ntasks=steps)
    a = chunk |> DataFrame
    println(size(a))
end

(99999, 10)
(99999, 10)
(100002, 10)
(100003, 10)
(99998, 10)
(99996, 10)
(99996, 10)
(99999, 10)
(100003, 10)
(100000, 10)
(100004, 10)
  0.791839 seconds (8.19 k allocations: 154.334 MiB, 2.68% gc time)

The chunks don’t have the same size for some reason, which is bad news if you have a strictly formatted file, with headers at specific intervals.

Initially I thought the header was the problem in the MRE above, but even after deleting it manually I get exactly the same behaviour.
Is that normal?

As an alternative, for anyone else reading this post, using numpy’s loadtxt() through PythonCall also does the job, but I’d rather stick to Julia if I have the option to do so.

I have no idea :smiley: However, it says in the docs that CSV.Chunks is experimental…

But I think you can actually skip CSV.Chunks and just use CSV.File directly:

file = CSV.File("data.csv")
for i in 1:steps
    @time a = file[(i - 1) * chunk + 1 : i * chunk]
end

That would be a good option, but sometimes the data file is prohibitively large, and I would rather not load the whole thing into RAM.

Perhaps I should submit an issue regarding the Chunks behaviour, if no one else pops up to explain why that happens.

Have you tried CSV.Rows? I believe it only reads one row at a time. The downside is that it returns elements as strings, so you have to parse the strings yourself.
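For what it's worth, a sketch of how CSV.Rows might cover the skip-and-read pattern: `Iterators.drop`/`Iterators.take` handle the positioning, and the `types` keyword (which CSV.Rows accepts) sidesteps the manual string parsing. The skip/take values are illustrative, and the `CSV.write` line just recreates the MRE file:

```julia
using CSV, DataFrames

CSV.write("data.csv", DataFrame(rand(10_000, 10), :auto))  # recreate the MRE file

# Stream rows lazily; `types=Float64` asks CSV.jl to parse values
# instead of returning PosLenStrings (assumes all columns are numeric).
rows = CSV.Rows("data.csv"; types=Float64)

skip, take = 5_000, 500   # illustrative: rows to skip, then rows to read
selected = [row.x1 for row in Iterators.take(Iterators.drop(rows, skip), take)]
```

Memory stays flat because only one row is materialised at a time, though skipping is still linear in the number of rows skipped.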

Thank you for the suggestion, @CameronBieganek .
I’m not quite sure how to use CSV.Rows to extract the information that I need (skipping to a particular line, then reading a number of lines from there).

When I use CSV.Rows("data.csv") on the MRE above, the results are as follows:

Size: 10
 :x1   Union{Missing, PosLenString}
 :x2   Union{Missing, PosLenString}
 :x3   Union{Missing, PosLenString}
 :x8   Union{Missing, PosLenString}
 :x9   Union{Missing, PosLenString}
 :x10  Union{Missing, PosLenString}

and frankly, I’ve got no idea what that is.

If you are just looking for some working solution, the following works:

using CSV, DataFrames

m = 2
dfs = DataFrame[]

open("test.csv") do io
    while !eof(io)
        chunk = IOBuffer(join(Iterators.take(eachline(io), m), "\n"))
        # header=false: treat every chunk as data; drop it if your chunks do start with a header line
        df = CSV.read(chunk, DataFrame; buffer_in_memory=true, header=false)
        push!(dfs, df)
    end
end

Although it might be terribly slow due to the lack of parallelism in CSV.jl when reading chunk by chunk like this.

PS: Do you have control over the input CSV files? It feels much easier to split the file once into several files instead of dealing with such non-standard CSV files.
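In case the one-time split route is attractive, a stdlib-only sketch (the `part_N.csv` naming and the `m` default are made up for illustration):

```julia
# One-time preprocessing: split a huge CSV into numbered part files of
# `m` data lines each, written next to the source file.
function split_csv(path; m=100_000)
    open(path) do io
        header = readline(io)          # assumes a single header line
        part = 0
        while !eof(io)
            part += 1
            open(joinpath(dirname(path), "part_$part.csv"), "w") do out
                println(out, header)   # repeat the header in every part
                for line in Iterators.take(eachline(io), m)
                    println(out, line)
                end
            end
        end
    end
end
```

This streams line by line, so memory use is constant regardless of the input size.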


Thank you for the input @SteffenPL.

As I mention above, using numpy’s loadtxt() through PythonCall also does the job, and it’s pretty fast!

It just feels a bit wrong to rely on a Python library for speed when using Julia, so I thought perhaps I might be missing something when it comes to using the CSV.File or CSV.read functions.

As for the PS, I’m working on a file that’s almost half a billion lines long, split into 250000 different chunks, so I’d rather keep the file as is.

DuckDB will generally use less memory and allow you to transform and load into memory only the rows you need. I’m not sure if the version of the package on the registry is currently working, so you might have to build the dev version. If you start Julia with many threads, it will use multithreaded queries too. You can see the improvement of Parquet vs. CSV in the 1 billion row writeup.

using DuckDB, DataFrames
db = DuckDB.DB()
DuckDB.execute(db, "CREATE VIEW data AS SELECT * FROM read_csv_auto('data.csv')");
@time a = DuckDB.execute(db, "SELECT * FROM data") |> DataFrame;
# convert to parquet on disk
DuckDB.execute(db, "COPY (SELECT * FROM read_csv_auto('data.csv')) TO 'data.parquet'");
@time a = DuckDB.execute(db, "SELECT * FROM 'data.parquet'") |> DataFrame;

This sounds like an excellent candidate for reading into SQLite. Do it once, and you can use SQL to repeatedly query for every subset / property of the data which you need. That’s what I’d do even if CSV.jl were optimally fast.
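A sketch of that workflow with SQLite.jl (the file and table names are illustrative, and the `CSV.write` line creates a small stand-in file so the snippet stands alone); the import happens once, and later sessions can query `data.db` directly:

```julia
using SQLite, DBInterface, DataFrames, CSV

CSV.write("data.csv", DataFrame(rand(1_000, 3), :auto))  # small stand-in file

db = SQLite.DB("data.db")                    # on-disk database
SQLite.load!(CSV.File("data.csv"), db, "data")  # one-time import

# afterwards, pull only the rows you need
part = DBInterface.execute(db,
    "SELECT * FROM data LIMIT 100 OFFSET 500") |> DataFrame
```

LIMIT/OFFSET gives exact, reproducible chunk boundaries, which is precisely what CSV.Chunks could not guarantee.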


@rdavis120 and @mnemnion, thank you for recommending these alternative packages, I was not aware of them.
Looks like I got some homework to do, to decide which solution would be optimal.

However, I think it would still be worth looking at making CSV.jl capable of dealing with such cases, since it is probably the most popular package for reading files like these. I will proceed with submitting an issue on their page, just in case.


I also found it surprising that there is no easy solution (other than writing one’s own IOBuffer-based workaround).

However, it’s not within the scope of CSV.jl to deal with ill-formatted CSV files. The requirement to meet an exact chunk size is also a challenge for parallelization, as one cannot predict where a new thread has to start: the byte offsets of the right positions in the file are unknown.


Does your file have multiple header rows throughout the file? Because that’s a major headache for any CSV reader. Are the headers all the same, or do the data schemas change for each chunk?

The file I’m currently working with looks like this:

-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0.239375 0.505765 0.143033
2 2 0.254461 0.510929 0.143544
3 2 0.240602 0.489869 0.142478
4 3 0.23796 0.503614 0.142924
5 1 0.595518 0.0928577 0.411568
6 2 0.583006 0.0833093 0.414176
1998 2 0.232469 0.378261 0.657207
1999 2 0.248144 0.373883 0.676491
2000 3 0.235467 0.369355 0.67263
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0.242619 0.507873 0.136935
2 2 0.253103 0.515054 0.14658

and then repeats itself like that.

Pretty ugly format in my opinion, but it is what it is.
It comes from LAMMPS, a software for molecular dynamics simulations.

Anyhow, I don’t think we need to get into the specifics of my file format here, I just wanted to clarify whether I’m missing something in the CSV package.

I think part of the problem is that CSV.Chunks divides the file into ntasks roughly equal parts by size in bytes, not by number of lines.

If you try to create a file in which each line has the same size in bytes, the problem should not arise.

Interesting insight, but wouldn’t each line have the same size in the example above?

Regarding the bytes per line: this one, for example, is one byte shorter…
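That matches what the MRE produces: `rand()` values print with a varying number of digits, so the byte length per line is not constant. A quick self-contained check (the `tiny.csv` name is just for illustration):

```julia
using CSV, DataFrames

CSV.write("tiny.csv", DataFrame(rand(5, 3), :auto))
lens = length.(readlines("tiny.csv"))  # byte length of each line
println(lens)  # typically not all equal, since rand() prints variable-width floats
```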

On the contrary. Your file format is not even close to being a CSV. It’s not surprising that CSV.jl does not work well for that type of file.

Look for a library that provides a custom reader for LAMMPS data. Or else you will need to write your own LAMMPS file reader.
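For the dump format shown above, a hand-rolled reader can be quite short. A sketch, assuming every atom row has exactly the five columns from the `ITEM: ATOMS id type xs ys zs` header (the function name is made up):

```julia
# Minimal reader for the LAMMPS-style dump shown above: each block starts
# with an "ITEM: ATOMS" line and ends at the first line that no longer has
# five whitespace-separated atom columns.
function read_lammps_blocks(path)
    blocks = Vector{Vector{NTuple{5,Float64}}}()
    current = nothing
    for line in eachline(path)
        if startswith(line, "ITEM: ATOMS")
            current = NTuple{5,Float64}[]   # start a new block
            push!(blocks, current)
        elseif current !== nothing
            fields = split(line)
            if length(fields) == 5
                push!(current, Tuple(parse.(Float64, fields)))
            else
                current = nothing           # block finished
            end
        end
    end
    return blocks
end
```

Because it streams with `eachline`, this never holds more than the current block in memory, and each timestep arrives with an exact, predictable size.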


Oh, I’d never have imagined that, to be honest.
Would that be due to the inconsistent number of columns?

Nevertheless, the issue persists even for the MRE above, which is a well-behaved CSV file as far as I’m aware, so I reckon there’s still something to investigate here.