Skipping a lot of lines in CSV.read() allocates too much memory


Thank you for the suggestion, @CameronBieganek. That creates a very tidy file indeed, but I’m not sure it is relevant here.
First of all, it discards the BOX BOUNDS section, which I would like to keep/utilize**, and it also writes everything into a new file, which I would then have to import again before I can do anything useful with the data.

**These are nuances specific to my particular case, and I’d like to keep this post fairly general, so I’d rather stick to the MRE in the original post.

Sorry for the confusion; to reiterate:
My files can be very big (on the order of 10 GB), so I can’t load the whole thing into memory to do my computations.
My strategy is therefore to divide the file into “chunks” and loop over them (import a block of data, do the calculations, move on to the next block).
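Schematically, the pattern I have in mind looks roughly like this (a rough Base-Julia sketch, not the code I actually run; the file name and chunk size are placeholders):

chunk_size = 10_000                      # data rows per block (placeholder value)
open("data.csv") do io
    readline(io)                         # skip the header line
    while !eof(io)
        block = String[]
        for _ in 1:chunk_size
            eof(io) && break
            push!(block, readline(io))   # collect one block of raw lines
        end
        # per-chunk computations on `block` would go here
    end
end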

As mentioned above, numpy does the job pretty well with the following:

using CSV, DataFrames

number_of_lines = 10^5
CSV.write("data.csv", DataFrame(rand(number_of_lines, 10), :auto))

steps = 10
chunk = round(Int, number_of_lines / steps)

# Numpy version, nice and fast
using PythonCall
np = pyimport("numpy")
for i in 1:steps
    @time a = pyconvert(Matrix, np.loadtxt("data.csv", skiprows=i * chunk - 1, max_rows=chunk, delimiter=','))
    # extra computations using "a" would go somewhere here
end

# CSV version, painfully slow
for i in 1:steps
    @time a = CSV.read("data.csv", DataFrame, skipto=i * chunk, limit=chunk, delim=',')
end
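
(If I understand the CSV.jl docs correctly, CSV.Chunks parses the file in a single pass, split into ntasks pieces, which sounds closer to what I want; a sketch of what I mean, continuing the MRE above — the ntasks value and the DataFrame conversion are my assumptions, and I haven’t timed this against numpy on my real files:)

# Single-pass chunked parse with CSV.Chunks (my reading of the API; ntasks is
# assumed to control how many pieces the file is split into)
for chunk_tbl in CSV.Chunks("data.csv", ntasks=steps)
    df = DataFrame(chunk_tbl)
    # per-chunk computations on `df` would go here
end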

The numpy approach is equally effective for my ugly LAMMPS-format example, without altering the file.

The main question is:
Does CSV.jl or another Julia package offer numpy’s performance (or better) in this case?

It does feel a bit awkward to be reaching for Python libraries for the sake of speed.
(I know numpy is largely written in C, but since speed is Julia’s main selling point, the situation doesn’t make much sense to me.)