Skipping a lot of lines in CSV.read() allocates too much memory


“CSV” in practice refers to a wide variety of related formats. But something which doesn’t have a header followed by rows of columns isn’t really a CSV, it’s more like a bunch of CSVs concatenated together. Knowing more about the format I doubt SQLite would succeed in reading it in without some custom code, although I suppose it’s worth a shot, SQLite’s capabilities have surprised me before.


What you have is not a tabular format, thus the suggestions so far are of limited use. Looks to me like it would be better read as JSON or similar.

I assume that since you are asking about DataFrames and whatnot that you are only interested in what must be those things that look like atom ids and coordinates. You can modify the above procedures to iterate through the file looking for "ITEM: ATOMS id type xs ys zs", then accumulate all of the following lines into an IOBuffer, stopping when you reach the next line beginning with "ITEM", then output the IOBuffer to DataFrame. Rinse and repeat for the next section.
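
For example, here is a minimal sketch of that accumulate-per-block idea, assuming the header line is exactly "ITEM: ATOMS id type xs ys zs" and the data columns are single-space separated (the name read_atom_blocks is just a placeholder):

using CSV, DataFrames

function read_atom_blocks(path)
    dfs = DataFrame[]
    buf = IOBuffer()
    in_block = false
    for line in eachline(path)
        if startswith(line, "ITEM: ATOMS")
            in_block = true                      # start accumulating data lines
        elseif startswith(line, "ITEM")
            if in_block                          # block finished: parse the buffered lines
                push!(dfs, CSV.read(take!(buf), DataFrame;
                                    header = [:id, :type, :xs, :ys, :zs], delim = ' '))
                in_block = false
            end
        elseif in_block
            println(buf, line)                   # data line: append it to the buffer
        end
    end
    # flush the last block at end of file
    in_block && push!(dfs, CSV.read(take!(buf), DataFrame;
                                    header = [:id, :type, :xs, :ys, :zs], delim = ' '))
    return dfs
end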

If you wanted to read the file in parallel, you can do 1 pass to get the start/stop lines and then a second pass where you read multiple chunks into DataFrames separately, although you probably need to memory map the file or something for this to work.
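
A rough sketch of that two-pass idea, under the same assumptions about the "ITEM: ATOMS id type xs ys zs" header and space-separated columns (block_ranges and read_blocks are placeholder names, and the threaded loop only helps if Julia was started with multiple threads):

using CSV, DataFrames, Mmap

# Pass 1: record the byte range of every "ITEM: ATOMS ..." data block (1-based, inclusive).
function block_ranges(path)
    ranges = UnitRange{Int}[]
    start = 0
    open(path) do io
        while !eof(io)
            linestart = position(io)            # 0-based offset of this line's first byte
            line = readline(io)
            if startswith(line, "ITEM: ATOMS")
                start = position(io) + 1        # first data byte, as a 1-based index
            elseif startswith(line, "ITEM") && start > 0
                push!(ranges, start:linestart)  # block ended just before this header line
                start = 0
            end
        end
        start > 0 && push!(ranges, start:position(io))   # last block runs to end of file
    end
    return ranges
end

# Pass 2: parse the blocks independently.
function read_blocks(path, ranges)
    bytes = mmap(path)
    dfs = Vector{DataFrame}(undef, length(ranges))
    Threads.@threads for i in eachindex(ranges)
        dfs[i] = CSV.read(bytes[ranges[i]], DataFrame;
                          header = [:id, :type, :xs, :ys, :zs], delim = ' ')
    end
    return dfs
end

dfs = read_blocks("LAMMPS.txt", block_ranges("LAMMPS.txt"))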


Thank you for the suggestion.
I suppose since numpy works well, I might just stick to that for my particular ugly-file scenario. That might be the most elegant solution at this stage.

As for the general, well-behaved case in the MRE above (for which the topic was created), I’m not sure whether we got an answer or not.

Perhaps LAMMPS.jl could be of assistance?

Try starting with a table containing the same values in each cell. Then try some variations.

I would preprocess the data and turn it into a real CSV file. Here is a MWE.

Start with the following LAMMPS file ("LAMMPS.txt"):

ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
2000
ITEM: BOX BOUNDS pp pp pp
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0.239375 0.505765 0.143033
2 2 0.254461 0.510929 0.143544
3 2 0.240602 0.489869 0.142478
4 3 0.23796 0.503614 0.142924
ITEM: TIMESTEP
1
ITEM: NUMBER OF ATOMS
2000
ITEM: BOX BOUNDS pp pp pp
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0.242619 0.507873 0.136935
2 2 0.253103 0.515054 0.14658
3 2 0.240602 0.489869 0.142478
4 3 0.23796 0.503614 0.142924

Now run the following code:

lines = eachline("LAMMPS.txt")
lines_split = Iterators.map(line -> split(line, ' '), lines)
data_lines_split = Iterators.filter(vec -> length(vec) == 5, lines_split)
data_lines = Iterators.map(vec -> join(vec, ','), data_lines_split)

open("LAMMPS.csv", "w") do io
    for line in data_lines
        println(io, line)
    end
end

Here is the output file:

1,1,0.239375,0.505765,0.143033
2,2,0.254461,0.510929,0.143544
3,2,0.240602,0.489869,0.142478
4,3,0.23796,0.503614,0.142924
1,1,0.242619,0.507873,0.136935
2,2,0.253103,0.515054,0.14658
3,2,0.240602,0.489869,0.142478
4,3,0.23796,0.503614,0.142924

Note that this solution streams through the input file one row at a time, so it should work on files that are larger than the RAM available on your machine.

Of course, instead of creating a new CSV file, you could try to save the cleaned up data in a Parquet or Arrow file, or as a SQLite or DuckDB table. That would probably require a bit more processing than what I have in the MWE above.
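
For instance, a hedged variation on the MWE above that loads the cleaned-up rows into a SQLite table instead of keeping them as CSV (the database file and table name are just placeholders):

using CSV, DataFrames, SQLite

db = SQLite.DB("LAMMPS.sqlite")
# Reuse the "LAMMPS.csv" produced above; the vector of names tells CSV.read there is no header row.
df = CSV.read("LAMMPS.csv", DataFrame; header = [:id, :type, :xs, :ys, :zs])
SQLite.load!(df, db, "atoms")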

I think the OP probably wants a DataFrame per block, rather than one for the entire file.


It’s not clear precisely what the OP wants, but hopefully my MWE demonstrates the general approach that they will need to take.

The best approach will be some combination of my MWE with the suggestion that @SteffenPL gave.

You can specify the schema you’re interested in and skip the errors. There are also options to retain the errors if you need them.

using DuckDB, DataFrames

db = DBInterface.connect(DuckDB.DB)

DuckDB.execute(db, """
    SELECT * FROM read_csv('~/LAMMPS.txt',
        header = false,
        columns = {'id': 'UINTEGER', 'type': 'UINTEGER', 'xs': 'DOUBLE', 'ys': 'DOUBLE', 'zs': 'DOUBLE'},
        ignore_errors = true,
        auto_detect = false,
        sep = ' ')
""") |> DataFrame

I don’t have formal proof of what was hypothesized in the previous message, but the following could be a strong clue.

CSV.Chunks splits the file by size in bytes, not by number of lines:
julia> CSV.write("chunkdata_az.csv", DataFrame(rand('a':'z',number_of_lines,10), :auto))
"chunkdata_az.csv"


julia> for chunk in CSV.Chunks("chunkdata_az.csv", ntasks=steps)
           #a = chunk |> DataFrame
           display(size((chunk)))
       end
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)

julia> df1=DataFrame(rand(1:2,10^4,10),:auto);

julia> df2=DataFrame(rand(10:99,2500,10),:auto);

julia> df=vcat(df1,df2);

julia> CSV.write("chunkdata.csv", df)
"chunkdata.csv"

julia> for chunk in CSV.Chunks("chunkdata.csv", ntasks=3)    
           #a = chunk |> DataFrame
           display(size((chunk)))
       end
(4584,)
(4583,)
(3333,)

julia> h=length(join(names(df),',')*'\n')
31

julia> rdf1= length(join(string.(rand(1:2,10)),',')*'\n')    
20

julia> rdf2= length(join(string.(rand(10:99,10)),',')*'\n')  
30

julia> 4583*rdf1
91660

julia> (3333-2500)*rdf1+rdf2*2500
91660

Thank you for the suggestion @CameronBieganek, that creates a very tidy file indeed, but I’m not sure whether it is relevant here.
First of all, it discards the BOX BOUNDS section, which I would like to keep/utilize**, and it also writes everything into a new file, which I would then have to import again in order to do useful things with the data.

**these are nuances relevant to my very particular case, and I’d like to keep this post fairly general, hence I’d rather stick to the MRE in the original post.

Sorry for the confusion, to reiterate:
My files can be very big (like, 10 GB big), so I can’t load the whole thing into memory to do my computations.
My strategy, therefore, is to divide the file into “chunks” and loop over them (import a block of data, do the calculations, move to the next).

As mentioned, numpy does the job pretty well, using the following:

using CSV, DataFrames

number_of_lines = 10^5
CSV.write("data.csv", DataFrame(rand(number_of_lines, 10), :auto))

steps = 10
chunk = round(Int, number_of_lines / steps)

# Numpy version, nice and fast
using PythonCall
np = pyimport("numpy")
for i in 1:steps
    @time a = pyconvert(Matrix,  np.loadtxt("data.csv", skiprows=i *chunk-1, max_rows=chunk,delimiter=','))
    #extra computations using "a" would go somewhere here 
end

# CSV version, painfully slow
for i in 1:steps
    @time a = CSV.read("data.csv", DataFrame, skipto=i * chunk, limit=chunk, delim=',')
end

It is equally effective for my ugly LAMMPS-format example, without altering the file.

The main question is:
Does CSV or another julia package offer numpy’s performance (or better) in this case?

It does feel a bit awkward to be looking at Python libraries for speed.
(I know numpy is written in C, but since speed is Julia’s main selling point, this doesn’t make much sense, I believe.)

Here is where using the Arrow format instead of CSV for better performance is discussed.
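
For instance, a hedged sketch of that route with the data.csv from the MRE (this relies on CSV.Chunks exposing Tables.partitions and on Arrow.write turning each partition into its own record batch, which is my understanding of those APIs):

using CSV, Arrow, DataFrames

# One-off conversion: each CSV.Chunks partition becomes one Arrow record batch.
Arrow.write("data.arrow", CSV.Chunks("data.csv"; ntasks = 10))

# Later, iterate over the record batches without re-parsing any text.
for batch in Arrow.Stream("data.arrow")
    df = DataFrame(batch)
    # computations on this chunk go here
end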


Update:
I reckon this is the fastest solution I’ve found so far:

using CSV, DataFrames

number_of_lines = 10^6
CSV.write("data.csv", DataFrame(rand(number_of_lines, 10), :auto))

steps = 10
chunk = round(Int, number_of_lines / steps)

# Preallocate arrays
raw_strings::Vector{String} = [" " for _ in 1:chunk]

open("data.csv") do io
    readline(io) #skip header

    for _ in 1:steps

        @time for i in 1:chunk
            raw_strings[i] = readline(io)
        end

        # split strings and convert to numbers or dataframe
        # do whatever you want with the numbers 
        # move to next block

    end
end
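
To fill in the "convert to numbers" comment above, here is one possible (hedged) helper, assuming the 10 comma-separated Float64 columns of the MRE; chunk_to_matrix is a placeholder name, and eachsplit needs Julia 1.8 or newer (use split on older versions):

# Convert one chunk of raw lines into a Matrix{Float64}.
function chunk_to_matrix(raw_strings, ncols = 10)
    out = Matrix{Float64}(undef, length(raw_strings), ncols)
    for (i, line) in enumerate(raw_strings)
        for (j, field) in enumerate(eachsplit(line, ','))
            out[i, j] = parse(Float64, field)
        end
    end
    return out
end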

Again, thank you everyone for your inputs.

Just as a note, raw_strings is not actually preallocating anything useful here, since raw_strings[i] = readline(io) just replaces the i-th entry of the vector with the new string allocated by readline(io).

Out of curiosity, is this similar to the speed of the numpy approach? Or is it just the fastest Julia solution you benchmarked so far?

Numpy.loadtxt suffers from the same drawback as CSV.read:
for really big files, the first few iterations are lightning fast, but as you dig deeper into the file and skip a lot of lines, you have to wait more and more (for some pointer to reach the desired line, I suppose?).
That being said, for this application numpy is A LOT faster than CSV.

You can check by running the example mentioned in this post using a ridiculously large number of lines / steps.

Using the “open() do io” approach is a little bit slower in the first few iterations (probably because the way I convert the strings into numbers is not really optimised), but there is no slowdown as you go deeper into the file, so it’s by far the fastest I’ve tried for very big files.

Does readline compare to numpy?

PS: have you thought about using readuntil instead of readline?
If you have a block delimiter in your file, you could easily read many lines at a time.
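
For instance (just a sketch, assuming "ITEM: TIMESTEP" is the block delimiter, as in the sample file earlier in the thread):

open("LAMMPS.txt") do io
    while !eof(io)
        block = readuntil(io, "ITEM: TIMESTEP")   # everything up to the next block header
        isempty(strip(block)) && continue          # skip the empty chunk before the first header
        # `block` now holds one timestep's worth of text; parse it as needed
    end
end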

The first package I would have considered for this kind of file is DelimitedFiles.jl, the standard library package:
https://docs.julialang.org/en/v1/stdlib/DelimitedFiles/

I usually consider CSV.jl to be a much more optimized package, but that was my first thought after seeing what you are trying to parse.
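
For example, a small illustration with the data.csv from the MRE (readdlm loads everything at once, so this is only practical for files that fit in memory):

using DelimitedFiles

# Read the whole comma-delimited file into a Matrix{Float64}, skipping the header line.
a = readdlm("data.csv", ',', Float64; skipstart = 1)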

Since this is apparently a fairly large file, I might be tempted to memory map the file via

using Mmap
raw_text_bytes = mmap("/path/to/file")

Here’s a demonstration with your code.

julia> using CSV, DataFrames

julia> number_of_lines = 10^6
1000000

julia> CSV.write("data.csv", DataFrame(rand(number_of_lines, 10), :auto))
"data.csv"

julia> filesize("data.csv")
192701426

julia> using Mmap

julia> @time bytes = mmap("data.csv");
  0.000389 seconds (17 allocations: 1.258 KiB)

julia> length(bytes)
192701426

julia> typeof(bytes)
Vector{UInt8} (alias for Array{UInt8, 1})

The primary advantage of this is that I can now address the file as a Vector{UInt8}. This can be a bit deceiving because I have not actually read in the entire file. I have just asked the operating system to map a part of memory to the data in the file.

One thing I could immediately do is wrap these bytes into a String:

julia> @time str = String(bytes)
  0.073851 seconds (1 allocation: 183.774 MiB)
"x1,x2,x3,x4,x5,x6,x7,x8,x9,x10\n0.20525379395107268,0.9146064742772587,0.5211079343285396,0.7405898918376811,0.5800267639439886,0.22153249364934624,0.03482345591410385,0.8843802225385223,0.02337102905145616,0.17956941539426485\n0.29484986977850236,0.6967728601507807,0.9" ⋯ 192700891 bytes ⋯ "111415876886638,0.17683327154449613,0.6676767699457085,0.8606904585229882\n0.7173032125820923,0.7097713832072097,0.7130825101663243,0.3366146394299203,0.22490196540304352,0.1617334169795367,0.9121761896079096,0.14191943750057856,0.8887615924373486,0.22171906127301277\n"

Note that String steals the underlying data buffer from bytes.

Something else that I can do fairly quickly is figure out where the newlines are.

julia> @time bytes = mmap("data.csv");
  0.000375 seconds (17 allocations: 1.258 KiB)

julia> @time newline_idx = findall(==(0x0a), bytes);
  0.049455 seconds (6 allocations: 30.605 MiB)

In doing this, I am cheating quite a bit. I’m using prior knowledge that the text is ASCII, meaning that one byte is one character. I’m also not doing any validation. You can read more about this issue in the blog post below by @jakobnissen, where he examines a Mojo parsing benchmark and gives a Julia counter-example:

https://viralinstruction.com/posts/mojo/
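
As a hedged follow-up sketch of how the newline index helps with the original skipping problem: once newline_idx is computed, jumping to an arbitrary range of lines is just indexing into the mmapped bytes, so there is no re-scan cost for blocks deep in the file. Line numbers below are 1-based file lines (header included), and read_file_lines is a placeholder name:

using CSV, DataFrames

# Read file lines first_line:last_line (inclusive) as a headerless DataFrame,
# slicing the mmapped bytes with the precomputed newline positions.
function read_file_lines(bytes, newline_idx, first_line, last_line)
    start = first_line == 1 ? 1 : newline_idx[first_line - 1] + 1
    stop  = newline_idx[last_line]
    return CSV.read(bytes[start:stop], DataFrame; header = false)
end

# e.g. 1000 rows starting half a million lines into the file:
df = read_file_lines(bytes, newline_idx, 500_002, 501_001)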


For very small files numpy seems superior (fewer allocations, but that might be due to bad coding on my side). For skipping a lot of lines, numpy slows down.

Thought about it, but didn’t implement it yet. Thank you for the suggestion.

Thank you very much @mkitti for pointing out the mmap function, I wasn’t familiar with it.

Just to clarify something:
If I identify the range I’d like to read in the “bytes” array from your example, String(bytes[x:y]) gives a string of the characters from x to y, but is there a way to get numbers instead of strings?
Would I have to parse the String() output into numbers instead?
(e.g. parse.(Float64, split(String(bytes[x:y]), ',')) )

Just looking for the most efficient way to get numbers, with the least amount of allocations possible.