Skipping a lot of lines in CSV.read() allocates too much memory


“CSV” in practice refers to a wide variety of related formats. But something which doesn’t have a header followed by rows of columns isn’t really a CSV, it’s more like a bunch of CSVs concatenated together. Knowing more about the format I doubt SQLite would succeed in reading it in without some custom code, although I suppose it’s worth a shot, SQLite’s capabilities have surprised me before.


What you have is not a tabular format, thus the suggestions so far are of limited use. Looks to me like it would be better read as JSON or similar.

I assume that since you are asking about DataFrames and whatnot that you are only interested in what must be those things that look like atom ids and coordinates. You can modify the above procedures to iterate through the file looking for "ITEM: ATOMS id type xs ys zs", then accumulate all of the following lines into an IOBuffer, stopping when you reach the next line beginning with "ITEM", then output the IOBuffer to DataFrame. Rinse and repeat for the next section.
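
For example, here is a minimal sketch of that accumulate-per-block idea, assuming the header line is exactly "ITEM: ATOMS id type xs ys zs" and the data columns are single-space separated (the name read_atom_blocks is just a placeholder):

using CSV, DataFrames

function read_atom_blocks(path)
    dfs = DataFrame[]
    buf = IOBuffer()
    in_block = false
    for line in eachline(path)
        if startswith(line, "ITEM: ATOMS")
            in_block = true                      # start accumulating data lines
        elseif startswith(line, "ITEM")
            if in_block                          # block finished: parse the buffered lines
                push!(dfs, CSV.read(take!(buf), DataFrame;
                                    header = [:id, :type, :xs, :ys, :zs], delim = ' '))
                in_block = false
            end
        elseif in_block
            println(buf, line)                   # data line: append it to the buffer
        end
    end
    # flush the last block at end of file
    in_block && push!(dfs, CSV.read(take!(buf), DataFrame;
                                    header = [:id, :type, :xs, :ys, :zs], delim = ' '))
    return dfs
end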

If you wanted to read the file in parallel, you can do 1 pass to get the start/stop lines and then a second pass where you read multiple chunks into DataFrames separately, although you probably need to memory map the file or something for this to work.
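
A rough sketch of that two-pass idea, under the same assumptions about the "ITEM: ATOMS id type xs ys zs" header and space-separated columns (block_ranges and read_blocks are placeholder names, and the threaded loop only helps if Julia was started with multiple threads):

using CSV, DataFrames, Mmap

# Pass 1: record the byte range of every "ITEM: ATOMS ..." data block (1-based, inclusive).
function block_ranges(path)
    ranges = UnitRange{Int}[]
    start = 0
    open(path) do io
        while !eof(io)
            linestart = position(io)            # 0-based offset of this line's first byte
            line = readline(io)
            if startswith(line, "ITEM: ATOMS")
                start = position(io) + 1        # first data byte, as a 1-based index
            elseif startswith(line, "ITEM") && start > 0
                push!(ranges, start:linestart)  # block ended just before this header line
                start = 0
            end
        end
        start > 0 && push!(ranges, start:position(io))   # last block runs to end of file
    end
    return ranges
end

# Pass 2: parse the blocks independently.
function read_blocks(path, ranges)
    bytes = mmap(path)
    dfs = Vector{DataFrame}(undef, length(ranges))
    Threads.@threads for i in eachindex(ranges)
        dfs[i] = CSV.read(bytes[ranges[i]], DataFrame;
                          header = [:id, :type, :xs, :ys, :zs], delim = ' ')
    end
    return dfs
end

dfs = read_blocks("LAMMPS.txt", block_ranges("LAMMPS.txt"))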


Thank you for the suggestion.
I suppose since numpy works well, I might just stick to that for my particular ugly-file scenario. That might be the most elegant solution at this stage.

As for the general, well-behaved case in the MRE above (for which the topic was created), I’m not sure whether we got an answer or not.

Perhaps LAMMPS.jl could be of assistance?

Try starting with a table containing the same values in each cell. Then try some variations.

I would preprocess the data and turn it into a real CSV file. Here is a MWE.

Start with the following LAMMPS file ("LAMMPS.txt"):

ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
2000
ITEM: BOX BOUNDS pp pp pp
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0.239375 0.505765 0.143033
2 2 0.254461 0.510929 0.143544
3 2 0.240602 0.489869 0.142478
4 3 0.23796 0.503614 0.142924
ITEM: TIMESTEP
1
ITEM: NUMBER OF ATOMS
2000
ITEM: BOX BOUNDS pp pp pp
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
-3.0000000000000000e+01 3.0000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0.242619 0.507873 0.136935
2 2 0.253103 0.515054 0.14658
3 2 0.240602 0.489869 0.142478
4 3 0.23796 0.503614 0.142924

Now run the following code:

lines = eachline("LAMMPS.txt")
lines_split = Iterators.map(line -> split(line, ' '), lines)
data_lines_split = Iterators.filter(vec -> length(vec) == 5, lines_split)
data_lines = Iterators.map(vec -> join(vec, ','), data_lines_split)

open("LAMMPS.csv", "w") do io
    for line in data_lines
        println(io, line)
    end
end

Here is the output file:

1,1,0.239375,0.505765,0.143033
2,2,0.254461,0.510929,0.143544
3,2,0.240602,0.489869,0.142478
4,3,0.23796,0.503614,0.142924
1,1,0.242619,0.507873,0.136935
2,2,0.253103,0.515054,0.14658
3,2,0.240602,0.489869,0.142478
4,3,0.23796,0.503614,0.142924

Note that this solution streams through the input file one row at a time, so it should work on files that are larger than the RAM available on your machine.

Of course, instead of creating a new CSV file, you could try to save the cleaned up data in a Parquet or Arrow file, or as a SQLite or DuckDB table. That would probably require a bit more processing than what I have in the MWE above.
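
For instance, a hedged variation on the MWE above that loads the cleaned-up rows into a SQLite table instead of keeping them as CSV (the database file and table name are just placeholders):

using CSV, DataFrames, SQLite

db = SQLite.DB("LAMMPS.sqlite")
# Reuse the "LAMMPS.csv" produced above; the vector of names tells CSV.read there is no header row.
df = CSV.read("LAMMPS.csv", DataFrame; header = [:id, :type, :xs, :ys, :zs])
SQLite.load!(df, db, "atoms")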

I think the OP probably wants a DataFrame per block, rather than one for the entire file.


It’s not clear precisely what the OP wants, but hopefully my MWE demonstrates the general approach that they will need to take.

The best approach will be some combination of my MWE with the suggestion that @SteffenPL gave.

You can specify the schema you’re interested in and skip the errors. There are also options to retain the errors if you need them.

using DuckDB, DataFrames

db = DBInterface.connect(DuckDB.DB)

DuckDB.execute(db, """
    SELECT * FROM read_csv('~/LAMMPS.txt',
        header = false,
        columns = {'id': 'UINTEGER', 'type': 'UINTEGER', 'xs': 'DOUBLE', 'ys': 'DOUBLE', 'zs': 'DOUBLE'},
        ignore_errors = true,
        auto_detect = false,
        sep = ' ')
""") |> DataFrame

I don’t have formal proof of what was hypothesized in the previous message, but the following could be a strong clue.

CSV.Chunks splits the file by size in bytes, not by number of lines:
julia> CSV.write("chunkdata_az.csv", DataFrame(rand('a':'z',number_of_lines,10), :auto))
"chunkdata_az.csv"


julia> for chunk in CSV.Chunks("chunkdata_az.csv", ntasks=steps)
           #a = chunk |> DataFrame
           display(size((chunk)))
       end
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)

julia> df1=DataFrame(rand(1:2,10^4,10),:auto);

julia> df2=DataFrame(rand(10:99,2500,10),:auto);

julia> df=vcat(df1,df2);

julia> CSV.write("chunkdata.csv", df)
"chunkdata.csv"

julia> for chunk in CSV.Chunks("chunkdata.csv", ntasks=3)    
           #a = chunk |> DataFrame
           display(size((chunk)))
       end
(4584,)
(4583,)
(3333,)

julia> h=length(join(names(df),',')*'\n')
31

julia> rdf1= length(join(string.(rand(1:2,10)),',')*'\n')    
20

julia> rdf2= length(join(string.(rand(10:99,10)),',')*'\n')  
30

julia> 4583*rdf1
91660

julia> (3333-2500)*rdf1+rdf2*2500
91660

Thank you for the suggestion @CameronBieganek, that creates a very tidy file indeed, but I’m not sure whether it is relevant here.
First of all, it discards the BOX BOUNDS section, which I would like to keep/utilize**, and it also writes everything into a new file, which I would then have to import again in order to do useful things with the data.

**these are nuances relevant to my very particular case, and I’d like to keep this post fairly general, hence I’d rather stick to the MRE in the original post.

Sorry for the confusion, to reiterate:
My files can be very big (like, 10 GB big), so I can’t load the whole thing into memory to do my computations.
My strategy, therefore, is to divide the file into “chunks” and loop over them (import a block of data, do the calculations, move to the next).

As mentioned, numpy does the job pretty well, using the following:

using CSV, DataFrames

number_of_lines = 10^5
CSV.write("data.csv", DataFrame(rand(number_of_lines, 10), :auto))

steps = 10
chunk = round(Int, number_of_lines / steps)

# Numpy version, nice and fast
using PythonCall
np = pyimport("numpy")
for i in 1:steps
    @time a = pyconvert(Matrix,  np.loadtxt("data.csv", skiprows=i *chunk-1, max_rows=chunk,delimiter=','))
    #extra computations using "a" would go somewhere here 
end

# CSV version, painfully slow
for i in 1:steps
    @time a = CSV.read("data.csv", DataFrame, skipto=i * chunk, limit=chunk, delim=',')
end

It is equally effective for my ugly LAMMPS-format example, without altering the file.

The main question is:
Does CSV or another julia package offer numpy’s performance (or better) in this case?

It does feel a bit awkward to be looking at Python libraries for speed.
(I know numpy is written in C, but since speed is Julia’s main selling point, this doesn’t make much sense, I believe.)

Here is where using the Arrow format instead of CSV for better performance is discussed.
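
For instance, a hedged sketch of that route with the data.csv from the MRE (this relies on CSV.Chunks exposing Tables.partitions and on Arrow.write turning each partition into its own record batch, which is my understanding of those APIs):

using CSV, Arrow, DataFrames

# One-off conversion: each CSV.Chunks partition becomes one Arrow record batch.
Arrow.write("data.arrow", CSV.Chunks("data.csv"; ntasks = 10))

# Later, iterate over the record batches without re-parsing any text.
for batch in Arrow.Stream("data.arrow")
    df = DataFrame(batch)
    # computations on this chunk go here
end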


Update:
I reckon this is the fastest solution I’ve found so far:

using CSV, DataFrames

number_of_lines = 10^6
CSV.write("data.csv", DataFrame(rand(number_of_lines, 10), :auto))

steps = 10
chunk = round(Int, number_of_lines / steps)

# Preallocate arrays
raw_strings::Vector{String} = [" " for _ in 1:chunk]

open("data.csv") do io
    readline(io) #skip header

    for _ in 1:steps

        @time for i in 1:chunk
            raw_strings[i] = readline(io)
        end

        # split strings and convert to numbers or dataframe
        # do whatever you want with the numbers 
        # move to next block

    end
end
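
To fill in the "convert to numbers" comment above, here is one possible (hedged) helper, assuming the 10 comma-separated Float64 columns of the MRE; chunk_to_matrix is a placeholder name, and eachsplit needs Julia 1.8 or newer (use split on older versions):

# Convert one chunk of raw lines into a Matrix{Float64}.
function chunk_to_matrix(raw_strings, ncols = 10)
    out = Matrix{Float64}(undef, length(raw_strings), ncols)
    for (i, line) in enumerate(raw_strings)
        for (j, field) in enumerate(eachsplit(line, ','))
            out[i, j] = parse(Float64, field)
        end
    end
    return out
end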

Again, thank you everyone for your inputs.

Just as a note, raw_strings is not actually preallocating anything useful here, since raw_strings[i] = readline(io) just replaces the i-th entry of the vector with the new string allocated by readline(io).

Out of curiosity, is this similar to the speed of the numpy approach? Or is it just the fastest Julia solution you benchmarked so far?

Numpy.loadtxt suffers from the same drawback as CSV.read:
for really big files, the first few iterations are lightning fast, but as you dig deeper into the file and skip a lot of lines, you have to wait more and more (for some pointer to reach the desired line, I suppose?).
That being said, for this application numpy is A LOT faster than CSV.

You can check by running the example mentioned in this post using a ridiculously large number of lines / steps.

Using the “open() do io” approach is a little bit slower in the first few iterations (probably because the way I convert the strings into numbers is not really optimised), but there is no slowdown as you go deeper into the file, so it’s by far the fastest I’ve tried for very big files.

Does readline compare to numpy?

PS: have you thought about using readuntil instead of readline?
If you have a block delimiter in your file, you could easily read many lines at a time.
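
For instance (just a sketch, assuming "ITEM: TIMESTEP" is the block delimiter, as in the sample file earlier in the thread):

open("LAMMPS.txt") do io
    while !eof(io)
        block = readuntil(io, "ITEM: TIMESTEP")   # everything up to the next block header
        isempty(strip(block)) && continue          # skip the empty chunk before the first header
        # `block` now holds one timestep's worth of text; parse it as needed
    end
end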

The first package I would have considered for this kind of file is DelimitedFiles.jl, the standard library package:
https://docs.julialang.org/en/v1/stdlib/DelimitedFiles/

I usually consider CSV.jl to be a much more optimized package, but that was my first thought after seeing what you are trying to parse.
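
For example, a small illustration with the data.csv from the MRE (readdlm loads everything at once, so this is only practical for files that fit in memory):

using DelimitedFiles

# Read the whole comma-delimited file into a Matrix{Float64}, skipping the header line.
a = readdlm("data.csv", ',', Float64; skipstart = 1)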

Since this is apparently a fairly large file, I might be tempted to memory map the file via

using Mmap
raw_text_bytes = mmap("/path/to/file")

Here’s a demonstration with your code.

julia> using CSV, DataFrames

julia> number_of_lines = 10^6
1000000

julia> CSV.write("data.csv", DataFrame(rand(number_of_lines, 10), :auto))
"data.csv"

julia> filesize("data.csv")
192701426

julia> using Mmap

julia> @time bytes = mmap("data.csv");
  0.000389 seconds (17 allocations: 1.258 KiB)

julia> length(bytes)
192701426

julia> typeof(bytes)
Vector{UInt8} (alias for Array{UInt8, 1})

The primary advantage of this is that I can now address the file as a Vector{UInt8}. This can be a bit deceiving because I have not actually read in the entire file. I have just asked the operating system to map a part of memory to the data in the file.

One thing I could immediately do is wrap these bytes into a String:

julia> @time str = String(bytes)
  0.073851 seconds (1 allocation: 183.774 MiB)
"x1,x2,x3,x4,x5,x6,x7,x8,x9,x10\n0.20525379395107268,0.9146064742772587,0.5211079343285396,0.7405898918376811,0.5800267639439886,0.22153249364934624,0.03482345591410385,0.8843802225385223,0.02337102905145616,0.17956941539426485\n0.29484986977850236,0.6967728601507807,0.9" ⋯ 192700891 bytes ⋯ "111415876886638,0.17683327154449613,0.6676767699457085,0.8606904585229882\n0.7173032125820923,0.7097713832072097,0.7130825101663243,0.3366146394299203,0.22490196540304352,0.1617334169795367,0.9121761896079096,0.14191943750057856,0.8887615924373486,0.22171906127301277\n"

Note that String steals the underlying data buffer from bytes.

Something else that I can do fairly quickly is figure out where the newlines are.

julia> @time bytes = mmap("data.csv");
  0.000375 seconds (17 allocations: 1.258 KiB)

julia> @time newline_idx = findall(==(0x0a), bytes);
  0.049455 seconds (6 allocations: 30.605 MiB)

In doing this, I am cheating quite a bit. I’m using prior knowledge that the text is ASCII, meaning that one byte is one character. I’m also not doing any validation. You can read more about this issue in the blog post below by @jakobnissen, where he examines a Mojo parsing benchmark and gives a Julia counter-example:

https://viralinstruction.com/posts/mojo/
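
As a hedged follow-up sketch of how the newline index helps with the original skipping problem: once newline_idx is computed, jumping to an arbitrary range of lines is just indexing into the mmapped bytes, so there is no re-scan cost for blocks deep in the file. Line numbers below are 1-based file lines (header included), and read_file_lines is a placeholder name:

using CSV, DataFrames

# Read file lines first_line:last_line (inclusive) as a headerless DataFrame,
# slicing the mmapped bytes with the precomputed newline positions.
function read_file_lines(bytes, newline_idx, first_line, last_line)
    start = first_line == 1 ? 1 : newline_idx[first_line - 1] + 1
    stop  = newline_idx[last_line]
    return CSV.read(bytes[start:stop], DataFrame; header = false)
end

# e.g. 1000 rows starting half a million lines into the file:
df = read_file_lines(bytes, newline_idx, 500_002, 501_001)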


For very small files numpy seems superior (fewer allocations, but that might be due to bad coding on my side). For skipping a lot of lines, numpy slows down.

Thought about it, but didn’t implement it yet. Thank you for the suggestion.

Thank you very much @mkitti for pointing out the mmap function, I wasn’t familiar with it.

Just to clarify something:
If I identify the range I’d like to read in the “bytes” array from your example, String(bytes[x:y]) gives a string of the characters from x to y, but is there a way to get numbers instead of strings?
Would I have to parse the String() output into numbers instead?
(e.g. parse.(Float64, split(String(bytes[x:y]), ',')) )

Just looking for the most efficient way to get numbers, with the least amount of allocations possible.