Skipping a lot of lines in CSV.read() allocates too much memory


I think CSV.read memory-maps the file anyway, but maybe the overhead of mmap is large enough that doing it once and passing the result as input would speed things up.
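A minimal sketch of what that could look like (the file name data.csv is just a placeholder):

using Mmap, CSV, DataFrames

bytes = Mmap.mmap("data.csv")              # map the whole file once as a Vector{UInt8}
df = CSV.read(IOBuffer(bytes), DataFrame)  # reuse the same mapped bytes for each read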

1 Like

A few details to keep in mind:

For very small files numpy seems superior (fewer allocations)

The @time macro only tracks allocations made by Julia. For numpy that means it only tells you how expensive it is to call Python, not how much the Python code itself allocated.

(e.g. parse.(Float64, split(String(bytes[x:y]), ',')) )

Most likely parse is not the most performant option here. Also, bytes[x:y] allocates a new array of bytes; use @view bytes[x:y] instead.
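A minimal sketch of the difference, with placeholder bytes:

bytes = Vector{UInt8}("0.14639761,0.98765432")

chunk_copy = bytes[1:10]             # allocates a fresh Vector{UInt8}
chunk_view = @view bytes[1:10]       # lightweight view into the existing buffer

parse(Float64, String(chunk_view))   # the String is still a copy, but the slice copy is avoided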

You can directly use DelimitedFiles (juliadata.org) or CSV.read with an IOBuffer like in my initial post.

For people trying to help it might be useful if you post your benchmark timings. That way one can see how the methods compare without having to fire up a Julia session.


I would also echo that. Your segments seem to be around 2000 lines each, which means you will not benefit from the power of mmap anyway.

1 Like

I'm worried that you're doing some premature optimization here. Are the allocations really the main issue? You probably do want to use a validating parser. That said, we can continue.
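For example, Base's parse is validating in the sense that it rejects malformed input outright (a made-up string for illustration):

parse(Float64, "0.14x6")   # throws an ArgumentError instead of returning a wrong value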

Let's say we isolate a line. We can then find the commas.

julia> line = bytes[226:418];

julia> commas = findall(==(UInt8(',')), line)
9-element Vector{Int64}:
  20
  40
  59
  79
  98
 118
 136
 155
 174

We now know the first number is contained within the first 19 characters.

julia> String(copy(line[1:19]))
"0.14639761075652213"

julia> line[1:19]
19-element Vector{UInt8}:
 0x30
 0x2e
 0x31
 0x34
 0x36
 0x33
 0x39
 0x37
 0x36
 0x31
 0x30
 0x37
 0x35
 0x36
 0x35
 0x32
 0x32
 0x31
 0x33

To decode the bytes, we need to understand the ASCII value of the digits.

julia> map(0:9) do i
           UInt8("$i"[1])
       end
10-element Vector{UInt8}:
 0x30
 0x31
 0x32
 0x33
 0x34
 0x35
 0x36
 0x37
 0x38
 0x39

julia> map(0:9) do i
           UInt8("$i"[1]) - 0x30 |> Int8
       end 
10-element Vector{Int8}:
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9

Above we see that '0' corresponds to 0x30 (decimal 48). Thus we just subtract that value from the byte values to get the digits.

julia> line[1:19] .- 0x30 .|> Int
19-element Vector{Int64}:
   0
 254
   1
   4
   6
   3
   9
   7
   6
   1
   0
   7
   5
   6
   5
   2
   2
   1
   3

The exception above (254) is due to the '.' character:

julia> UInt8('.') - 0x30 |> Int
254

The remaining digits can then be combined in base 10 and divided by the appropriate power of ten to recover the intended value.

julia> sum((line[3:19] .- 0x30) .* (10 .^ (16:-1:0)))
14639761075652213

julia> sum((line[3:19] .- 0x30) .* (10 .^ (16:-1:0))) / 10^17
0.1463976107565221

julia> String(copy(line[1:19]))
"0.14639761075652213"

Combining the steps we could build our own float parser that looks as follows.

julia> function parse_float_no_allocations(bytes::AbstractVector{UInt8}, _start::Int = 1, _end::Int = length(bytes))
           int_val = 0
           power = 0
           for i in _start:_end
               byte = bytes[i]
               if byte != 0x2e
                   int_val *= 10
                   int_val += byte - 0x30
                   power *= 10
               else
                   power = 1
               end
           end
           q = int_val / power 
           return q
       end
parse_float_no_allocations (generic function with 3 methods)

julia> x = rand()
0.04775704047068485

julia> x_bytes = Vector{UInt8}(string(x));

julia> @time parse_float_no_allocations(x_bytes)
  0.000006 seconds (1 allocation: 16 bytes)
0.04775704047068485

julia> using BenchmarkTools

julia> @btime parse_float_no_allocations($x_bytes)
  38.345 ns (0 allocations: 0 bytes)
0.04775704047068485

julia> str = String(copy(x_bytes))
"0.04775704047068485"

julia> @btime parse(Float64, $str)
  156.581 ns (0 allocations: 0 bytes)
0.04775704047068485

That's a pretty crude parser, but I think you can see the idea. I'm not sure what the advantage of implementing a parser from scratch would be, other than incorporating prior knowledge you have about your data and cutting corners. Perhaps it would be useful to build a custom high-performance parser around your data format, though.
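As a rough sketch of how the pieces could fit together (reusing line, commas and parse_float_no_allocations from above, and assuming the line contains only digits, dots and commas), every field of the line could be parsed without creating intermediate strings:

function parse_line!(out::Vector{Float64}, line::AbstractVector{UInt8}, commas::Vector{Int})
    bounds = [0; commas; length(line) + 1]   # field i spans bounds[i]+1 : bounds[i+1]-1
    for i in 1:length(bounds)-1
        out[i] = parse_float_no_allocations(line, bounds[i] + 1, bounds[i+1] - 1)
    end
    return out
end

values = parse_line!(zeros(length(commas) + 1), line, commas)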

Yes, I think it does try to memory map.

2 Likes

That is a very good point, and it didn't cross my mind, thanks!

Sorry, I didn't think about this previously; I will definitely keep that in mind for subsequent posts.

Assuming your suggestions refer to this file format and not the MRE in the original post, here's how I'm currently reading it:

num_lines = countlines(filepath)
num_atoms = 2000
num_steps = (floor(Int, num_lines / (num_atoms + 9) ))

boxlengths::Vector{Float64} = zeros(3)
positions::Vector{Vector{Float64}} = [zeros(3) for _ in 1:num_atoms]

@time open(filepath) do io

    for s in 1:num_steps

        # Skip headers
        for _ in 1:5
            readline(io)
        end

        # Read box bounds
        for i in 1:3
            boxlengths[i] = sum(abs.(parse.(Float64, split(readline(io), ' '))))
        end

        # Skip header
        readline(io)

        # Read positions
        for i in 1:num_atoms
            positions[i] = parse.(Float64, split(readline(io), ' '))[3:5]
        end

        #Do calculations here

    end

end
0.926642 seconds (3.06 M allocations: 217.515 MiB, 2.07% gc time, 5.83% compilation time)

And that's pretty decent, I reckon. I plan to replace the vectors with SVectors at some point, assuming that would make it faster overall. Perhaps parse.() could also be replaced, but I'm not sure (yet) exactly how.

Replacing the "# Skip headers" loop with the line readuntil(io, "pp\n") (which does the same thing) gives exactly the same performance.

Would that imply replacing the "# Read positions" part of the example with a CSV.read(something)?
Not quite sure how this command works in this context, if I'm honest.

Very educational post @mkitti, thanks a lot!

I'll keep that in mind; perhaps I'll dive into it once I optimize the rest of the code.

You can do it like this:

using CSV, DataFrames

open("test.data") do f
    readuntil(f, "ITEM: ")  # skip over first ITEM: string
    while !eof(f)

        boxsize = CSV.read(IOBuffer(readuntil(f, "ITEM: ")), DataFrame)
        positions = CSV.read(IOBuffer(readuntil(f, "ITEM: ")), DataFrame)

    end
end

You also get the correct labels from the file format, e.g. you can do stuff like positions.xs etc.

3 Likes

That's pretty elegant indeed (and a bit faster), thank you very much for the suggestion!

Out of curiosity, would there be a way of reading the data into a vector of vectors (or even SVectors) instead of a DataFrame?

You can convert it like this:

using StaticArrays

SVector{3, Float64}.(positions.xs, positions.ys, positions.zs)

During the parsing itself there is no performance benefit from using SVector; the benefit, if any, comes in your subsequent processing.

2 Likes

Give this a try:


r="123.14639761075652213"
dp=findfirst('.',r)
pi= r[1:dp-1]
pd= r[end:-1:dp+1]


PD=reduce((s,c)->(s+(c-'0'))/10,pd,init=0)

PI=reduce((s,c)->s*10+(c-'0'),pi,init=0)

PI+PD

PS: Can you tell us if the solution with readuntil is still slower than numpy?
Could you post the benchmarks of the different proposals?

Thank you for the suggestion @rocco_sprmnt21.
Could you please explain a bit what's happening in this code?
How would you use it to read a file? (for example, the MRE in the original post)

Good point, I will prepare a comparison and post it when I find some time.

FWIW, a version using a mutable structure to collect the data (*edited with rocco's append! suggestion):

using CSV, DataFrames

file = raw"C:\Users\...\LAMMPS_test.data.txt"

Base.@kwdef mutable struct MyData
    timestep::Vector{Int} = Int[]
    natoms::Int = 0
    boxsize::Matrix{Float64} = Array{Float64}(undef, 0, 0)
    positions::DataFrame = DataFrame()
end

M = MyData()

# Read repeated items only once:
open(file) do f
    readuntil(f, "ITEM: NUMBER OF ATOMS")
    M.natoms = parse(Int, readuntil(f, "ITEM: BOX BOUNDS pp pp pp"))
    M.boxsize = Matrix{Float64}(CSV.read(IOBuffer(readuntil(f, "ITEM: ATOMS")), DataFrame, header=false))
end

# Read varying items:
open(file) do f
    readline(f)
    while !eof(f)
        push!(M.timestep, parse(Int, readuntil(f, "ITEM: NUMBER OF ATOMS")))
        readuntil(f, "ITEM: ATOMS")
        append!(M.positions, CSV.read(IOBuffer(readuntil(f, "ITEM: TIMESTEP")), DataFrame; delim=' ', ignorerepeated=true))
    end
end

# Access loaded data with:
M.timestep
M.natoms
M.boxsize
M.positions
M.positions.zs

Out of interest, did you attempt the Arrow approach? Further, did you know that you can store a valid struct type in the Arrow format, so you could convert into rows of lammps(?) data and retrieve them by index after the fact?

You could essentially do: CSV → Arrow of lammps rows. If you then decided to uniquely identify them and store a dictionary (mapping lammps data key to row index), you could grab rows almost instantly on demand and then act on them in your downstream code.
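I have not benchmarked this against the thread's data, but a rough sketch with Arrow.jl (the positions table and its timestep column here are placeholders) might look like:

using Arrow, DataFrames

positions = DataFrame(timestep = repeat(0:1, inner = 3),
                      xs = rand(6), ys = rand(6), zs = rand(6))

Arrow.write("positions.arrow", positions)    # one-off conversion to Arrow on disk

tbl = Arrow.Table("positions.arrow")         # memory-mapped on read
df  = DataFrame(tbl; copycols = false)       # wrap the Arrow columns without copying
index = Dict(ts => findall(==(ts), df.timestep) for ts in unique(df.timestep))

chunk = df[index[1], :]                      # grab one timestep's rows on demand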

Regards,

Thank you very much for this suggestion @rafael.guerra.

This solution is a bit slower than the one in this post; however, the results are very tidy! I could see myself using a sweet spot between performance and tidiness, depending on the demands of my final code.

Here are some rough benchmarks (for a file of 403809 lines, using my machine):

Most efficient (from this post, adapted from Steffen's): 1 second
Mutable structure (from this post): 2 seconds
Numpy: 5.5 seconds, using the following code

for i in 1:steps
    a = pyconvert(Matrix, np.loadtxt(filepath, skiprows=floor(Int, (i-1)*(2000+9) + 9), max_rows=2000, delimiter=' '))
end

CSV.read: the naive implementation of the original post (as seen below) makes my laptop scream and doesn't finish even if I leave it running for over a couple of minutes.

for i in 1:steps
    a = CSV.read(filepath, DataFrame, skipto=floor(Int, (i-1)*(2000+9) + 10), limit=2000, header=false, delim=' ')
end

When I find some time, I also have to test the SQLite method mentioned by @mnemnion, DuckDB mentioned by @rdavis120, and Arrow mentioned by @rocco_sprmnt21, but that would require some homework; hopefully I'll be able to test them soon.
Also, thank you @djholiver for the additional information on Arrow, I'll consider that!

Didn't expect so many options to be available; props to everyone who's putting in the time and effort to develop the Julia ecosystem. Kudos!

Could you post the final code using Steffen's solution?

How does the DataFrame created with readuntil() accumulate the different chunks? When I try it, it seems to replace the previous chunk.

PS: This was the reason I put an ugly vcat in my code example.

To your comment on the ecosystem: is this use case repeatable or broadly used enough to justify a package? The LAMMPS format appears to be well known.

Aside from that, I can't see any data in the posts around which to create a code snippet and an Arrow version. If you are able to post some, I might have the time to provide a benchmark of the above suggestion, if that would prove useful.

See this post.

1 Like

@rafael.guerra Here is the code I ended up using (which I corrected slightly from the original post).

using CSV, DataFrames

open(filepath) do f
    while !eof(f)
        readuntil(f, "ITEM: BOX BOUNDS")
        boxsize = CSV.read(IOBuffer(readuntil(f, "ITEM: ATOMS")), DataFrame, header=false, skipto=2)
        positions = CSV.read(IOBuffer(readuntil(f, "ITEM: ")), DataFrame, header=false, skipto=2)

    end
end

@djholiver Here is a link to a file using the LAMMPS format. For future reference, this is a OneDrive link and it will expire in one month if I'm not mistaken; sorry, but I don't know if I could keep the link active forever.

As for this comment, that's just genuine appreciation for everyone who provides me with the tools to do my job efficiently, for free. I wouldn't dare ask for something more specific to the LAMMPS format; I think what's already available would be more than enough.
This is why I made the original post using a generic MRE (here's the general solution, for anyone who's lost among all the posts in this thread): to find something that can be used by anyone with a similar problem.

Nevertheless, I really do appreciate how you guys offer to help with my specific example.

1 Like

This doesn't seem to give a complete answer to @rafael.guerra's question, at least as I understand it.
I'll try to make the point clear. There is a text file (made up of various pieces of non-homogeneous tables) that is too large to be loaded into memory and processed all at once.
It is then loaded in chunks and processed (how? the detail of this aspect is missing). What happens to the various processed pieces? Are they aggregations of the original pieces, and therefore, being reduced, can they be combined into a single table? Or must different tables be created (and saved separately)?
Or something else?
PS: Using append! instead of vcat in @rafael.guerra's snippet, the times could be shortened.
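A tiny toy example of the difference (placeholder column names):

using DataFrames

df    = DataFrame(xs = Float64[], ys = Float64[], zs = Float64[])
chunk = DataFrame(xs = rand(3), ys = rand(3), zs = rand(3))

df = vcat(df, chunk)   # allocates a brand-new DataFrame on every call
append!(df, chunk)     # grows the existing columns in place instead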

2 Likes

Thanks for the link. Hopefully I'll get some time to have a look.

It appears I was unclear: it was a genuine question whether this should end up as a package, given the data format and its possible uses in scientific fields.

Thanks for the insight @rocco_sprmnt21, I didn't read @rafael.guerra's post carefully enough.

In my particular case, I would like to do some calculations using the "chunk" of data, then replace it with the next chunk, effectively discarding it. I don't want to accumulate the chunks into another array, since that would be too much for my RAM if I were to use a full-size example.

I suppose that might be far from the topic of this thread, which is why I avoided it. To satisfy anyone's curiosity: I apply a series of functions to the data, which leads to a 3-element vector for each chunk; that vector is saved into a vector of vectors and everything else is discarded. I'm basically trying to calculate some correlation functions.

I appreciate it!

I don't think I can answer that, unfortunately. LAMMPS is just one among many different software packages people use to perform molecular dynamics simulations. It is the only one I've used so far, and I'm very much a beginner in this field. I reckon different packages could have their own unique output data formats, following different conventions. Usually, people working in the field are fairly comfortable with programming, and they would surely be able to write efficient code to import their data, or perhaps modify the format exported by the MD software.