Skipping a lot of lines in CSV.read() allocates too much memory


I think CSV.read memory-maps the file anyway, but maybe the overhead of mmap is large enough that doing it once and passing the result as input would speed things up.
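A minimal sketch of what that could look like (the file name data.csv is just a placeholder):

using Mmap, CSV, DataFrames

bytes = Mmap.mmap("data.csv")              # map the whole file once as a Vector{UInt8}
df = CSV.read(IOBuffer(bytes), DataFrame)  # reuse the same mapped bytes for each read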

1 Like

A few details to keep in mind:

For very small files numpy seems superior (fewer allocations)

The @time macro only tracks allocations made by Julia. For numpy that means it only tells you how expensive it is to call Python, not how much the Python code itself allocated.

(e.g. parse.(Float64, split(String(bytes[x:y]), ',')) )

Most likely parse is not the most performant option here. Also, bytes[x:y] allocates a new array of bytes; use @view bytes[x:y] instead.
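A minimal sketch of the difference, with placeholder bytes:

bytes = Vector{UInt8}("0.14639761,0.98765432")

chunk_copy = bytes[1:10]             # allocates a fresh Vector{UInt8}
chunk_view = @view bytes[1:10]       # lightweight view into the existing buffer

parse(Float64, String(chunk_view))   # the String is still a copy, but the slice copy is avoided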

You can directly use DelimitedFiles (juliadata.org) or CSV.read with an IOBuffer like in my initial post.

For people trying to help it might be useful if you post your benchmark timings. That way one can see how the methods compare without having to fire up a Julia session.


I would also echo that. Your segments seem to be around 2000 lines each, which means you will not benefit from the power of mmap anyway.

1 Like

I'm worried that you're doing some premature optimization here. Are the allocations really the main issue? You probably do want to use a validating parser. That said, we can continue.
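For example, Base's parse is validating in the sense that it rejects malformed input outright (a made-up string for illustration):

parse(Float64, "0.14x6")   # throws an ArgumentError instead of returning a wrong value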

Let's say we isolate a line. We can then find the commas.

julia> line = bytes[226:418];

julia> commas = findall(==(UInt8(',')), line)
9-element Vector{Int64}:
  20
  40
  59
  79
  98
 118
 136
 155
 174

We now know the first number is contained within the first 19 characters.

julia> String(copy(line[1:19]))
"0.14639761075652213"

julia> line[1:19]
19-element Vector{UInt8}:
 0x30
 0x2e
 0x31
 0x34
 0x36
 0x33
 0x39
 0x37
 0x36
 0x31
 0x30
 0x37
 0x35
 0x36
 0x35
 0x32
 0x32
 0x31
 0x33

To decode the bytes, we need to understand the ASCII value of the digits.

julia> map(0:9) do i
           UInt8("$i"[1])
       end
10-element Vector{UInt8}:
 0x30
 0x31
 0x32
 0x33
 0x34
 0x35
 0x36
 0x37
 0x38
 0x39

julia> map(0:9) do i
           UInt8("$i"[1]) - 0x30 |> Int8
       end 
10-element Vector{Int8}:
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9

Above we see that '0' corresponds to 0x30 (decimal 48). Thus we just subtract that value from the byte values to get the digits.

julia> line[1:19] .- 0x30 .|> Int
19-element Vector{Int64}:
   0
 254
   1
   4
   6
   3
   9
   7
   6
   1
   0
   7
   5
   6
   5
   2
   2
   1
   3

The exception above (254) is due to the '.' character:

julia> UInt8('.') - 0x30 |> Int
254

The remaining digits can then be combined in base 10 and divided by the appropriate power of ten to recover the intended value.

julia> sum((line[3:19] .- 0x30) .* (10 .^ (16:-1:0)))
14639761075652213

julia> sum((line[3:19] .- 0x30) .* (10 .^ (16:-1:0))) / 10^17
0.1463976107565221

julia> String(copy(line[1:19]))
"0.14639761075652213"

Combining the steps we could build our own float parser that looks as follows.

julia> function parse_float_no_allocations(bytes::AbstractVector{UInt8}, _start::Int = 1, _end::Int = length(bytes))
           int_val = 0
           power = 0
           for i in _start:_end
               byte = bytes[i]
               if byte != 0x2e
                   int_val *= 10
                   int_val += byte - 0x30
                   power *= 10
               else
                   power = 1
               end
           end
           q = int_val / power 
           return q
       end
parse_float_no_allocations (generic function with 3 methods)

julia> x = rand()
0.04775704047068485

julia> x_bytes = Vector{UInt8}(string(x));

julia> @time parse_float_no_allocations(x_bytes)
  0.000006 seconds (1 allocation: 16 bytes)
0.04775704047068485

julia> using BenchmarkTools

julia> @btime parse_float_no_allocations($x_bytes)
  38.345 ns (0 allocations: 0 bytes)
0.04775704047068485

julia> str = String(copy(x_bytes))
"0.04775704047068485"

julia> @btime parse(Float64, $str)
  156.581 ns (0 allocations: 0 bytes)
0.04775704047068485

That's a pretty crude parser, but I think you can see the idea. I'm not sure what the advantage of implementing a parser from scratch would be, other than incorporating prior knowledge you have about your data and cutting corners. Perhaps it would be useful to build a custom high-performance parser around your data format, though.
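As a rough sketch of how the pieces could fit together (reusing line, commas and parse_float_no_allocations from above, and assuming the line contains only digits, dots and commas), every field of the line could be parsed without creating intermediate strings:

function parse_line!(out::Vector{Float64}, line::AbstractVector{UInt8}, commas::Vector{Int})
    bounds = [0; commas; length(line) + 1]   # field i spans bounds[i]+1 : bounds[i+1]-1
    for i in 1:length(bounds)-1
        out[i] = parse_float_no_allocations(line, bounds[i] + 1, bounds[i+1] - 1)
    end
    return out
end

values = parse_line!(zeros(length(commas) + 1), line, commas)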

Yes, I think it does try to memory map.

2 Likes

That is a very good point, and it didn't cross my mind, thanks!

Sorry, I didn't think about this previously; I will definitely keep that in mind for subsequent posts.

Assuming your suggestions refer to this file format and not the MRE in the original post, here's how I'm currently reading it:

num_lines = countlines(filepath)
num_atoms = 2000
num_steps = (floor(Int, num_lines / (num_atoms + 9) ))

boxlengths::Vector{Float64} = zeros(3)
positions::Vector{Vector{Float64}} = [zeros(3) for _ in 1:num_atoms]

@time open(filepath) do io

    for s in 1:num_steps

        # Skip headers
        for _ in 1:5
            readline(io)
        end

        # Read box bounds
        for i in 1:3
            boxlengths[i] = sum(abs.(parse.(Float64, split(readline(io), ' '))))
        end

        # Skip header
        readline(io)

        # Read positions
        for i in 1:num_atoms
            positions[i] = parse.(Float64, split(readline(io), ' '))[3:5]
        end

        #Do calculations here

    end

end
0.926642 seconds (3.06 M allocations: 217.515 MiB, 2.07% gc time, 5.83% compilation time)

And that's pretty decent, I reckon. I plan to replace the vectors with SVectors at some point, assuming that would make it faster overall. Perhaps parse.() could also be replaced, but I'm not sure (yet) exactly how.

Replacing the "# Skip headers" loop with the line readuntil(io, "pp\n") (which does the same thing) gives exactly the same performance.

Would that imply replacing the "# Read positions" part of the example with a CSV.read(something)?
Not quite sure how this command works in this context, if I'm honest.

Very educational post @mkitti, thanks a lot!

I'll keep that in mind; perhaps I'll dive into it once I optimize the rest of the code.

You can do it like this:

using CSV, DataFrames

open("test.data") do f
    readuntil(f, "ITEM: ")  # skip over first ITEM: string
    while !eof(f)

        boxsize = CSV.read(IOBuffer(readuntil(f, "ITEM: ")), DataFrame)
        positions = CSV.read(IOBuffer(readuntil(f, "ITEM: ")), DataFrame)

    end
end

You also get the correct labels from the file format, e.g. you can do stuff like positions.xs etc.

3 Likes

That's pretty elegant indeed (and a bit faster), thank you very much for the suggestion!

Out of curiosity, would there be a way of reading the data into a vector of vectors (or even SVectors) instead of a DataFrame?

You can convert it like this:

using StaticArrays

SVector{3, Float64}.(positions.xs, positions.ys, positions.zs)

During the parsing itself there is no performance benefit from using SVector; the benefit, if any, comes in your subsequent processing.

2 Likes

Give this a try:


r="123.14639761075652213"
dp=findfirst('.',r)
pi= r[1:dp-1]
pd= r[end:-1:dp+1]


PD=reduce((s,c)->(s+(c-'0'))/10,pd,init=0)

PI=reduce((s,c)->s*10+(c-'0'),pi,init=0)

PI+PD

PS: Can you tell us if the solution with readuntil is still slower than numpy?
Could you post the benchmarks of the different proposals?

Thank you for the suggestion @rocco_sprmnt21.
Could you please explain a bit what's happening in this code?
How would you use it to read a file? (for example, the MRE in the original post)

Good point, I will prepare a comparison and post it when I find some time.

FWIW, a version using a mutable structure to collect the data (*edited with rocco's append! suggestion):

using CSV, DataFrames

file = raw"C:\Users\...\LAMMPS_test.data.txt"

Base.@kwdef mutable struct MyData
    timestep::Vector{Int} = Int[]
    natoms::Int = 0
    boxsize::Matrix{Float64} = Array{Float64}(undef, 0, 0)
    positions::DataFrame = DataFrame()
end

M = MyData()

# Read repeated items only once:
open(file) do f
    readuntil(f, "ITEM: NUMBER OF ATOMS")
    M.natoms = parse(Int, readuntil(f, "ITEM: BOX BOUNDS pp pp pp"))
    M.boxsize = Matrix{Float64}(CSV.read(IOBuffer(readuntil(f, "ITEM: ATOMS")), DataFrame, header=false))
end

# Read varying items:
open(file) do f
    readline(f)
    while !eof(f)
        push!(M.timestep, parse(Int, readuntil(f, "ITEM: NUMBER OF ATOMS")))
        readuntil(f, "ITEM: ATOMS")
        append!(M.positions, CSV.read(IOBuffer(readuntil(f, "ITEM: TIMESTEP")), DataFrame; delim=' ', ignorerepeated=true))
    end
end

# Access loaded data with:
M.timestep
M.natoms
M.boxsize
M.positions
M.positions.zs

Out of interest, did you attempt the Arrow approach? Further, did you know that you can store a valid struct type in the Arrow format, so you could convert into rows of lammps(?) data and retrieve them by index after the fact?

You could essentially do: CSV → Arrow of lammps rows. If you then decided to uniquely identify them and store a dictionary (mapping lammps data key to row index), you could grab rows almost instantly on demand and then act on them in your downstream code.
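I have not benchmarked this against the thread's data, but a rough sketch with Arrow.jl (the positions table and its timestep column here are placeholders) might look like:

using Arrow, DataFrames

positions = DataFrame(timestep = repeat(0:1, inner = 3),
                      xs = rand(6), ys = rand(6), zs = rand(6))

Arrow.write("positions.arrow", positions)    # one-off conversion to Arrow on disk

tbl = Arrow.Table("positions.arrow")         # memory-mapped on read
df  = DataFrame(tbl; copycols = false)       # wrap the Arrow columns without copying
index = Dict(ts => findall(==(ts), df.timestep) for ts in unique(df.timestep))

chunk = df[index[1], :]                      # grab one timestep's rows on demand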

Regards,

Thank you very much for this suggestion @rafael.guerra.

This solution is a bit slower than the one in this post; however, the results are very tidy! I could see myself using a sweet spot between performance and tidiness, depending on the demands of my final code.

Here are some rough benchmarks (for a file of 403809 lines, using my machine):

Most efficient (from this post, adapted from Steffen's): 1 second
Mutable structure (from this post): 2 seconds
Numpy: 5.5 seconds, using the following code

for i in 1:steps
    a = pyconvert(Matrix, np.loadtxt(filepath, skiprows=floor(Int, (i-1)*(2000+9) + 9), max_rows=2000, delimiter=' '))
end

CSV.read: the naive implementation of the original post (as seen below) makes my laptop scream and doesn't finish even if I leave it running for over a couple of minutes.

for i in 1:steps
    a = CSV.read(filepath, DataFrame, skipto=floor(Int, (i-1)*(2000+9) + 10), limit=2000, header=false, delim=' ')
end

When I find some time, I also have to test the SQLite method mentioned by @mnemnion, DuckDB mentioned by @rdavis120, and Arrow mentioned by @rocco_sprmnt21, but that would require some homework; hopefully I'll be able to test them soon.
Also, thank you @djholiver for the additional information on Arrow, I'll consider that!

Didn't expect so many options to be available; props to everyone who's putting in the time and effort to develop the Julia ecosystem. Kudos!

Could you post the final code using Steffen's solution?

How does the DataFrame created with readuntil() accumulate the different chunks? When I try it, it seems to replace the previous chunk.

PS: This was the reason I put an ugly vcat in my code example.

To your comment on the ecosystem: is this use case repeatable or broadly used enough to justify a package? The LAMMPS format appears to be well known.

Aside from that, I can't see any data in the posts around which to create a code snippet and an Arrow version. If you are able to post some, I might have the time to provide a benchmark of the above suggestion, if that would prove useful.

See this post.

1 Like

@rafael.guerra Here is the code I ended up using (which I corrected slightly from the original post).

using CSV, DataFrames

open(filepath) do f
    while !eof(f)
        readuntil(f, "ITEM: BOX BOUNDS")
        boxsize = CSV.read(IOBuffer(readuntil(f, "ITEM: ATOMS")), DataFrame, header=false, skipto=2)
        positions = CSV.read(IOBuffer(readuntil(f, "ITEM: ")), DataFrame, header=false, skipto=2)

    end
end

@djholiver Here is a link to a file using the LAMMPS format. For future reference, this is a OneDrive link and it will expire in one month if I'm not mistaken; sorry, but I don't know if I could keep the link active forever.

As for this comment, that's just genuine appreciation for everyone who provides me with the tools to do my job efficiently, for free. I wouldn't dare ask for something more specific to the LAMMPS format; I think what's already available would be more than enough.
This is why I made the original post using a generic MRE (here's the general solution, for anyone who's lost among all the posts in this thread): to find something that can be used by anyone with a similar problem.

Nevertheless, I really do appreciate how you guys offer to help with my specific example.

1 Like

This doesn't seem to give a complete answer to @rafael.guerra's question, at least as I understand it.
I'll try to make the point clear. There is a text file (made up of various pieces of non-homogeneous tables) that is too large to be loaded into memory and processed all at once.
It is then loaded in chunks and processed (how? the detail of this aspect is missing). What happens to the various processed pieces? Are they aggregations of the original pieces, and therefore, being reduced, can they be combined into a single table? Or must different tables be created (and saved separately)?
Or something else?
PS: Using append! instead of vcat in @rafael.guerra's snippet, the times could be shortened.
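A tiny toy example of the difference (placeholder column names):

using DataFrames

df    = DataFrame(xs = Float64[], ys = Float64[], zs = Float64[])
chunk = DataFrame(xs = rand(3), ys = rand(3), zs = rand(3))

df = vcat(df, chunk)   # allocates a brand-new DataFrame on every call
append!(df, chunk)     # grows the existing columns in place instead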

2 Likes

Thanks for the link. Hopefully I'll get some time to have a look.

It appears I was unclear: it was a genuine question whether this should end up as a package, given the data format and its possible uses in scientific fields.

Thanks for the insight @rocco_sprmnt21, I didn't read @rafael.guerra's post carefully enough.

In my particular case, I would like to do some calculations using the "chunk" of data, then replace it with the next chunk, effectively discarding it. I don't want to accumulate the chunks into another array, since that would be too much for my RAM if I were to use a full-size example.

I suppose that might be far from the topic of this thread, which is why I avoided it. To satisfy anyone's curiosity: I apply a series of functions to the data, which leads to a 3-element vector for each chunk; that vector is saved into a vector of vectors and everything else is discarded. I'm basically trying to calculate some correlation functions.

I appreciate it!

I don't think I can answer that, unfortunately. LAMMPS is just one among many different software packages people use to perform molecular dynamics simulations. It is the only one I've used so far, and I'm very much a beginner in this field. I reckon different packages could have their own unique output data formats, following different conventions. Usually, people working in the field are fairly comfortable with programming, and they would surely be able to write efficient code to import their data, or perhaps modify the format exported by the MD software.