I think CSV.read mmaps the file anyway, but maybe the overhead of mmap is large enough that doing it once and passing the result as input would speed things up.
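If it helps, a minimal sketch of that idea (untested; it just maps the file once and hands the buffer to CSV.read):
using Mmap, CSV, DataFrames
bytes = Mmap.mmap("test.data")              # map the file once; gives a Vector{UInt8} backed by the file
df = CSV.read(IOBuffer(bytes), DataFrame)   # reuse the mapped bytes instead of passing the path again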
A few details to keep in mind:
For very small files numpy seems superior (fewer allocations).
The @time macro only tracks allocations from Julia. That means for numpy it will only tell you how expensive it is to call Python, but not how much the Python code itself allocated.
(e.g. parse.(Float64, split(String(bytes[x:y]), ','))) Most likely parse is not the most performant option here. Also, bytes[x:y] allocates a new array of bytes; use @view bytes[x:y] instead.
You can directly use DelimitedFiles (juliadata.org) or CSV.read with an IOBuffer like in my initial post.
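Something along these lines (untested sketch; x and y stand for the chunk boundaries you already have):
using CSV, DataFrames
chunk = @view bytes[x:y]                                  # a view, so no copy is made
df = CSV.read(IOBuffer(chunk), DataFrame; header=false)   # parse just that slice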
For people trying to help, it might be useful if you post your benchmark timings. That way one can see how the methods compare without having to fire up a Julia session.
I would second that. Your segments seem to be around 2000 lines each, which means you will not benefit from the power of mmap anyway.
I'm worried that you're doing some premature optimization here. Are the allocations really the main issue? You probably do want to use a validating parser. That said, we can continue.
Let's say we isolate a line. We can then find the commas.
julia> line = bytes[226:418];
julia> commas = findall(==(UInt8(',')), line)
9-element Vector{Int64}:
20
40
59
79
98
118
136
155
174
We now know the first number is contained within the first 19 characters.
julia> String(copy(line[1:19]))
"0.14639761075652213"
julia> line[1:19]
19-element Vector{UInt8}:
0x30
0x2e
0x31
0x34
0x36
0x33
0x39
0x37
0x36
0x31
0x30
0x37
0x35
0x36
0x35
0x32
0x32
0x31
0x33
To decode the bytes, we need to understand the ASCII value of the digits.
julia> map(0:9) do i
UInt8("$i"[1])
end
10-element Vector{UInt8}:
0x30
0x31
0x32
0x33
0x34
0x35
0x36
0x37
0x38
0x39
julia> map(0:9) do i
UInt8("$i"[1]) - 0x30 |> Int8
end
10-element Vector{Int8}:
0
1
2
3
4
5
6
7
8
9
Above we see that '0' corresponds to 0x30, or 48. Thus we just subtract that value from the byte values to get the digits.
julia> line[1:19] .- 0x30 .|> Int
19-element Vector{Int64}:
0
254
1
4
6
3
9
7
6
1
0
7
5
6
5
2
2
1
3
The exception above is due to the '.' character.
julia> UInt8('.') - 0x30 |> Int
254
The remaining digits can be weighted by powers of 10 and summed, then divided to recover the intended value.
julia> sum((line[3:19] .- 0x30) .* (10 .^ (16:-1:0)))
14639761075652213
julia> sum((line[3:19] .- 0x30) .* (10 .^ (16:-1:0))) / 10^17
0.1463976107565221
julia> String(copy(line[1:19]))
"0.14639761075652213"
Combining the steps, we could build our own float parser that looks as follows.
julia> function parse_float_no_allocations(bytes::AbstractVector{UInt8}, _start::Int = 1, _end::Int = length(bytes))
int_val = 0
power = 0
for i in _start:_end
byte = bytes[i]
if byte != 0x2e
int_val *= 10
int_val += byte - 0x30
power *= 10
else
power = 1
end
end
q = int_val / power
return q
end
parse_float_no_allocations (generic function with 3 methods)
julia> x = rand()
0.04775704047068485
julia> x_bytes = Vector{UInt8}(string(x));
julia> @time parse_float_no_allocations(x_bytes)
0.000006 seconds (1 allocation: 16 bytes)
0.04775704047068485
julia> using BenchmarkTools
julia> @btime parse_float_no_allocations($x_bytes)
38.345 ns (0 allocations: 0 bytes)
0.04775704047068485
julia> str = String(copy(x_bytes))
"0.04775704047068485"
julia> @btime parse(Float64, $str)
156.581 ns (0 allocations: 0 bytes)
0.04775704047068485
That's a pretty crude parser, but I think you can see the idea. I'm not sure what the advantage of implementing a parser from scratch would be, other than incorporating prior knowledge that you have about your data and cutting corners. Perhaps it would be useful to build a custom high-performance parser around your data format, though.
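For example, with the line and commas from earlier, the first field could be parsed in place without allocating an intermediate String (sketch; it relies on the objects defined above):
first_field = parse_float_no_allocations(line, 1, commas[1] - 1)   # bytes 1:19, i.e. the text "0.14639761075652213"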
Yes, I think it does try to memory map.
That is a very good point, and it didn't cross my mind, thanks!
Sorry, I didn't think about this previously; I will definitely keep that in mind for subsequent posts.
Assuming your suggestions refer to this file format and not the MRE in the original post, here's how I'm currently reading it:
num_lines = countlines(filepath)
num_atoms = 2000
num_steps = floor(Int, num_lines / (num_atoms + 9))
boxlengths::Vector{Float64} = zeros(3)
positions::Vector{Vector{Float64}} = [zeros(3) for _ in 1:num_atoms]
@time open(filepath) do io
for s in 1:num_steps
# Skip headers
for _ in 1:5
readline(io)
end
# Read box bounds
for i in 1:3
boxlengths[i] = sum(abs.(parse.(Float64, split(readline(io), ' '))))
end
# Skip header
readline(io)
# Read positions
for i in 1:num_atoms
positions[i] = parse.(Float64, split(readline(io), ' '))[3:5]
end
#Do calculations here
end
end
0.926642 seconds (3.06 M allocations: 217.515 MiB, 2.07% gc time, 5.83% compilation time)
And that's pretty decent, I reckon. I plan to replace the vectors with SVectors at some point, assuming that would make it faster overall. Perhaps parse.() could also be replaced, but I'm not sure (yet) exactly how.
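For the SVector part, maybe something along these lines (untested sketch, assuming StaticArrays; whether it actually helps would need benchmarking):
using StaticArrays
positions = Vector{SVector{3, Float64}}(undef, num_atoms)
for i in 1:num_atoms
    fields = split(readline(io), ' ')
    positions[i] = SVector(parse(Float64, fields[3]),
                           parse(Float64, fields[4]),
                           parse(Float64, fields[5]))   # parse only the three needed columns
end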
Replacing the '# Skip headers' loop with the line readuntil(io, "pp\n") (which does the same thing) gives exactly the same performance.
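That is, the skip becomes just:
# Skip headers: read up to and including the "ITEM: BOX BOUNDS pp pp pp" line
readuntil(io, "pp\n")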
Would that imply replacing the '# Read positions' part of the example with a CSV.read(something)? Not quite sure how this command works in this context, if I'm honest.
Very educational post @mkitti, thanks a lot!
I'll keep that in mind; perhaps I'll dive into it once I optimize the rest of the code.
You can do it like this
using CSV, DataFrames
open("test.data") do f
readuntil(f, "ITEM: ") # skip over first ITEM: string
while !eof(f)
boxsize = CSV.read(IOBuffer(readuntil(f, "ITEM: ")), DataFrame)
positions = CSV.read(IOBuffer(readuntil(f, "ITEM: ")), DataFrame)
end
end
You also get the correct labels from the file format, e.g. you can do stuff like positions.xs etc.
That's pretty elegant indeed (and a bit faster); thank you very much for the suggestion!
Out of curiosity, would there be a way of reading the data into a vector of vectors (or even SVectors) instead of a DataFrame?
You can convert it like this:
SVector{3, Float64}.(positions.xs, positions.ys, positions.zs)
During the parsing part there is no performance benefit from using SVector; only your subsequent processing might benefit.
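For example, once converted, element-wise work on the SVectors stays allocation-free (sketch; pos is just the converted vector from above):
using StaticArrays, LinearAlgebra
pos = SVector{3, Float64}.(positions.xs, positions.ys, positions.zs)
d = norm(pos[1] - pos[2])   # difference and norm of two SVectors allocate nothing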
Give this a try:
r = "123.14639761075652213"
dp = findfirst('.', r)                                     # position of the decimal point
pi = r[1:dp-1]                                             # integer part
pd = r[end:-1:dp+1]                                        # fractional digits, reversed
PD = reduce((s, c) -> (s + (c - '0')) / 10, pd, init=0)    # build the fraction from the last digit inward
PI = reduce((s, c) -> s * 10 + (c - '0'), pi, init=0)      # build the integer part digit by digit
PI + PD
PS
Can you tell us if the solution with readuntil is still slower than numpy?
Could you post the benchmarks of the different proposals?
Thank you for the suggestion @rocco_sprmnt21.
Could you please explain a bit what's happening in this code?
How would you use it to read a file? (for example, the MRE in the original post)
Good point, I will prepare a comparison and post it when I find some time.
FWIW, a version using a mutable structure to collect the data (*edited with rocco's append! suggestion):
using CSV, DataFrames
file = raw"C:\Users\...\LAMMPS_test.data.txt"
Base.@kwdef mutable struct MyData
timestep::Vector{Int} = Int[]
natoms::Int = 0
boxsize::Matrix{Float64} = Array{Float64}(undef, 0, 0)
positions::DataFrame = DataFrame()
end
M = MyData()
# Read repeated items only once:
open(file) do f
readuntil(f, "ITEM: NUMBER OF ATOMS")
M.natoms = parse(Int, readuntil(f, "ITEM: BOX BOUNDS pp pp pp"))
M.boxsize = Matrix{Float64}(CSV.read(IOBuffer(readuntil(f, "ITEM: ATOMS")), DataFrame, header=false))
end
# Read varying items:
open(file) do f
readline(f)
while !eof(f)
push!(M.timestep, parse(Int, readuntil(f, "ITEM: NUMBER OF ATOMS")))
readuntil(f, "ITEM: ATOMS")
append!(M.positions, CSV.read(IOBuffer(readuntil(f, "ITEM: TIMESTEP")), DataFrame; delim=' ', ignorerepeated=true))
end
end
# Access loaded data with:
M.timestep
M.natoms
M.boxsize
M.positions
M.positions.zs
Out of interest, did you attempt the Arrow approach? Further, did you know that you can store data as a valid struct type in the Arrow format, so you could convert it into rows of lammps(?) data and retrieve them by index after the fact?
You could essentially do CSV to Arrow of lammps rows. If you then uniquely identify them and store a dictionary (of lammps data key to row index), you could grab rows almost instantly on demand and then act on them in your downstream code.
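A rough sketch of what I mean (assuming Arrow.jl; the id column and some_key are hypothetical placeholders):
using Arrow, DataFrames
Arrow.write("positions.arrow", positions)                   # `positions` being a DataFrame parsed earlier
tbl = Arrow.Table("positions.arrow")                        # memory-mapped, cheap to reopen
row_of = Dict(id => i for (i, id) in enumerate(tbl.id))     # key -> row index lookup
i = row_of[some_key]
tbl.xs[i], tbl.ys[i], tbl.zs[i]                             # grab a single row on demand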
Regards,
Thank you very much for this suggestion @rafael.guerra.
This solution is a bit slower than the one in this post; however, the results are very tidy! I could see myself using a sweet spot between performance and tidiness, depending on the demands of my final code.
Here are some rough benchmarks (for a file of 403809 lines, using my machine):
Most efficient (from this post, adapted from Steffen's): 1 second
Mutable structure (from this post): 2 seconds
Numpy: 5.5 seconds, using the following code
for i in 1:steps
a = pyconvert(Matrix, np.loadtxt(filepath, skiprows= floor(Int,(i-1)*(2000+9) + 9) , max_rows=2000, delimiter=' '))
end
CSV.read: the naive implementation of the original post (as seen below) makes my laptop scream and doesn't finish even if I leave it running for over a couple of minutes.
for i in 1:steps
a=CSV.read(filepath,DataFrame,skipto=floor(Int,(i-1)*(2000+9) + 10) ,limit=2000 ,header=false,delim=' ')
end
When I find some time, I also have to test the SQLite method mentioned by @mnemnion, DuckDB mentioned by @rdavis120, and Arrow mentioned by @rocco_sprmnt21, but that would require some homework; hopefully I'll be able to test them soon.
Also, thank you @djholiver for the additional information on Arrow, I'll consider that!
Didn't expect so many options to be available; props to everyone who's putting in the time and effort to develop the Julia ecosystem. Kudos!
Could you post the final code using Steffen's solution?
How does the dataframe created with readuntil() accumulate the different chunks? When I try it, it seems to replace the previous chunk.
PS: This was the reason I put an ugly vcat in my code example.
To your comment on the ecosystem: is this use case repeatable or broadly used enough to justify a package? The lammps format appears to be well known.
Aside from that, I can't see any data in the posts around which to create a code snippet and an Arrow version. If you are able to post some, I might have the time to provide a benchmark of the above suggestion, if that would prove useful.
See this post.
@rafael.guerra Here is the code I ended up using (which I corrected slightly from the original post).
using CSV, DataFrames
open(filepath) do f
while !eof(f)
readuntil(f, "ITEM: BOX BOUNDS")
boxsize = CSV.read(IOBuffer(readuntil(f, "ITEM: ATOMS")), DataFrame, header=false, skipto=2)
positions = CSV.read(IOBuffer(readuntil(f, "ITEM: ")), DataFrame, header=false, skipto=2)
end
end
@djholiver Here is a link to a file using the lammps format. For future reference, this is a OneDrive link and it will expire in one month if I'm not mistaken; sorry, but I don't know if I could keep the link active forever.
As for this comment, that's just genuine appreciation for everyone who provides me with the tools to do my job in an efficient way, for free. I wouldn't dare ask for something that's more specific to the lammps format; I think what's available already would be more than enough.
This is why I made the original post using a generic MRE (here's the general solution, for anyone who's lost with all the posts in this thread): to find something that can be used by anyone with a similar problem.
Nevertheless, I really do appreciate how you guys offer to help with my specific example.
This doesn't seem to give a complete answer to @rafael.guerra's question, at least as I understand it.
I'll try to make the point clear. There is a text file (made up of various pieces of non-homogeneous tables) that is too large to be loaded into memory and processed.
It is then loaded in chunks and processed (how? The details of this aspect are missing). What happens to the various processed pieces? Are they aggregations of the original pieces and therefore, being reduced, can they be combined into a single table, or must different tables be created (and saved separately)?
Or something else?
PS
Using append! instead of vcat in @rafael.guerra's snippet, the times could be shortened.
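That is, roughly (with `chunk` standing for the DataFrame read in each iteration):
M.positions = vcat(M.positions, chunk)   # vcat allocates a brand-new DataFrame every iteration
append!(M.positions, chunk)              # append! grows the existing DataFrame in place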
Thanks for the link. Hopefully I'll get some time to have a look.
It appears I was unclear: it was a genuine question whether this should end up as a package, given the data format and possible usages in scientific fields.
Thanks for the insight @rocco_sprmnt21; I didn't read @rafael.guerra's post carefully enough.
In my particular case, I would like to do some calculations using the 'chunk' of data, then replace it with the next chunk, effectively discarding it. I don't want to accumulate the chunks into another array, since that would be too much for my RAM if I were to use a full example.
I suppose that might be far from the topic of this thread, which is why I avoided it. To satisfy anyone's curiosity: I would apply a series of functions to the data, which would lead to a 3-element vector for each of the chunks; these are then saved in a vector of vectors, and everything else is discarded. I'm trying to calculate some correlation functions, basically.
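Roughly like this, where compute_quantity is just a placeholder for the actual correlation calculations (sketch only, reusing the readuntil loop from above):
using CSV, DataFrames, StaticArrays
results = SVector{3, Float64}[]                 # one 3-element result per chunk
open(filepath) do f
    while !eof(f)
        readuntil(f, "ITEM: BOX BOUNDS")
        boxsize   = CSV.read(IOBuffer(readuntil(f, "ITEM: ATOMS")), DataFrame, header=false, skipto=2)
        positions = CSV.read(IOBuffer(readuntil(f, "ITEM: ")), DataFrame, header=false, skipto=2)
        push!(results, compute_quantity(boxsize, positions))   # placeholder returning an SVector{3, Float64}
        # boxsize and positions go out of scope here; only the small result is kept
    end
end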
I appreciate it!
I don't think I can answer that, unfortunately. LAMMPS is just one among many different software packages people use to perform molecular dynamics simulations. It is the only one I've used so far, and I'm very much a beginner in this field. I reckon different packages could have their own unique output data formats, following different conventions. Usually, people working in the field are fairly comfortable with programming, and they would surely be able to write efficient code to import their data, or perhaps modify the format exported by the MD software.