Performance: read data from ascii file, replace `split`

lmiq · November 12, 2024, 6:40pm

The solution ended up having three parts:

1. Using `eachsplit`, as suggested in Performance: read data from ascii file, replace `split` - #2 by artemsolod.
2. Using the solution provided by Mason to parse and set the fields in a type-stable manner: Unroll setfield! - #3 by Mason.
3. Using InlineStrings, as indicated in Performance: read data from ascii file, replace `split` - #8 by artemsolod, to reduce the memory footprint of the data structure being created.
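For readers who want to see the general shape of the first two parts, here is a minimal sketch using only Base (Julia 1.8 or later, for `eachsplit`). It is not the code from the linked posts: the `AtomRecord` type, its fields, and the helpers `parse_field` and `parse_record` are hypothetical names made up for illustration. The idea is to walk the lazy `eachsplit` iterator and convert each token according to the declared type of the corresponding field, so that no intermediate vector of substrings is built.

```julia
# Hypothetical record type; field names and types are assumptions for this sketch.
struct AtomRecord
    index::Int
    name::String
    x::Float64
    y::Float64
    z::Float64
end

# Convert one token to the type declared for the field it belongs to.
parse_field(::Type{T}, s::AbstractString) where {T<:Number} = parse(T, s)
parse_field(::Type{String}, s::AbstractString) = String(s)

# Build a `T` from one whitespace-delimited line.
# `eachsplit` yields SubStrings lazily, so no Vector of tokens is created.
function parse_record(::Type{T}, line::AbstractString) where {T}
    tokens = Iterators.Stateful(eachsplit(line))
    # `ntuple` with `Val` unrolls over the fields of `T`, taking one token per field
    # in order; `popfirst!` throws if the line has fewer tokens than fields.
    fields = ntuple(Val(fieldcount(T))) do i
        parse_field(fieldtype(T, i), popfirst!(tokens))
    end
    return T(fields...)
end

atom = parse_record(AtomRecord, "1 NP 171.946 588.581 135.200")
```

Whether the compiler fully infers the unrolled tuple can depend on the Julia version, so it is worth checking with `@code_warntype` before relying on it. A field stored as an inline string type from InlineStrings would need its own `parse_field` method, which is left out here to keep the sketch free of package dependencies.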

The result is very good. I can now read my roughly 64 million atoms in under two minutes:

julia> @time ats = readCIF("./all.cif")
106.647978 seconds (129.37 M allocations: 12.242 GiB, 14.24% gc time, 0.03% compilation time)
   Array{Atoms,1} with 64423983 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1   NP     PRO     7        1        1  171.946  588.581  135.200  1.00  0.00     0                 1
       2   HC     PRO     7        1        1  172.571  588.749  134.422  1.00  0.00     0                 2
       3   HC     PRO     7        1        1  171.019  588.890  134.923  1.00  0.00     0                 3
                                                       ⋮ 
64423981  CLA     CLA     I       50 20452615  104.220  615.013 -331.799  1.00  0.00     0          64423981
64423982  CLA     CLA     I       51 20452616  130.543  586.064 -347.000  1.00  0.00     0          64423982
64423983  CLA     CLA     I       52 20452617   87.912  628.908 -347.424  1.00  0.00     0          64423983
