The solution then had three parts:
- using `eachsplit`, as suggested in Performance: read data from ascii file, replace `split` - #2 by artemsolod
- using the solution provided by Mason here to parse and set the fields in a type-stable manner: Unroll setfield! - #3 by Mason
- using `InlineStrings`, as indicated in Performance: read data from ascii file, replace `split` - #8 by artemsolod, to reduce the memory footprint of the data structure being created.
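To make the first two points concrete, here is a minimal sketch. It is not the code from the linked threads and not the actual `readCIF` implementation. The `Atom` struct, the `parse_atom` function, and the six-column line layout are hypothetical simplifications chosen for illustration. It assumes Julia 1.8 or later, where `eachsplit` is available in Base.

```julia
# Hypothetical, reduced atom record (the real structure has more fields).
struct Atom
    index::Int
    name::String
    resname::String
    x::Float64
    y::Float64
    z::Float64
end

# Parse one whitespace-separated line into an Atom.
# `eachsplit` iterates lazily over the tokens, so no intermediate
# Vector of substrings is allocated, unlike `split`.
# Each field gets its own branch, so every local variable keeps a single
# concrete type and the function stays type-stable.
function parse_atom(line::AbstractString)
    index = 0
    name = ""
    resname = ""
    x = 0.0
    y = 0.0
    z = 0.0
    for (i, token) in enumerate(eachsplit(line))
        if i == 1
            index = parse(Int, token)
        elseif i == 2
            name = String(token)
        elseif i == 3
            resname = String(token)
        elseif i == 4
            x = parse(Float64, token)
        elseif i == 5
            y = parse(Float64, token)
        elseif i == 6
            z = parse(Float64, token)
        end
    end
    return Atom(index, name, resname, x, y, z)
end

atom = parse_atom("1 NP PRO 171.946 588.581 135.200")
```

For the third point, the `String` fields in a sketch like this could presumably be swapped for a fixed-size type such as `String7` from InlineStrings, so that short names are stored inline in the struct rather than as separately allocated strings.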
The result is very good. I can now read my ~64M data objects in under two minutes:
julia> @time ats = readCIF("./all.cif")
106.647978 seconds (129.37 M allocations: 12.242 GiB, 14.24% gc time, 0.03% compilation time)
Array{Atoms,1} with 64423983 atoms with fields:
index name resname chain resnum residue x y z occup beta model segname index_pdb
1 NP PRO 7 1 1 171.946 588.581 135.200 1.00 0.00 0 1
2 HC PRO 7 1 1 172.571 588.749 134.422 1.00 0.00 0 2
3 HC PRO 7 1 1 171.019 588.890 134.923 1.00 0.00 0 3
⋮
64423981 CLA CLA I 50 20452615 104.220 615.013 -331.799 1.00 0.00 0 64423981
64423982 CLA CLA I 51 20452616 130.543 586.064 -347.000 1.00 0.00 0 64423982
64423983 CLA CLA I 52 20452617 87.912 628.908 -347.424 1.00 0.00 0 64423983