Hello everyone, novice Julia user here.
I’m using Julia to write some scripts for analysing Molecular Dynamics simulations, and I need to parse some .gro files (a concise description of the file format can be found here). Below is a snippet from one of my files that highlights the features relevant to my question:
2500water OW1 9997 1.706 4.652 1.984
2500water HW2 9998 1.731 4.736 2.023
2500water HW3 9999 1.790 4.610 1.963
2500water MW410000 1.721 4.657 1.987
2501water OW110001 3.831 1.263 3.660
2501water HW210002 3.807 1.309 3.740
2501water HW310003 3.752 1.264 3.607
2501water MW410004 3.817 1.269 3.664
I would like to parse this part of the file into the following fields:
resid::Int
=> 2500, 2501
resname::String
=> water
atomname::String
=> OW1, HW2, HW3, MW4
atomid::Int
=> 9997:10004
r::Vector{Float64}
=> [x, y, z]
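For concreteness, a container for these fields could be sketched as below. The name `Particle` matches the constructor call in my code, but this exact definition is an assumption made for illustration.

```julia
# Hypothetical definition; field names and types follow the list above.
struct Particle
    resid::Int            # residue number, e.g. 2500
    resname::String       # residue name, e.g. "water"
    atomname::String      # atom name, e.g. "OW1"
    atomid::Int           # atom number, e.g. 9997
    r::Vector{Float64}    # position [x, y, z]
end

# First line of the snippet above, written out by hand.
p = Particle(2500, "water", "OW1", 9997, [1.706, 4.652, 1.984])
```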
I have managed to do this with the following code, but it looks quite messy to me:
# configlines is a Vector{String}, one String per line of the file
for (i, line) in enumerate(configlines)
    # split on whitespace; the number of spaces between fields varies
    splitted = split(line)
    # splitted[1] has the form "XXXXname" (digits, then the name, no whitespace in between)
    resid = parse(Int, split(splitted[1], r"\D+")[1])
    resname = split(splitted[1], r"\d+")[2]
    # once atomid >= 10000, no whitespace is left between atomname and atomid
    atomname = splitted[2]
    try
        atomid = parse(Int, splitted[3]) # throws if atomname and atomid are fused into one token
        r = parse.(Float64, splitted[4:6])
        config[i] = Particle(resid, resname, atomname, atomid, r) # irrelevant
    catch e
        # fused case, e.g. "MW410000": assumes the atom name ends with exactly one digit
        atomid = parse(Int, split(atomname, r"\D+\d")[2])
        atomname = atomname[1:end-ndigits(atomid)] # length(atomid) is 1 for an Int, so use ndigits
        r = parse.(Float64, splitted[3:5])
        config[i] = Particle(resid, resname, atomname, atomid, r) # irrelevant
    end
end
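The two `split` calls on the residue field could perhaps be replaced by a single regex `match`. This is only a sketch: `split_resfield` is a hypothetical helper name, and it assumes the residue name never starts with a digit.

```julia
# Hypothetical helper: split a token like "2500water" into (2500, "water").
# Assumes the token is leading digits followed by a name starting with a non-digit.
function split_resfield(token::AbstractString)
    m = match(r"^(\d+)(\D.*)$", token)
    m === nothing && error("unexpected residue field: $token")
    return parse(Int, m[1]), String(m[2])
end

resid, resname = split_resfield("2500water")
```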
As I said, this works on my files, but I was hoping someone could suggest a cleaner, and maybe faster, implementation. On my laptop this takes ~200 ms for a 4k-line file.
The main “problem” I have with this is how to deal with digits and strings that touch each other with no whitespace in between.
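One direction that might avoid the problem entirely is to slice each line by column rather than split on whitespace. This assumes the files follow the standard fixed-column .gro layout: 5 characters each for residue number, residue name, atom name and atom number, then 8 characters per coordinate. A minimal sketch follows; `parse_gro_line` is a hypothetical name, and the padded sample lines are reconstructed from the snippet above under that assumption.

```julia
# Sketch of fixed-width parsing. Assumes ASCII lines laid out as
# cols 1-5 resid, 6-10 resname, 11-15 atomname, 16-20 atomid,
# 21-28 x, 29-36 y, 37-44 z. Lines shorter than 44 characters throw a BoundsError.
function parse_gro_line(line::AbstractString)
    field(range) = strip(line[range])   # cut out one column and drop the padding
    resid    = parse(Int, field(1:5))
    resname  = String(field(6:10))
    atomname = String(field(11:15))
    atomid   = parse(Int, field(16:20))
    r = [parse(Float64, field(c)) for c in (21:28, 29:36, 37:44)]
    return (resid = resid, resname = resname, atomname = atomname, atomid = atomid, r = r)
end

# Two sample lines with the column padding restored (an assumption about the real file).
sample = [
    " 2500water  OW1 9997   1.706   4.652   1.984",
    " 2500water  MW410000   1.721   4.657   1.987",
]
atoms = parse_gro_line.(sample)
```

Because the fields are found by position, a fused token such as "MW410000" needs no special case, and no try/catch is involved. The returned NamedTuple could be swapped for a `Particle` constructor call.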
All suggestions welcome,
Thank you.