Parsing a .gro file

Hello everyone, novice Julia user here.
I’m using Julia to write some scripts for data analysis of Molecular Dynamics simulations and I need to parse some .gro files(a concise description of the file format can be found here ). Following is a ‘snippet’ from one of my files, highlighting the relevant features for my question:

2500water OW1 9997 1.706 4.652 1.984
2500water HW2 9998 1.731 4.736 2.023
2500water HW3 9999 1.790 4.610 1.963
2500water MW410000 1.721 4.657 1.987
2501water OW110001 3.831 1.263 3.660
2501water HW210002 3.807 1.309 3.740
2501water HW310003 3.752 1.264 3.607
2501water MW410004 3.817 1.269 3.664

The way I want to parse this (part of) file would be as follows
resid::Int => 2500, 2501
resname::String => water
atomname::String => OW1, HW2, HW3, MW4
atomid::Int => 9997:10004
r::Vector{Float64} => [x, y, z]

I have managed to do this task with the following code, but it looks to me quite messy

# configlines is a Vector{String}, each String being a line
for (i, line) in enumerate(configlines)
        # split by whitespaces, in general they are in variable number
        splitted = filter(!isempty, split(line, r" +"))
        # splitted[1] is of the form "XXXXname" where X is digit, no whitespace between digits and string
        resid = parse(Int, split(splitted[1], r"\D+")[1])
        resname = split(splitted[1], r"\d+")[2]
        # notice that at some point in the file atomid>10000 and no whitespace is left between atomname and atomid
        atomname = splitted[2]
        try
            atomid = parse(Int, splitted[3]) #throws error if {atomid}{atomname} are not separated by whitespace
            r = parse.(Float64, splitted[4:6])
            config[i] = Particle(resid, resname, atomname, atomid, r) # irrelevant
        catch e
            atomid = parse(Int, split(atomname, r"\D+\d")[2])
            atomname = atomname[1:end-length(atomid)]
            r = parse.(Float64, splitted[3:5])
            config[i] = Particle(resid, resname, atomname, atomid, r) # irrelevant
        end
    end

Now, as I said, this does actually work correctly on my files, but I was hoping someone could give me some hint about a possibly cleaner implementation, and maybe even faster? On my laptop this takes ~200 ms for a 4k-lines file.
The main “problem” I have with this is how I have to deal with digits and strings touching each other without a whitespace in between.

All suggestions welcome,
Thank you.

Sorry, accidentally hit Enter and the question was posted incomplete. I cannot edit OP. I’ll continue in the following comment.

EDIT: Actually managed to edit OP. It was my first post, I just had some troubles with the interface. Sorry again.

Hi!
There are at least 3 possible performance issues with your code: use a function, avoid global variables, try… catch (which you can avoid)
See also:
https://docs.julialang.org/en/v1/manual/performance-tips/

I would recommend to write a separate function for parsing, and update the config variable in another function.
How does this pure regex solution compare?


function splitLine(s)
    rx = r"^(\d+)(\D+) +(\D+\d) *(\d+) +([\d.]+) +([\d.]+) +([\d.]+) *$"
    mc = match(rx,s).captures
    (resid=parse(Int,mc[1]),
    resname=mc[2],
    atomname=mc[3],
    atomid=parse(Int,mc[4]),
    r=(parse.(Float64,mc[5:7])))
end
configlines = split("""2500water OW1 9997 1.706 4.652 1.984
2500water HW2 9998 1.731 4.736 2.023
2500water HW3 9999 1.790 4.610 1.963
2500water MW410000 1.721 4.657 1.987
2501water OW110001 3.831 1.263 3.660
2501water HW210002 3.807 1.309 3.740
2501water HW310003 3.752 1.264 3.607
2501water MW410004 3.817 1.269 3.664""",'\n')

using BenchmarkTools
@btime splitLine.($configlines)

This gives me 8.466 μs per 8 lines, which is 1.058 μs per line.

I didn’t really think about doing it all with a single regex but it’s indeed a better approach!
I actually had to apply some minor corrections: * at the beginning to account for leading whitespaces and -* in the last 3 fields for possible negative values.

Timings also improved:
on a 40k-lines file I obtain ~75 ms with a list comprehension [splitline(s) for s in configlines] and ~80 ms with splitline.(configlines); the list comprehension also uses a bit less memory.

Thank you very much for your help!

1 Like

Glad to help! The regex makes for short code, but its usually not the fastest way.
The function I proposed is also not typestable (@code_warntype splitLine("asdf")), this is better:

function splitLine(s)
    rx = r"^(\d+)(\D+) +(\D+\d) *(\d+) +([\d.]+) +([\d.]+) +([\d.]+) *$"
    mc = match(rx,s).captures
    (resid=parse(Int,mc[1]),
    resname=String(mc[2]),
    atomname=String(mc[3]),
    atomid=parse(Int,mc[4]),
    r=parse.(Float64,mc[5:7]))
end

According to the documentation, first 4 columns have fixed width. Thus, you can avoid any regex and use substrings by indices (assuming there’s not going to be non-ASCII characters which seems justified):

function grostring(s::AbstractString)
    resid = parse(Int, SubString(s, 1:5))
    resname::String = SubString(s, 6:10) |> strip
    atomname::String = SubString(s, 11:15) |> strip
    atomid = parse(Int, SubString(s, 16:20))
    r = parse.(Float64, SubString(s, 21:44) |> split)
    return Particle(resid, resname, atomname, atomid, r)
end

In your example, the atom seem to take only 4 characters. Is that Discourse formatting or the file itself doesn’t correspond to the documentation?

1 Like

The “Chemfiles” package appears to support .gro files: https://github.com/chemfiles/Chemfiles.jl

Thank you, this is actually the kind of solution I was looking for, but didn’t knew about the existence of SubString. And you are right that in my example some whitespaces where cut off.

Still, the regex solution by @AndiMD seems to be ~20% faster and allocates (slightly) less memory.

I knew about this package, but I wanted to learn how to do the task rather than just using this tool; actually I tried to work my way through their src before asking here, in order to see how they do it, but I was not successful! Thank you nonetheless

1 Like

actually I tried to work my way through their src before asking here, in order to see how they do it, but I was not successful!

Chemfiles’ author here :smiley:. Chemfiles is implemented as a C++ core that the Julia package calls into, the part parsing GRO files is here: https://github.com/chemfiles/chemfiles/blob/76e77998739420fef2a2080212d148da9ac8d28b/src/formats/GRO.cpp. Since the code is distributed under BSD license, your are welcome to take inspiration from it, as long as you acknowledge it if you end up copying code.