Best practices for single script files

Hey all :wave:

In the future I see myself writing a lot of single script files. They will be, almost exclusively, for analysing my molecular dynamics simulations. I imagine a rough skeleton of the script to be:

define some constants # May vary from script to script

read_trajectory_file()

loop through frames and atoms
compute some stuff
log the answers to some text file

Adding to it, I can imagine running the code like: julia code.jl on a HPC cluster.

How would the expirienced julia programmers code such a thing in the most efficient manner?
I am excited to read the responses :slight_smile:

Thank you in advanced.

Regards,
Ved

Follow the performance tips. Particularly, enclose your code inside functions and do not use (any, for simplicity) global variable.

Also, I recommend using environments. This is the workflow I use: Development workflow · JuliaNotes.jl.

By the way, since you are planing analyze MD simulations, take a look at MolSimToolkit.jl, particularly to the Simulation object described here. This is a framework we built exactly to make MD simulation analysis easy.

(Also look at the JuliaMolSim organization and, why not, the other packages we develop).

You might be interested in DrWatson.jl

Docs: Introduction · DrWatson.jl (juliadynamics.github.io)

Introduction video: DrWatson: The Perfect Sidekick to Your Scientific Inquiries | George Datseris | JuliaCon 2020 (youtube.com)

1 Like

Hey, thanks a lot for your reply. I will take a look at the links you have shared.

I mostly use GROMACS and LAMMPS for my work. I have used Chemfiles.jl in past to read .xtc files and I used to parse lammpstrj file manually by code. Your package looks quite neat too. I would live to talk about it, but this is not a good place for that :slight_smile:

I have preferred writing my own analysis code and use packages only to parse the compressed input files and I think I am going to keep writing my own analysis code.

Thanks again for your reply.

Thanks for your message. I have come across this package, but it looks like this package is very overkill for what I forsee. Nonetheless, this is a super cool package and maybe I will use it sometime later in my research xD.

Oh yes, MolSimToolkit.jl uses Chemfiles.jl as well. It just makes the interface a little bit simpler. The idea is exactly making writing custom analysis scripts easy.

For example, this is how the distance between two groups of atoms is computed along the simulation:

using MolSimToolkit
using PDBTools

function distances(
    simulation::Simulation,
    indexes1::AbstractVector{Int},
    indexes2::AbstractVector{Int};
)
    distances = zeros(length(simulation))
    for (iframe, frame) in enumerate(simulation)
        coor = positions(frame)
        cm1 = center_of_mass(indexes1, simulation, coor)
        cm2 = center_of_mass(indexes2, simulation, coor)
        cm2_wrapped = wrap(cm2, cm1, unitcell(frame))
        d = norm(cm2_wrapped - cm1)
        distances[iframe] = d
    end
    return distances
end


simulation = Simulation(pdb_file, trajectory_file)
pdb = readPDB(pdb_file)
distances(
    simulation, # Simulation object
    findall(Select("resnum 1"), pdb), # indexes of atoms of residue 1
    findall(Select("resnum 10"), pdb) # indexes of atoms of residue 10
)

Even with “time to first X” having gotten a lot better in recent version of Julia, I wouldn’t generally recommend this. It’s a type of workflow that works with compiled languages like Fortran, or even with Python, which has a small “startup cost”. Julia benefits from long-running processes, so adjust your workflows to that.

I can imagine running the code like: julia code.jl on a HPC cluster.

That’s fine as long as code.jl has a runtime of minutes at the very least (better hours or days – whatever the runtime of a typical HPC job). That is, don’t submit a job file that has a hundred different julia code.jl calls.

When working locally, I would recommend a (Jupyter) notebook workflow, where you have a long-running kernel. Or using the REPL. Even if you have a lot of code.jl file for small calculations or to produce plots, it’s much better to do include("code.jl") in a long-running REPL than to run julia code.jl in your shell.

Similarly, for an HPC run: have a single job.jl file which can still include smaller “script” files, but uses a single long-running Julia process. You can still use Julia’s excellent multi-threading support to parallelize things and to best utilize your HPC node.

2 Likes

Thank you so much for your message. This is exacly what I waned to know. I have been using FORTRAN for some time now, but I can see that my codes are becoming more complex and I would spend a lot of time coding things up. Julia would make coding things up quite fast.

Most of the codes would be mostly have runtime of more than a few minutes for sure.

but uses a single long-running Julia process.

I would have to see how to do this in my HPC. If you have any idea, then please do let me know.

Thank you once again, this was really helpful.

If you search this forum, you will find previous threads on using Julia on a cluster, e.g.,

You will also find general documentation on the web, e.g.,

Lastly, the HPC staff of the cluster you’re planning to use should be able to give you advice for their specific system.

My general recommendation was that your job script should only call julia once. Beyond that, calling julia on the cluster isn’t that much different from running it on your workstation. Except maybe that you should be more aware of the exact resources you’re going to use, and carefully manage the number of processes/threads. The JULIA_EXCLUSIVE environment variable might be useful on a cluster.

2 Likes

Thank you so much for such a detailed answer. This is super helpful.