Best practice: modules and scripts for publishing

I’m getting ready to publish an article using Julia-based code. I’m curious if anyone has thoughts about how such code should be arranged.

Right now, I have a bunch of modules to separate functions, and a top-level script to run them.

However, since these are modules rather than packages, they live in Main instead of in their own namespaces, and that has occasionally created problems when working with them.

Would it be better to keep everything at the top level, split across separate files to be included? Would one option be easier or harder for others to use?

Should I put everything into functions or is running a script file with calls for the working parts okay?

Curious if anyone has thoughts on this.

3 Likes

What I would do is organize things as if it were a “full-fledged” project:

  • dependencies listed in Project.toml and Manifest.toml
  • real source code in files under the src directory, possibly organized in submodules
  • top-level “glue” script in test/runtests.jl
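
Concretely, for a hypothetical project named MyPaper, that layout would look something like:

```
MyPaper/
├── Project.toml       # direct dependencies and package metadata
├── Manifest.toml      # exact resolved versions, for reproducibility
├── src/
│   └── MyPaper.jl     # real source code, possibly organized in submodules
└── test/
    └── runtests.jl    # top-level "glue" script that reproduces the results
```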

So that anyone wanting to reproduce your results only has to clone your repository (or otherwise get the files), then issue

using Pkg
Pkg.instantiate()
Pkg.test()

This may be overkill, but I would think that it is worth a little extra effort to make others’ lives easier.

For a more minimalistic option, you might consider only defining an environment (via the Project.toml and Manifest.toml files) alongside a one-file “script” which users would have to run.
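
With that minimal layout, a user with Julia installed would run something along these lines (assuming the script is named script.jl):

```
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
```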

8 Likes

I don’t think it’s overkill. Organizing code for a paper as a package has a lot of advantages, starting with CI. We do this with coauthors (on GitLab), primarily to ensure that the code on master is always in a working state, so that when we throw it on the cluster to run with the large dataset, we don’t get failed batch jobs back (we generated mock data for testing and benchmarking).

The other advantage comes from committing the Manifest.toml: you get a reproducible environment. This is invaluable when tracking down issues.

5 Likes

It’s overkill in the sense that the vast majority of scientific code is nowhere near this level of organization and reproducibility. That said, most programming languages don’t make it this easy to provide a gift-wrapped reproducible environment. It would be a crime not to use it.

For my next paper I’ve got a module, and then all my analysis code written as docs + doctests.
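
For anyone unfamiliar with doctests: the idea is to embed runnable examples in docstrings, which Documenter.jl can check automatically. A minimal sketch, using a made-up function, looks like:

````julia
"""
    double(x)

Return twice `x`.

# Examples

```jldoctest
julia> double(21)
42
```
"""
double(x) = 2x
````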

6 Likes

I’m in flux a bit on this, but what I’m settling on so far:

  1. Make a full package, as recommended above.
  2. Name should be descriptive and include something unique, like initials, so that it won’t crowd the namespace (e.g. CoolNewScript_NCB). Unless the package is intended to be more general-purpose and long-lasting, then go ahead and take a creative and/or general name.
  3. The main script file sets up the module, loads dependencies, and includes subfiles.
  4. Functions and data structures split into separate files, grouped into folders if more than a few. No submodules or second modules.
  5. Any solid general purpose code should ideally be split into a separate repository from the custom/experimental stuff, for inclusion in the package registry.
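
As a sketch of point 3 (using the hypothetical package name from point 2 and made-up file names), the main file might look like:

```julia
module CoolNewScript_NCB

using DataFrames        # example dependency, listed in Project.toml

# functions and data structures, split into separate files
include("types.jl")
include("analysis.jl")

end # module
```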

Haven’t settled on how to run it though. I’m hesitant to use “runtests.jl” and Pkg.test() for anything other than unit testing. Haven’t gotten the hang of doctests yet.

Maybe I should be using something like a Jupyter notebook for the run script, but I haven’t gotten into those either. I may just stick with a “run.jl” file or similar.

Thanks for your help! And @kevbonham, I’ll be curious to see how you structured your code once it’s out.

It’s still WIP, but you can take a look here. src/ has a module, set up like a normal package. docs/ are analysis notebooks (currently using Weave, not Documenter, but will try to do it with Literate + Documenter later). There’s also a data/ folder that will eventually interface with DataDeps to allow downloads of relevant stuff that currently only lives on our private server, and bin/ contains some miscellaneous scripts.

3 Likes

I used to rely on Pkg.test(), which had the nice property of taking care of test-specific dependencies (or, in this case, use-case-specific dependencies, as opposed to the dependencies required by the code under src/). However, I wouldn’t recommend (ab)using Pkg.test() for running plain use cases (as opposed to unit tests) any more, especially since I recently realized that Pkg.test() does a lot more than I initially thought. In particular, it sets --check-bounds=yes, which is a very sensible thing to do for tests, but impacts the performance of “regular” use cases.
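
To illustrate: under --check-bounds=yes, the @inbounds annotation in a hot loop like the one below is ignored, so the bounds checks it was meant to elide still run:

```julia
function mysum(x::Vector{Float64})
    s = 0.0
    @inbounds for i in eachindex(x)  # @inbounds becomes a no-op under --check-bounds=yes
        s += x[i]
    end
    return s
end
```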

1 Like