Why start domain-specific scientific projects as Packages?

Hi everyone, I’ve been trying to establish a good workflow while using Julia (coming over from MATLAB) for my doctoral work. I’m at the beginning, so I have some time to set things up.

So far I’ve been quite comfortable with writing a script in an editor alongside a REPL, and then generalizing the script into a function (if possible) that I eventually add to my module (Revise = :heart_eyes:).

In the end, I have a module which contains all the functions that I use (and will use). As I develop more code, I will add functions to this module, and I will have scripts that use those functions to generate the specific plots/analyses that I need for my scientific project.

I’ve been introduced to making packages a few times, but never really understood why I should do it. After briefly taking a look, it seemed a bit cumbersome, and I wondered if it is only for more general software projects. As an individual PhD student, a legacy of useful functions and analyses which are nicely documented (hopefully) should be pretty good. Can someone tell me how exactly making a package is useful in niche/specific projects, such as the analyses needed for a thesis or paper?

Sharing: there must be parts of your work that would: 1) benefit others, 2) improve from reviewing by others.

Also, it’s nice to separate things into self-contained units. It feels nice.

2 Likes

I think there is a misunderstanding here: generally it is suggested that code is organized into projects, as that gives you a reproducible environment.

Projects may or may not become packages, depending on how general the code is.
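
For instance, starting a project is just a matter of activating an environment and adding dependencies; a minimal sketch, where the project and package names are placeholders:

```julia
using Pkg

Pkg.activate("MyAnalysis")   # activate (and create, if needed) the ./MyAnalysis environment
Pkg.add("DataFrames")        # dependency recorded in MyAnalysis/Project.toml and Manifest.toml
```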

5 Likes

@ElectronicTeaCup my impression on this is as follows: try to write rather generic code in your module and at some point split some of it into “orbit packages”. Keep your local, domain-specific (or even research-topic-specific) code in a Julia project (with Project.toml and, most importantly, Manifest.toml) that evolves along with your paper/research. When the time (i.e. submission/publication :slight_smile:) comes, it’s very easy to push that to a separate repo for effortless replication.

Note: as long as you distribute Manifest.toml there is no need to register your packages.
e.g. (shameless promotion :slight_smile:) Property (T) in julia - Nextjournal
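
Concretely, anyone who clones a project that ships its Manifest.toml can recreate the exact environment; a minimal sketch:

```julia
using Pkg

Pkg.activate(".")    # activate the project in the cloned directory
Pkg.instantiate()    # install the exact versions recorded in Manifest.toml
```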

1 Like

> Revise = :heart_eyes:

Thanks!

Back when mammoths roamed the world and I wrote my lab’s code in Matlab, the fact that I wasn’t using unit testing had a fairly disastrous long-term consequence: I had to “freeze” some of my most complex bits of code, because the risk of introducing new bugs grew to be larger than any benefit from fixing known bugs or adding new features. One of the main advantages of starting out with packages is that it gives you a canonical place to put unit tests alongside your code. If you’re like me, you tested your code as you wrote it, but I was just throwing those tests away into the dustbin of history.

You can use PkgTemplates.jl to create the folder structure in a few seconds, so there’s really very little disadvantage to doing this. I typically use

```julia
using PkgTemplates
t = Template(ssh=true, plugins=[TravisCI(), Codecov(), GitHubPages()])
generate("MyNewPkg", t)
```

This takes just a couple of seconds; not only does it give you the src/ and test/ folders, it also sets up:

  • Travis testing (you can use services besides Travis, see the docs)
  • documentation structure
  • coverage analysis

You don’t have to immediately fill these in, but they’re there for you if you want them.
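
To give a feel for how low the barrier is, test/runtests.jl can start as small as this (`center` is a hypothetical function standing in for whatever your package exports):

```julia
using Test
using MyNewPkg

@testset "center" begin
    # `center` is a placeholder: subtract the mean from a vector.
    @test MyNewPkg.center([1.0, 2.0, 3.0]) ≈ [-1.0, 0.0, 1.0]
    @test all(iszero, MyNewPkg.center([5.0, 5.0]))
end
```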

16 Likes

That is definitely a source of misunderstanding. Even as an experienced Julia user, I still don’t quite know how to effectively use the middle ground that exists between simple scripts and packages.

1 Like

The main difference between a package and a project is that a package is a registered project. If your project is not expected to be re-used as the foundation for something else, it may be unnecessary to register it as a package, but it may still benefit from the project layout.

This is a huge benefit that is really hard to overstate. If you also check in your manifest when you’re working on your project, it will be self-contained and future-proof, since you’ll always be able to run it with the exact versions of dependencies that you had when it worked.

Another good reason is that I think this is probably one of the best adages in software design: when solving a complex problem, design the API that you would want to use, implement that API, then use the API you implemented to solve the problem. In other words, put yourself in two different roles in different phases of problem solving—the library user and the library author. The reason this is effective is that if you’re like me, it’s hard to remember all the details of how something works, but if there’s a nice API that can act as an abstraction boundary, I don’t have to keep all of those details in my head no matter how complex the implementation may be. A good API is also testable, so you can write that part, test it until you’re really sure that it does what it’s supposed to do and then move on and use it without worrying that there’s some bug in it that’s causing problems at a higher level.

20 Likes

Don’t be too scared by the term “software design” (you don’t have to be a CS whiz). Stefan’s approach works even for simple project-type packages. To start with, your API can be as simple as read_data(), analyze(), and plot_results(), as in the sketch below. As you dig into it, you may find it’s better to split your code into smaller methods, and that’s where you can think about a better API to help you now and in the future.
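
A minimal sketch of such a project API, with placeholder implementations you would replace with your real ones:

```julia
module ThesisAnalysis

using DelimitedFiles  # stdlib; stands in for your real data-loading dependency

export read_data, analyze, plot_results

"Read a comma-delimited file of numbers into a matrix."
read_data(path::AbstractString) = readdlm(path, ',', Float64)

"Stand-in analysis: per-column means."
analyze(data::AbstractMatrix) = vec(sum(data; dims=1) ./ size(data, 1))

"Stand-in for plotting: print each result."
plot_results(results) = foreach(println, results)

end # module
```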

4 Likes

How does that work in practice? I find that I’m always coding inside the global environment (except for specific purposes). Does anyone work in the environment of the package they’re currently developing? What do you do about e.g. BenchmarkTools? You don’t want to add it to that package’s Project.toml, so… stacked environments?

I put dev tools in @v1.x, i.e. the current shared environment, and whatever the current package/project needs in its own environment. Then I have a .envrc file in each project with export JULIA_PROJECT=@. and I use direnv. That way I always have my dev tools in the REPL, and I can load the dependencies of whatever project I’m currently in.

If you’re trusting, you could just put that export in your global bashrc file (or whatever shell you use), but it’s a bit dangerous, since it means that cloning some project and starting Julia in that directory can change what packages mean (although you still have to install them).
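
This works because environments stack: packages in the shared @v1.x environment remain loadable from any project. A sketch of keeping BenchmarkTools out of a project’s Project.toml:

```julia
using Pkg

Pkg.activate()                # the default shared environment, e.g. @v1.10
Pkg.add("BenchmarkTools")     # dev tool lives here, visible from every project

Pkg.activate(".")             # back to the current project's own environment
using BenchmarkTools          # still loads, via the environment stack
```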

8 Likes

I have a similar story regarding tests. Right at the beginning, when I started writing basic scripts, I was taught that unit tests are a good idea. Soon after, however, I stopped writing them, since I would just quickly do “sanity checks” with a few lines of code. Then that turned into “if it breaks, I will check what’s wrong” (yikes). One big issue in my last project was that by the end things were chaotic and impossible to maintain.

I certainly feel that this will help me think more about testing my code: writing better unit tests, and writing them at all. And it should bring better organization overall, and therefore better reproducibility (I have been told this is good in science).

3 Likes

Concerning the directory structure: is it best to keep extra directories, such as analysis scripts (using the functions from the package) and reports, separate from the package (in their own repo), or to just bundle everything together?

1 Like

I use the project structure of DrWatson:

```
│projectdir          <- Project's main folder. It is initialized as a Git
│                       repository with a reasonable .gitignore file.
│
├── _research        <- WIP scripts, code, notes, comments,
│   |                   to-dos and anything in an alpha state.
│   └── tmp          <- Temporary data folder.
│
├── data             <- **Immutable and add-only!**
│   ├── sims         <- Data resulting directly from simulations.
│   ├── exp_pro      <- Data from processing experiments.
│   └── exp_raw      <- Raw experimental data.
│
├── plots            <- Self-explanatory.
├── notebooks        <- Jupyter, Weave or any other mixed media notebooks.
│
├── papers           <- Scientific papers resulting from the project.
│
├── scripts          <- Various scripts, e.g. simulations, plotting, analysis.
│   │                   The scripts use the `src` folder for their base code.
│   └── intro.jl     <- Simple file that uses DrWatson and uses its greeting.
│
├── src              <- Source code for use in this project. Contains functions,
│                       structures and modules that are used throughout
│                       the project and in multiple scripts.
│
├── README.md        <- Optional top-level README for anyone using this project.
├── .gitignore       <- By default ignores _research, data, plots, videos,
│                       notebooks and LaTeX-compilation related files.
│
├── Manifest.toml    <- Contains full list of exact package versions used currently.
└── Project.toml     <- Main project file, allows activation and installation.
                        Includes DrWatson by default.
```

and keep everything together. I commit everything besides plots and data to git as well.
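
For reference, setting up and using such a project with DrWatson looks roughly like this (project, author and file names are placeholders):

```julia
using DrWatson

# One-time setup: create the folder structure above as a Git repository.
initialize_project("MyProject"; authors="ElectronicTeaCup")

# At the top of each script in scripts/: find and activate the project.
@quickactivate "MyProject"

# Build paths relative to the project root, wherever the script runs from.
datafile = datadir("exp_raw", "trial_1.csv")
figfile  = plotsdir("fig1.png")
```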

8 Likes

Exactly! In DrWatson we even put Git into the mix automatically, to ensure that the project is future- and past-proof at every step.

This is a very nice way to put it. I say the same thing in the “Writing good scientific code” workshop I give at research institutes, but this sentence really summarizes things very well.

4 Likes

Wow, this is incredible. Thanks for sharing; I was just reading a paper on properly structuring your project directory, and DrWatson is like being served exactly what I wanted on a silver platter.

2 Likes

Working in each package’s environment has been a very pleasant revelation for us, thank you. It is so nice to come back to an old package we wrote a long time ago and find that its bits haven’t rotted at all, and all tests pass.

6 Likes