Workflow tips for small-team academic projects

I’m a longtime MATLAB user and have been learning Julia the past few weeks with a view to switching at least computationally intensive, if not all, coding work. But while I think I’ve gotten the hang of language basics like syntax, types, multiple dispatch, etc., I’m struggling to develop a comfortable workflow, or to see how it would scale once I bring collaborators on board. I’ve read tips from Julia veterans scattered around this forum and elsewhere, but they seem to mainly be geared at people developing packages for widespread use.

Right now, I simultaneously work on a couple projects (read: academic papers in econ/finance) with partially overlapping sets of 2-3 co-authors. We store code in Dropbox (yeah, I know, I know…). Each project has a shared directory. Within it, there’s a “code” subdirectory that lives alongside tex paper drafts, data, and other subdirs. Workflow is – open up MATLAB, cd to the code subdir, open the script and associated functions, start working. When finished, hit save and go grab a beer.

How would you adapt this for Julia? To be more specific, here’s a laundry list of questions which I’ve run into trying to answer the big-picture one. Feel free to answer all, some, one, or none and instead just share your own workflow if it’s related.

  1. Would every “project” as described above correspond to a package stored on Github?

  2. Is it possible/wise to have local storage hierarchy be project_name/code instead of .julia/path/to/code/project? Or should I not care where the local storage is at all once I start using git/github?

  3. Should I be doing test-driven development for my projects or is that overkill for academic projects in small teams?

  4. What’s the right way to adapt package creation/development instructions from the manual for packages that aren’t ever going to end up in the registry?

  5. Suppose part of a project requires coding up a set of tasks that I can see being re-used in other projects. Sounds like they merit their own package, but this package would also be specific to the needs of my team. How to develop a package and its dependency at the same time if neither is, or will ever be, in the registry?

To an extent, some of these are general (not specific to Julia) questions about adapting software engineering workflows to small academic teams. It’s just that Julia, for better or worse, seems to force users to adopt these conventions in a way that MATLAB doesn’t. I’m willing to adapt, but just not sure how.

Thanks!

13 Likes

Re test-driven development: Yes, I would absolutely recommend that you write automated tests for every piece of code you write, no matter how small. In my experience, it is very unlikely that you write a piece of code once and then never touch it again. If you have automated tests, you can change your code, rerun the tests and be confident that everything which worked before still works. And if it doesn’t, you usually get more fine-grained insight into which part is now broken.

I heartily second the use of test-driven development from the get-go. It has saved me countless hours of head scratching when I have no idea how, or if, a piece of code I wrote three months ago works.
I would also recommend the use of a local registry. For that, https://github.com/GunnarFarneback/LocalRegistry.jl is an incredibly useful resource. With a local registry, your team can focus on code that is relevant for your work without worrying too much about whether the code is useful (or sometimes even legible :slight_smile: ) to anyone outside your team.

Each project can have its own julia environment by doing ]activate . in the subdirectory of the project. This keeps things reproducible, and gives each project a degree of independence from the others. The whole set of projects can be kept within an overall git project, if desired. When something is finished, it can be broken out and put in its own git project.

1 Like

This pretty much describes what I do. I found the following convenient:

  1. Private projects on Gitlab. It provides CI.

  2. Check in the Manifest.toml, making a reproducible environment. This is important: when you have a deadline for a revision, you don’t want to spend a week porting code just because the API of some packages changed.

  3. Changes happen in branches, which at least one co-author reviews. We caught a lot of mistakes that way.

  4. Test religiously. Think of invariants coming from the model and use them. Eg in a model I am now working on the value functions should be homothetic wrt a set of parameters, so we just calculate it two ways and test that. Also, whenever you optimize a piece of code, you can put the old version in the tests for comparison.

  5. We work on two computational servers for estimation. Whenever we want to run it, we just pull from the Gitlab repo to the server, then make a branch and push back the results (the results are small, they get saved in the repo).

  6. All coauthors should commit to using Git. If they find it difficult, someone (you :wink:) should be prepared to offer setup and help.
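Item 4 above (testing model invariants, and keeping an old implementation around as a reference) could look roughly like the sketch below. The functions `crra_value`, `sumsq_reference` and `sumsq_fast` are hypothetical stand-ins invented for illustration, not code from the project described; the invariant used is simply that a CRRA value function is homogeneous of degree 1 - γ in wealth.

```julia
using Test

# Hypothetical stand-in for a model object: CRRA value of wealth w.
# It is homogeneous of degree 1 - γ in w, which gives an invariant to test.
crra_value(w; γ = 2.0) = w^(1 - γ) / (1 - γ)

# Hypothetical "old" straightforward version, kept in the tests as a reference ...
sumsq_reference(x) = sum(xi^2 for xi in x)
# ... and the "optimized" version that the project code would use.
sumsq_fast(x) = sum(abs2, x)

@testset "model invariants" begin
    γ, λ = 2.0, 3.0
    for w in (0.5, 1.0, 10.0)
        # Compute the same quantity two ways and compare.
        @test crra_value(λ * w; γ = γ) ≈ λ^(1 - γ) * crra_value(w; γ = γ)
    end
end

@testset "optimized vs reference" begin
    x = [0.1, 2.0, -3.5, 4.25]
    @test sumsq_fast(x) ≈ sumsq_reference(x)
end
```

Comparing with ≈ rather than == keeps the tests robust to floating-point rounding differences between the two code paths.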

Regarding your other questions:

  1. yes, every project should be a project (not necessarily a package)
  2. storage location is up to you, just pkg> activate your project whenever you are working on it (I keep stuff in ~/research)
  3. use a package template generator, eg PkgTemplates.jl or PkgSkeleton.jl
  4. if some code is reusable, put it in its own package, then if it is generally useful, consider registering it
18 Likes

Real useful to read this, Tamas.

@elenev, keep in mind that you don’t need to commit to all of those excellent points in one go. I’ve moved from Matlab ages ago and I wish I did everything Tamas does, but I don’t! My colleagues will learn git after hell freezes over, that’s just that (which is fine, cause they almost never contribute to the code). There is nothing wrong with taking small steps.

4 Likes

These are excellent tips, thanks! Since there are a few separate workflows here, let me try to summarize them and you (or others) can correct me where I’m wrong.

Starting a new project

  1. Create a new folder (in your case, ~/research/my_new_project)
  2. Create a git repo in that folder and sync it to a remote hosting platform shared with co-authors (Github or Gitlab – I need to do more research to understand the difference and the importance of CI)
  3. Create a Julia project by activating a new environment. Then create your main package in a subdirectory e.g. ~/research/my_new_project/main_package directory, with a subdir tree (with src and test, etc.) and dev’ing that package. Additional packages need to share the same environment, but get their own directory tree inside the same git repo/Julia environment i.e. ~/research/my_new_project/helper_package.
  4. Create “tests” i.e. scripts that call as yet non-existent functions to output the results I eventually want to get.

Working on this project

Suppose I want to make some changes to the latest version of main_package.

  1. Create a branch (e.g. called “edits”), clone it locally.
  2. activate the corresponding environment in Julia, dev the package (again? do I need to?), write some code and test it.
  3. When changes are done and it passes the tests, commit them. Or do I commit more frequently even when things don’t work yet?
  4. When everything works great and produces a new set of results with which we can update the paper/slides, merge the branch back into master. This way, master always matches a set of presentable/presented results.

If collaborators want to help with making changes, do they work on the same new branch in (3)? And we periodically resolve conflicts when pushing our commits? Is there anything Julia-specific to this part of the workflow i.e. does the need to activate environments and manage package dependencies introduce any wrinkles to the normal git collaboration process?

Lastly, 2 more workflow questions:

A. If there are 2 of us collaborating and each of us works on 2 machines (e.g. home and work), what’s the right way for me to leave off working on, say, the work machine, and pick up at home, without pushing broken code to the repo shared with my collaborator? Is the right thing to do, relative to the workflow above, to create a branch off of “edits” that’s specific to me (coauthor has no write access), commit broken code to it, and merge it back into “edits” once it’s no longer broken?

B. If I have the paper draft, slides, etc. update automatically from latest results, does it then make sense to keep those in the repository as well? They’re in tex, so it’s doable, but getting coding coauthors to adopt git is easier than getting ALL coauthors to adopt git… :slight_smile:

Thanks in advance for your responses. Sorry for the long questions, but hopefully this thread will help others as much as it’s helping me.

2 Likes

Again, I would really recommend using a package template generator for the initial setup.

There isn’t anything Julia-specific to the git workflow. You can find tutorials about it under the term “feature branch workflow”, eg this one.

I would recommend using a repo on Github/Gitlab for collaboration, even for two people. Generally a single person should work on a branch (unless they coordinate, but then why not just branch from there).

It makes sense to keep source files in the repo, eg LaTeX, for the paper and slides. Binary blobs like PDF benefit less from version control, but you can still commit them. I usually just keep sources in the repo.

1 Like

Maybe this will be of use to you: there is a nice package, DrWatson.jl, which looks like it can cover lots of your needs, like code organization, setting up a reproducible environment, and so on. It is worth looking at their workflow tutorial. It is really well written and establishes good habits.

I am not working in a scientific environment, so my workflow may differ from yours, but here is how I usually organize my work.

  1. Start in the _research folder. This is the time of exploration, when nothing is established yet, so the code is rather messy, non-generic, and is basically a bunch of loosely connected snippets.
  2. At some point structure appears. Some functions get factored out and can be reused. At this point it is good to put them in the src folder and use the includet function from Revise.jl to get a “package” feeling. I.e. if you put some of your functions in src/helpers.jl, then _research/intro.jl can start with
using DrWatson
@quickactivate "DrWatson Example"
using Revise
includet(srcdir("helpers.jl"))

and after that you can continue to work in _research folder, but any changes in src will be immediately updated, so usually at this point snippets tend to be more focused on results only with all machinery being hidden in src.

  3. This is also a good moment to start writing tests. Usually everything is still in flux, so there is no need to write very deep tests yet (opinions may vary, of course), but at least you can fix interfaces, i.e. simple things like "function foo accepts parameters x and y". Your tests need not even contain a single @test:
using Test

# foo is the project function under test; the call alone checks that its signature is unchanged
@testset "function foo" begin
  x = 1
  y = 2
  foo(x, y)
end

This way you can trace in the future which functions will fail if you or your colleagues change a function signature. Writing tests this way also documents how to call functions properly and how to prepare input data, since you can read it just by looking at the test code. Of course, in practice you usually also know the output (basically because you got it a few minutes earlier, when you actually called the function), so you can add it to the tests.

  4. One thing to remember: tests and helper functions are here to help you, not to restrict you. So it’s absolutely normal at this stage to refactor code and tests to better suit your needs.

  5. When you finish your project and are ready to move to another, it may be worth reflecting on the code you have already written. If you see that some pieces of code can be reused in other projects, it’s a good moment to create a package, and you already have all the necessary elements for a good package: isolated functions, tests, and snippets of code which can be turned into documentation.

This all of course should be added to previous points discussed in this thread (testing, source control etc).

2 Likes

Regarding Gitlab/Github you are not limited to storing code files on Git repositories.
I would think of copying your files over from Dropbox and starting a private git repository, as advised above.

You can store large binary files on Git repositories by using LFS

I feel a lot of sympathy for your current situation, so I hope these answers and recommendations, based on personal experience, are useful:

Not exactly: your “Matlab projects” can be equated to “Julia projects”. Packages are a specific type of project, meant to gather tools that can be reused in other projects. Thus, the files of packages are organized in a specific way, used by the Julia package manager.

A Julia project is a broader concept, and you can organize its files at your pleasure. The only requirement is that at the root code folder of the project, there must be a Project.toml and a Manifest.toml file. But they are automatically created (see below), so you may proceed without taking care of them.

You are free to put your projects wherever you want. Again, the only requirement is that each project has its own folder, with its own Project.toml and Manifest.toml files. It is not even necessary that all the files related to the same project are in subfolders of the project root folder. Just organize them as it is convenient for your team. (The package DrWatson recommended by @Skoffer may help you with complex folder structures, but it’s not a must at all. If your projects are small as you say, maybe you can go smoothly without it.)

Test-driven development is good. As already answered by others, it will help you revise and extend your projects preventing dead ends (“I have to fix this bug, but will I break something else that was already working?”, etc.) And Julia has very fine tools to support it. But that’s orthogonal to using Julia; for instance, you might adopt test-driven development workflow with Matlab too.

My advice is: if you are in no hurry to switch to Julia, adopt a test-driven workflow (basically, think about the tests before writing the code) with the language your team feels comfortable with. Then, if you are using Matlab, in my (very subjective) opinion you might feel happy to move from Matlab’s tools for TDD tests to Julia’s.

Or if you prefer, do it the other way around (first move to Julia, then think about improving your development workflows). But maybe it’s better not to take on both things at the same time. Changing habits and tools is always complicated and creates frustrations; and if you change both at the same time, frustrations due to one of them will contaminate the experience with the other.

I’ll answer those two together:

My advice is that you start creating projects, which are simpler than packages (no specific organization required, as said above).

When you start a Julia session to work in a project, do ]activate path/to/your/code/folder (or cd("path/to/your/code/folder") and then ]activate .). You’ll then be in a fresh, empty environment. Then ]add the packages you’ll use in that project (even if you already installed them in the default environment).

You don’t need to add them all at once; feel free to add them only as you need to use (or import) them. This will populate the Project.toml and Manifest.toml files in the folder of your project, which will be reproducible and portable as long as you keep the set of added packages as is. Some tips:

  • Start a different Julia session to work on each project, and activate its environment before using any package, to ensure that you are actually using the versions you have defined for the project.
  • If the packages work reasonably well, and the project is in a status such that breaking changes may cause headaches, don’t update them in the projects, unless the update fixes some bug that is relevant for the project. Better keep the project reproducible and stable. You’ll be able to work with updated versions in other projects with new environments.
  • If you want to use exactly the same packages of an older project in a new one, you can just copy the Project.toml and Manifest.toml files into the new project folder, instead of adding the same packages manually.
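The activate/add steps and the copy-the-toml-files tip above can also be run through the Pkg API instead of the ] REPL mode. The following is only a sketch: the temporary directories stand in for real project folders, and LinearAlgebra is chosen merely because it is a standard library, so nothing large needs to be downloaded. It assumes a working Pkg installation.

```julia
using Pkg

# Hypothetical project folder; a temporary directory stands in for path/to/your/code/folder.
project_dir = mktempdir()

Pkg.activate(project_dir)   # same as `]activate path/to/your/code/folder`
Pkg.add("LinearAlgebra")    # same as `]add LinearAlgebra`

# The environment is now recorded in two files inside the project folder.
println(read(joinpath(project_dir, "Project.toml"), String))

# Reuse exactly the same environment in a new project by copying both files.
new_project_dir = mktempdir()
for f in ("Project.toml", "Manifest.toml")
    cp(joinpath(project_dir, f), joinpath(new_project_dir, f))
end
Pkg.activate(new_project_dir)
Pkg.instantiate()           # installs the versions recorded in Manifest.toml
```

Pkg.instantiate() is also what a co-author would run after receiving the project folder, so that their machine ends up with the package versions listed in Manifest.toml.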

For the rest, I’d say you can start working as in Matlab, and adopt the “Julia way of doing” progressively, without too much stress. I did it that way, and felt it a very satisfactory process.

My first move would be getting rid of Matlab’s rule of having each single function in a different file, which is really annoying. Group your functions in files as you see fit, and make them visible to your main script by include-ing those files.
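As a small illustration of grouping several functions in one file and include-ing it, here is a sketch in which the file name helpers.jl and the functions `discount` and `present_value` are made up for the example. The file is written to a temporary folder only so that the snippet is self-contained; in a real project it would simply sit next to the main script.

```julia
# Hypothetical helper file holding two related functions.
dir = mktempdir()
helpers = joinpath(dir, "helpers.jl")
write(helpers, """
discount(β, t) = β^t
present_value(flows, β) = sum(discount(β, t - 1) * f for (t, f) in enumerate(flows))
""")

# In the main script: make both functions visible with a single include.
include(helpers)

println(present_value([1.0, 1.0], 0.5))   # 1.0 + 0.5 * 1.0 = 1.5
```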

If you arrive to some point where you feel like packing some functions, so that they could be used in other projects, you can do this first move to get them closer to a package, without leaving the “comfort zone” of having all things inside your project:

  • Wrap the file with the function definitions into a module (exporting the ones you’ll use directly for easier usage). E.g. if your functions were myfunction1 and myfunction2, the file MyProjectFunctions.jl would look like:
module MyProjectFunctions

export myfunction1, myfunction2

# Here there is the code of `myfunction1`, etc.

end
  • Now, in the script where you had include("MyProjectFunctions.jl") add the line using MyProjectFunctions, to make the module usable.

Note: when you work with modules, you should be definitely using Revise. This package makes life more comfortable (not having to “re-include” files when functions are rewritten, making the experience closer to Matlab in this regard), but for modules it is a must, because redefining modules “by hand” (i.e. re-running the code that defines it) is problematic.

When you do this, you’ll see that the main difference of having the functions inside a module is that the keywords module and end create a sort of “fence”, such that only export-ed objects will be directly accessible by their name. You can have other auxiliary functions, constants, and other things nicely “hidden” in the module, such that to access them you should call them by the module’s name, e.g. MyProjectFunctions.auxiliaryfunction, etc. And you don’t have to worry about duplicated names between functions, variables etc. inside and outside the module. (Formally, this is all to say that modules have their own global scope, but maybe you don’t need to think a lot about it in the beginning.)
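A minimal sketch of that “fence”, with the module defined inline so the snippet runs on its own. The function bodies are invented for illustration.

```julia
module MyProjectFunctions

export myfunction1

# Exported: callable by its bare name after `using`.
myfunction1(x) = auxiliaryfunction(x) + 1

# Not exported: stays behind the "fence".
auxiliaryfunction(x) = 2x

end

# The leading dot is needed because the module is defined in this session
# rather than installed as a package in the environment.
using .MyProjectFunctions

println(myfunction1(3))                            # exported name
println(MyProjectFunctions.auxiliaryfunction(3))   # needs the module prefix

# A name outside the module does not clash with the one inside it.
auxiliaryfunction = "just a string in Main"
println(MyProjectFunctions.auxiliaryfunction(3))
```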

Once you feel confident with modules and those subtleties (and I think it doesn’t take too much once you try to create one), turning it into a separate package is just a couple of steps away:

  • Type ]generate MyPkg somewhere to generate a very basic folder structure for your package, or if you want something more complete, use any of the tools suggested by @Tamas_Papp (PkgTemplates.jl or PkgSkeleton.jl).
  • Now, copy the content of the module in the corresponding file in src, and you’re done!

If you just want to have the tools of such a package available for quick use in any project, you don’t really need to put it into a registry. Feel free to keep it in your Dropbox in path/to/MyPkg, and ]add path/to/MyPkg to use it as is, or ]dev path/to/MyPkg if you think you’ll have to tweak it while working in the projects.
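The generate-then-dev route can be sketched with the Pkg API as follows. The temporary folders are stand-ins for a real shared location and a real project, and the sketch assumes a working Pkg installation. It uses the ]dev variant only, since that one tracks the folder as it is on disk.

```julia
using Pkg

# Temporary folder standing in for a shared location such as path/to in the text.
workdir = mktempdir()

# Same as running `]generate MyPkg` inside workdir.
cd(workdir) do
    Pkg.generate("MyPkg")
end
pkg_path = joinpath(workdir, "MyPkg")

# A throwaway project environment that will use the package.
Pkg.activate(mktempdir())

# Same as `]dev path/to/MyPkg`: the project tracks the folder itself,
# so later edits to the folder are what the project sees.
Pkg.develop(path = pkg_path)

using MyPkg
println(pathof(MyPkg))   # points into the generated src folder
```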

@grero’s suggestion of using LocalRegistry is good, but I don’t recommend it for newcomers, since to start with you’ll need to deal with Git, and that may be a barrier if that’s new for you or your team (I don’t know what’s your case).

With respect to Git and related tools (e.g. Github, Gitlab or whatever), they are ubiquitously mentioned in docs about Julia, but my recommendation with respect to them is the same as for test-driven development. If you already know them, go ahead: they are super useful for collaborative coding and for avoiding frustrations due to code that was working and now doesn’t.

But (personal advice, maybe others won’t agree): don’t make the effort of learning Julia and Git at the same time. Focus on Julia and keep using Dropbox for sharing and managing versions, until you feel confident with the language. Then you may experiment with Git etc., when you feel like learning it to have better control on the versions.

7 Likes

Yes, branching is perfect to push/pull work in progress and keep it updated in different machines, without disturbing the master branch. When the changes are finished, merge it - or if you want to keep the history clean of the “broken” intermediate commits, you can merge --squash. (In GitHub etc. this is very straightforward via pull requests.)

Agree with @Tamas_Papp: keep source files, and rebuild the final documents locally. I usually add *.pdf to the .gitignore file to avoid them being committed by accident.

@Tamas_Papp Thanks for that git workflow guide. It’s helpful. As for package template generators, my understanding is that both yours and the other one enforce one git repository per package, which makes sense if these packages will eventually be in a registry (although based on LocalRegistry.jl, which @grero recommended, as well as Julia Lang’s own stdlib repo that doesn’t look like a requirement) but not in my case.

DrWatson, recommended by @Skoffer, seems like a really nice and modular tool (as in, it doesn’t come bundled with a “way of thinking”). The default and un-customizable folder structure is a bit restrictive, but you don’t have to use it. From a setup perspective, its main contribution seems to be the @quickactivate macro, which ensures that any script containing it, whatever subfolder it’s in, will run in your project environment, rather than whatever the currently activated environment is. As I understand it, this gives you enforceable reproducibility even for code outside a package, right? By enforceable, I mean that it doesn’t require you to remember to activate a particular environment.

There is actually another feature of DrWatson which makes it really useful for research: the ability to store “results” with metadata about the code version (git commit id) that produced them!

@heliosdrm, your answers are incredible! Thank you for taking all that time to write them up. I think there are a few caveats to what you wrote, but correct me if they don’t apply:

  1. You can’t ]add a package from a local path if it’s not its own git repository. You have to use ]dev.
  2. If you include("MyProjectFunctions.jl") containing a module MyProjectFunctions, you need to preface the module name with a . when using or importing i.e. using .MyProjectFunctions will work but using MyProjectFunctions will not, b/c it will search for the module as a package in the environment’s dependencies in Project.toml.

This raises a related issue when using Revise, which seems like a must in pretty much any Julia workflow. Based on the Revise documentation, one’s choices are either to replace all include() statements with includet() or to keep all modules in packages. If you do the former, you’re relying on Revise to exist in any environment you ever run your code in, which may be a bad assumption for stuff that may have to run in batch mode on a high-performance cluster. Which basically leaves you with having to create packages for every bit of non-trivial code. And once you create packages, you have to manage their own dependencies/environments in addition to your projects, i.e. a Revise-based workflow comes at a substantial cost. Or am I missing something here?
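One possible middle ground, offered only as a sketch and not taken from the Revise documentation, is to use includet when Revise can be loaded and fall back to a plain include otherwise. The helper name `load_tracked` and the constant `HAS_REVISE` are hypothetical, invented for this example.

```julia
# Try to load Revise; batch environments without it fall back to plain `include`.
const HAS_REVISE = try
    @eval using Revise
    true
catch
    false
end

# `load_tracked` is a hypothetical helper name, not part of Revise.
load_tracked(file) = HAS_REVISE ? Revise.includet(file) : include(file)

# Self-contained demonstration with a throwaway file.
demo_file = joinpath(mktempdir(), "helpers.jl")
write(demo_file, "double(x) = 2x\n")
load_tracked(demo_file)
println(double(21))
```

With this pattern the same script runs interactively (with live updates from Revise) and in batch mode on a machine where Revise is not installed, at the cost of one extra helper per project.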

Finally, @johnh, if you use Git LFS for storing large data files, do you store the large file on a remotely hosted repo e.g. Github? Do you pay for extra space? It would be nice to have git track large binary metadata while the actual files were stored somewhere else e.g. Dropbox, shared network drive, AWS, university HPC cluster, etc.

2 Likes

One git repository per package is no longer a requirement for registration.

This has been implemented for Julia 1.5. See the issue “Feature request: Store multiple registered packages in a single Git repository” (issue #1251 in JuliaLang/Pkg.jl on GitHub) for the development story.

3 Likes