Advice on structuring larger codebases

Hi,

I currently have a reasonably sized python codebase (10s of thousands of lines across 100s of modules). I rely heavily on hierarchical modules to structure my code - i.e. lots of directories with __init__.py and either files or directories within them. The top level is on PYTHONPATH.

I was trying to work out how to structure a similarly large codebase in Julia. Several tools (e.g. vscode) don’t seem to support detecting (in the sense of code completion) several modules in different files in the same folder. I’ve seen various comments to the effect of “just split everything into packages” and do dev MyPackage. This seems a bit strange to me - these modules are often not naturally separable into packages, I don’t want to have to build them separately, and I often want to be working on many parts of the codebase at once. Also, I would like some sort of hierarchy for different submodules and logically grouped code - I can’t see a natural / out of the box way of getting that.

So: people who have actually large production codebases in Julia: how do you structure them?

Please don’t take this the wrong way, but for a 10k LOC codebase, a lack of clear separation into smaller pieces suggests that the code organization could be improved.

You can do this - just dev them and use Revise:
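
A minimal sketch of that workflow (the package name and path are hypothetical):

julia> using Revise                # load Revise before your packages so edits are tracked

(@v1) pkg> dev ~/code/MyTools      # one-time: link the local checkout into the environment

julia> using MyTools               # changes to MyTools' source now take effect without restarting

After the one-time dev, later sessions only need using Revise before using MyTools.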

It’s not clear why you need this. Large modules do get organized into submodules (Base has several examples), but the benefit of doing the same for packages is unclear.

The Python codebase is well structured and organised - things are separated, hierarchically, into modules. In Python, it didn’t seem necessary to wrap these modules up into packages, since doing so didn’t give any benefit (there are no cases where we need to distribute packages - we are, and will remain, the only producer and consumer of the tools within this codebase).

Moreover, in Julia, putting every single module into a proper package of its own seems a bit overkill: it requires lots of extra cruft per module (MyPackage/src/MyPackage.jl plus a Project.toml). I don’t see the benefit of this, since I never want to distribute these packages on their own: they are just small cogs in a larger tool and will always be deployed together. Also, I don’t quite see the benefit of having to dev any code I’m working on: I’m always developing! It seems more of a nuisance to me…?

I just looked into the example of Grisu in Base. Fair enough - they achieve the hierarchy by manually including every submodule in grisu.jl. A bit clunky, and another thing to have to remember each time you add a submodule.

I am not one of those people, but…

This is definitely the Julian way. A bunch of interconnected stuff is often contained within a GitHub org, but as separate packages (cf. DiffEq.jl). If the code is truly non-separable, why have different modules in the first place? (one might say)

This is totally understandable though. One common way to do this is to have them in separate files that get include()ed in the main module file.
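
As a minimal sketch of that pattern (file and module names hypothetical):

# src/MyTools.jl - the main module file
module MyTools

include("Parsing.jl")   # defines submodule MyTools.Parsing
include("Reports.jl")   # defines submodule MyTools.Reports

end

# src/Parsing.jl
module Parsing

# split a comma-separated line into fields
parse_row(line::AbstractString) = split(line, ',')

end

# src/Reports.jl
module Reports

using ..Parsing          # sibling submodules can reference each other via ..

report(line) = join(Parsing.parse_row(line), " | ")

end

Callers then address the hierarchy as MyTools.Reports.report("a,b,c"), much like a dotted module path in Python.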

It is also worth remembering that the Julia project structure / dependency management is much nicer than something like virtualenv or conda, so you can have really tight control over dependencies, and you don’t gain nearly as much keeping everything in one folder compared to spreading things out. That is, having MainModule, SubmoduleA, SubmoduleB is not necessarily more controlled than having ParentPackage and PackageA/PackageB that depend on ParentPackage. You can even have a WrapperPackage that depends on and reexports everything in the others, so you only need to do one add.
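
A sketch of that wrapper pattern, assuming the Reexport.jl package and hypothetical package names:

# src/WrapperPackage.jl
module WrapperPackage

using Reexport                   # provides the @reexport macro

@reexport using ParentPackage    # users get everything these packages export...
@reexport using PackageA
@reexport using PackageB         # ...from a single `using WrapperPackage`

end

With this, a single add WrapperPackage followed by using WrapperPackage brings the whole family into scope.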

And in the end, you can do submodules if you really want - I’ve never done it so can’t give advice, but there are examples as @Tamas_Papp said.

This is just silly. They definitely should support this, and it is worth raising issues with the vscode package or Juno if it doesn’t work.

2 Likes

This makes sense for open source stuff, but what about a proprietary production codebase shared between only a few people?

Presumably the source code already lives in a file, so the only extra file you need is a Project.toml. I don’t consider this a lot of extra cruft, YMMV.

It is hard to talk about benefits and costs in the abstract, without knowing your setup. But usually, being able to develop and test modular pieces of a larger whole is nice.

E.g. consider the possibility that it is time to refactor SmallCog. Other packages that depend on it won’t be affected until they need to be, if you pin versions using [compat] in Project.toml. CI time will presumably be much faster than running all the tests for the whole project. Having SmallCog isolated in a CI environment ensures there is no accidental/unintended coupling with other pieces. Etc., etc. - just basic good software practices that you get at the small price of a bit of extra setup.
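
For concreteness, a sketch of such a pin in a dependent package’s Project.toml (SmallCog is the hypothetical package from above; it must also appear under [deps] with its UUID):

[compat]
SmallCog = "1.2"    # semver-compatible: allows >= 1.2.0 and < 2.0.0
julia = "1"

A breaking SmallCog 2.0 can then be tagged and worked on without disturbing this package until its compat entry is bumped.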

Indeed, if it is non-separable, don’t separate. The reason DiffEq is like this is that it covers a huge domain with tools for very different people. There is no cross-talk between the SDE solvers and the DDE solvers, for example, and almost nobody actually researches in both domains simultaneously (though we might add SDDEs, but that’s a different story). I think about 2% of people just use individual DiffEq modules now, and those are all of the more Julia-focused people (none of the core devs even keep DifferentialEquations.jl installed, and instead use OrdinaryDiffEq.jl, StochasticDiffEq.jl, or DiffEqParamEstim.jl directly). The rest we just point to DifferentialEquations.jl as a single entry point because it keeps things simple: here are the docs, the tutorials, the benchmarks.

My idea is, if you want to reduce dependencies, it’s easy to find the page that says everything is in different repos and all modular. If you don’t care to look for that, you probably want the simplest and most documented solution, which is just one unified package.

My personal opinion is that you should start as one package and then, as it grows, add submodules. As those submodules mature into things that can be standalone packages, you separate them out. This is easier than maintaining multiple packages that are very immature and have a lot of change in terms of syntax and dependencies.

I’m mostly maintaining a lot of stuff for my lab work now, though. So you should probably defer to the opinions of other, more experienced GitHub organization leaders here.

1 Like

Good question, and there are certainly people perusing these forums that have this use case. Hopefully they’ll weigh in. There are probably fewer public examples because, well, obviously those are proprietary.

Is it truly the case that you’re regularly developing all (or even most) of those tens of thousands of LoC and hundreds of modules? That sounds intimidating.

How is this different than manually creating a subdirectory and adding __init__.py? I personally find defining the stuff in the code way less clunky than being forced to use a particular folder structure, but to each their own.

To an extent, yes. For example, I’ll be working on some new script / logic that relies on a pre-existing module. I need to extend it in some way, so I’ll go and make a modification to the source (and refactor all usages across the codebase if necessary, usually automatically with PyCharm). This probably happens several times a day - the idea of having to dev before I modify each bit seems clunky.

This is a reasonable point, though it’s still a lot less work.

1 Like

I think you misunderstood something — the idea is that you dev your own packages once in the default environment, and then you don’t have to do it ever again.

3 Likes

Exactly. The way you (probably) do it in Python is the pip install -e /path/to/package mechanism, which I use when I work on my Python packages. You only need to create the “link” once…

The only difference in Julia is that with Revise.jl you don’t need to relaunch your session or scripts to have the code updated; it happens at runtime. In Python you need to relaunch the process after every change to already-loaded code.

3 Likes

I suspect you may have to do this less with Julia than with Python. Adding a method to a function doesn’t require you to go and change all uses of this function, unless they all actually need the new method. That is:

julia> foo(a) = a+1
foo (generic function with 1 method)

julia> foo(1)
2

julia> foo(a, b) = a + 2b # extended function
foo (generic function with 2 methods)

julia> foo(1) # still works!
2

This gets back to @Tamas_Papp’s point about code organization. It’s possible your project organization makes sense in a Python context but should be organized differently in Julia. Of course, you can organize things the same way, and that might make more sense given the extent of your existing codebase. But it may require you to do some kludgy tricks to make this work. I can see how this might be frustrating, but speaking for myself, I vastly prefer the Julia way of doing things. I’m sorry it’s extra work for you, but I’m glad that Julia didn’t try to follow the conventions of Python in this regard.

Is it?

$ touch path/to/module/__init__.py

vs

include("path/to/module/module.jl")

A couple more characters, I guess…

1 Like

Am I missing something? Why not just use modules?

What is the difference between Julia modules and Python modules? What is preventing you from organizing stuff in the same way in Julia? Is it just the vscode completion that is an issue?

Yes, I suppose I can just use Julia modules and add them to JULIA_LOAD_PATH.
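
A minimal sketch of that approach (path and module name hypothetical):

# one-off in the shell, before launching Julia:
#   export JULIA_LOAD_PATH="/path/to/mycode:$JULIA_LOAD_PATH"

# or per session, from the REPL:
julia> push!(LOAD_PATH, "/path/to/mycode")

julia> using MyModule    # finds /path/to/mycode/MyModule.jl

Any MyModule.jl (or MyModule/src/MyModule.jl) sitting directly under a load-path directory becomes loadable with a plain using.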

I still would be interested to hear from anyone who has a genuinely large, closed-source Julia codebase on what they’ve found works best.

Why closed-source?

1 Like

Invenia has what is probably the largest closed-source Julia codebase in existence.
Last count was about 50-odd closed-source packages, and about the same open source (not counting ones that are primarily maintained by others).

Invenia’s most significant application uses over half of them (transitively),
and almost all of them can also be used for research projects,
e.g. to develop and test new algorithms.

We have a private registry.

The breaking up of things into packages is great:

  • It means some parts of the system can be improved by different teams on different projects.
  • New features can be added, e.g. for a research project, while the main application stays a few releases behind – and that is basically fine.
  • The codebase can be updated piece by piece when packages we use have breaking changes; it doesn’t have to be a rewrite-everything-at-once affair.
  • The CI can test everything independently.
  • In theory one could have very different merge permissions for different parts.


It does have a few downsides:

  • Occasionally we need to backport bug fixes (because different versions are potentially in use).
  • Potential to want to be on the latest version of a package when that one is not yet compatible with something else. (Of course, without breaking things up, instead of not being on the latest, one would just have a broken build.)
  • People need to actually be careful about tagging releases (we basically use a continuous delivery strategy) and setting compat.
  • People need to understand dev and add, and adding versions and branches.

19 Likes

How are you managing your registry? PkgDev does not seem to work with SSH yet (which is required for 2FA on GitHub), and https://github.com/GunnarFarneback/LocalRegistry.jl is cool, but it looks like a lot of manual work every time we want to tag something.

Registrator, via the web interface, because it’s GitLab.
I believe people have the Registrator GitHub app working for private GitHub.

2 Likes

Feel free to suggest improvements to LocalRegistry; it’s not finished. My mid-term vision for in-company registration of packages is that it will be done by CI, configured so that a new version is registered and tagged whenever a previously unregistered version number is found in Project.toml on the master branch and the tests pass. LocalRegistry will at the least be developed to support that scenario.

3 Likes