I am in the process of converting a fairly complicated proprietary simulation model that I wrote in Python to Julia 1.5.3. The code is broken into around 20 sub packages on the file system, organized by purpose for reusability. Each sub package also contains a tests folder holding the unit tests for the modules inside it. In Python this is easy: each .py file is a module, each directory is a package containing those modules, all imports are explicit (i.e. from foo_dir.foo import stuff at the top of bar.py), and you can configure pytest to systematically find and execute all unit tests across the entire project folder tree. I have spent a couple of days trying to figure out how to do all of this properly in Julia. Here are a couple of things I have tried:
A. Use include("relative path to foo.jl") directly at the top of bar.jl
Very easy to use
Very easy to understand.
Both Juno and VSCode intellisense play well with this method.
If foo.jl moves on the file system, updating the path is initially simple. However, this quickly becomes a pain if many barN.jl files include foo.jl all over the place. Python actually suffers from the very same problem, but PyCharm tends to refactor imports automatically, so it's not a big issue there.
Can cause redefinition of global constants in foo.jl if include("foo.jl") is called multiple times due to multiple inclusions. There is no control over what's included from foo.jl and what's not, which means name collisions can happen easily. This basically makes the approach usable only for very small projects. This is a deal breaker.
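As a concrete illustration of the deal breaker (a minimal sketch: the foo.jl contents are hypothetical and inlined via include_string so the snippet is self-contained):

```julia
# Two modules each "include" the same hypothetical foo.jl.
# Each inclusion evaluates the struct definition again, producing
# two distinct, incompatible types that share the name Point.
foo_src = """
struct Point
    x::Float64
end
"""

module A end
module B end

include_string(A, foo_src)   # defines A.Point
include_string(B, foo_src)   # defines B.Point -- a different type!

A.Point === B.Point          # false
A.Point(1.0) isa B.Point     # false: "same" name, but incompatible types
```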
B. Wrap the code in foo.jl in module Foo ... end and export the desired public names from foo.jl. Import with using .Foo: x, y or import .Foo: x, y.
Pro: AFAIK, none! More on this in the Con section.
This only “looks” more like what Python does: in order to actually import/using the module, it either requires include("foo.jl") just like A (and therefore inherits all of A’s cons) or adding the code to LOAD_PATH.
Names from Foo can end up living under all kinds of odd prefixes unless you use Reexport and the @reexport macro. This feels hacky and, to be honest, pretty unintuitive.
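For reference, option B looks something like this (a minimal sketch; in a real project the module would live in foo.jl and be pulled in with include("foo.jl") first, inheriting option A's problems):

```julia
module Foo
export x
const x = 42
helper() = "not exported"   # hypothetical non-exported name
end

using .Foo          # the leading dot: Foo was defined locally, not installed

x                   # 42 -- exported names arrive unprefixed
Foo.helper()        # non-exported names still need the Foo. prefix
```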
C. Build the sub packages into actual packages. Run generate foo_dir/Foo in Pkg mode and use dev foo_dir/Foo in bar’s environment. Any sub package (i.e. bar) that wants to use/extend Foo can then do import Foo: x, y etc. in its code.
This is by far the best-behaved solution and the one I am leaning towards. In practice everything pretty much behaves how I want it to, because AFAIK it’s essentially like “installing” the Foo package into the parent environment (in dev mode, but whatever).
I haven’t tried out setting up all the unit tests for different sub packages yet but I imagine it’s pretty much the same process as setting up the unit tests for just a single package.
Environments everywhere. Every sub package gets its own environment with its own dependencies (i.e. 20 different Manifest.toml and Project.toml files at various locations in the project repository). This also makes me worried about the mess I might have to deal with when some foo.jl inevitably has to move out of, or get merged into, another (new) sub package during development.
It’s a pain to generate the proper file system tree for the sub packages. First of all, the source files for each sub package now live under the subpackage_dir/src/ directory instead of at the sub package root dir. TBH, I can live with this. The other problem is that, in order to generate the package at the correct file system location, you have to be careful about which environment is currently activated. That’s quite a bit more thinking, running commands, and swapping environments than I’d like for something that should be very simple.
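For concreteness, the dance looks roughly like this at the Pkg REPL (paths hypothetical, relative to the current working directory):

```
# press ] at the julia> prompt to enter Pkg mode
(MyProject) pkg> generate foo_dir/Foo    # creates foo_dir/Foo/Project.toml and foo_dir/Foo/src/Foo.jl
(MyProject) pkg> activate bar_dir/Bar    # must remember to switch environments first...
(Bar) pkg> dev foo_dir/Foo               # ...then dev the sub package into Bar
```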
VSCode intellisense seems to play badly with this. I am guessing it might have something to do with the language server’s cache being keyed on a hash of the package’s version number. Since the sub package isn’t really a package distributed through a registry, its version number won’t be bumped for different releases either (and therefore the cache doesn’t get regenerated when, say, foo.jl changes). Again, just a guess. At least Juno seems to do intellisense properly under this method.
It’s entirely possible I have missed something simple and obvious. Is there a recommended way to deal with this?
Yeah, definitely a deal breaker. And, in fact, it’s even worse than you’ve said, because any types you define in foo.jl will be re-defined each time it is included, resulting in incompatible types with the same name. Every file must be included no more than once, period.
This sounds pretty appealing, although I agree it adds some complexity. I think an important question that only you can answer is: Is each sub-module truly an independent entity? That is, is it something you imagine working on completely in isolation, managing its own dependencies and perhaps even installing on its own? If yes, then this makes sense–the sub-module is its own package, therefore it must know what it depends on (and therefore it must have its own folder, its own Project.toml, its own tests, etc). If no, then perhaps this sub-module is just a logical chunk of some larger project, but not something you’d actually want to install by itself. In that case, I’d propose something like what JuMP and many other projects do, in which there are some sub-modules but they are all included exactly once by the main JuMP.jl file, e.g. https://github.com/jump-dev/JuMP.jl/blob/master/src/JuMP.jl and a sub-module here: https://github.com/jump-dev/JuMP.jl/blob/master/src/Containers/Containers.jl This avoids all the extra src folders and .toml files, since it treats the sub-modules as parts of a greater whole rather than standalone projects. It should also work better with VSCode’s intellisense.
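A minimal sketch of that JuMP-style layout (module names hypothetical; in the real layout, the submodule lives in src/Containers/Containers.jl and is pulled in with a single include):

```julia
module MyProject

# In the real layout this line would be: include("Containers/Containers.jl"),
# executed exactly once, right here. The submodule is inlined for self-containment.
module Containers
export Thing
struct Thing end
end

using .Containers    # make the exported names available inside MyProject

end # module MyProject

MyProject.Containers.Thing   # every name lives under one predictable prefix
```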
I suspect the latter is closer to what you’ll want–after all, unless you actually had setup.py or requirements.txt files in each of your Python project’s sub-folders, those components are not really independently installable either.
Really the only downside of the latter approach is that it doesn’t provide an automatic way to test only a sub-set of the code in one of those modules. Projects using the structure I proposed above (like JuMP) almost always organize their test folder to match the hierarchy of src, which can allow you to test specific chunks by including only whatever subset of those files corresponds to the module of interest. I agree this isn’t as nice as being able to pytest foo.bar, but I have found it to work well enough.
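The mirrored test layout amounts to something like this hypothetical test/runtests.jl (the per-file testsets are inlined here so the sketch runs standalone):

```julia
using Test

@testset "MyProject" begin
    # In a real project these would be include("containers_tests.jl") and
    # include("claims_tests.jl"), mirroring src/. To test only one chunk,
    # run just that include.
    @testset "Containers" begin
        @test 1 + 1 == 2                         # placeholder test
    end
    @testset "Claims" begin
        @test startswith("claim_001", "claim")   # placeholder test
    end
end
```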
If you are starting such a large project (approx. 20 modules), I suggest taking advantage of LOAD_PATH instead of using pkg> add or pkg> dev. It is the easiest way to learn Julia development, and you can migrate to the pkg> system when you are closer to publishing your solution to Julia’s General registry (if you so desire).
Quick dev: 1 file per “software module”:
If you are migrating from Python, you might want to try the 1 file per “software module” solution:
If you do this, you should no longer use pkg> add Module1. LOAD_PATH takes care of making it available to your project.
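A self-contained sketch of the LOAD_PATH mechanism (the module file is written to a temp directory purely for the demo; in practice you would push your real project folders):

```julia
# Write a hypothetical one-file module, point LOAD_PATH at its folder,
# and load it with plain `using` -- no pkg> add, no pkg> dev.
dir = mktempdir()
write(joinpath(dir, "Module1.jl"), """
module Module1
export greet
greet() = "hello from Module1"
end
""")

push!(LOAD_PATH, dir)   # Julia now searches this folder for Module1.jl
using Module1

greet()                 # "hello from Module1"
```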
Typical package directory structure:
But to be able to migrate to the pkg> system more smoothly, I suggest you use the proper Julia directory structure from the start. Among other benefits, this structure (optionally) includes a test/ folder for CI tests, and more clearly groups together package code that is split across multiple files.
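That structure is the standard package layout (names hypothetical):

```
Foo/
├── Project.toml      # name, UUID, [deps] for the Foo sub package
├── src/
│   └── Foo.jl        # module Foo ... end; include()s any other src files
└── test/
    └── runtests.jl   # executed by pkg> test Foo
```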
Actually, the good news is that, as of literally today, you can now get Python-style from syntax in Julia. (It’s even a little bit better, as it doesn’t have a couple of the edge-case warts Python has.) Then you can organise things pretty much like you would in Python, without the various complexities described above.
The package for doing so is FromFile.jl. (It’s not been registered yet so look up how to install from GitHub, but it’s fully tested and as far as we know bug-free!) If you’re curious, FromFile.jl is a draft implementation for Issue 4600, where there is an ongoing discussion about how to solve the exact issue you’re describing.
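Usage looks like this (a sketch based on FromFile.jl's @from macro; the file paths are hypothetical and are resolved relative to the importing file):

```julia
using FromFile

@from "folder/foo.jl" import stuff   # like Python's: from folder.foo import stuff
@from "../baz.jl" import Baz         # importing a module defined in baz.jl works too
```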
There’s not much point in rewriting a Python package into Julia if you’re just going to do a word-for-word translation. I would step back to see the bigger picture. Ask yourself these questions:
Are there similarities among the sub-packages that can be codified into a generic interface?
Are there common operations (methods) that appear across different sub-packages?
The single most important feature of the Julia language is multiple-dispatch. Generic programming and multiple-dispatch are the core of the language. Dividing your code into dozens of modules works against this. Try to find generic functions that make sense for your domain and then overload them as needed. Whereas large Python packages contain complicated trees of nested modules, Julia packages are relatively flat in order to leverage generic functions and multiple dispatch.
Suppose your Python code contains methods like this:
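For instance (a hypothetical sketch, all names invented): where Python would hang a payout-style method on each claim class, Julia defines one generic function and overloads it per type.

```julia
abstract type Claim end

struct AutoClaim <: Claim
    damage::Float64
end

struct HealthClaim <: Claim
    bills::Float64
end

# One generic function, overloaded per concrete type -- the Julian
# replacement for per-class methods.
payout(c::AutoClaim)   = 0.5 * c.damage
payout(c::HealthClaim) = 0.75 * c.bills

payout(AutoClaim(1000.0))     # 500.0
payout(HealthClaim(200.0))    # 150.0
```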
Thanks for the advice, but I am already well aware of the mechanical differences between the two languages in that regard. The point of the rewrite is runtime performance, and the rewrite is certainly not a word-for-word translation, precisely because of multiple dispatch and Julia’s “data only” OO model. A significant chunk has already been rewritten (in a Julian way) and benchmarked against its Python counterpart. However, I do want to preserve the overall code organization structure already adopted by the Python code, because that structure represents (nested) logical components of the problem it is trying to solve. After some experimentation I think this is possible - more on this later.
Here’s the solution I tested and will probably adopt. Basically it’s “add all possible paths in this project to LOAD_PATH and modularize all necessary entry points”.
Create a setpaths.jl that contains the following:
```julia
if !isdefined(Main, :__setpaths__)
    global const __setpaths__ = true

    # walkdir yields (dir, subdirs, files) tuples; collect every directory path
    getpaths(root) = collect(abspath(dir) for (dir, _, _) in walkdir(root))

    paths = getpaths("$(your project root here)")
    for path in paths
        push!(LOAD_PATH, path)
    end
end
```
The code is guarded by the __setpaths__ flag so it can safely be included multiple times. Include this file somewhere at the top of any entry-point script.
Basically, just wrap each module entry point in module ... end and use it directly in other code like an installed package module (without the . prefix), wherever you like. I also found that under this method: a) the module entry point must live in a .jl file named after the module; for example, if the entry point is module Foo ... end, then the file must be named Foo.jl. b) An entry-point file can only contain one module definition. Not sure why.
(Almost) all of the benefits of the sub-package through generate/dev techniques, none of the complexity.
Modules are all “flat” in the global namespace, so if you have a lot of nested structure you can get lost, with no idea where to find the code. Then again, even if you go with the sub-package technique you will have the same issue.
Breaks all intellisense/autocomplete known to man when not developing within a module. Combined with the above, it can be kind of devastating.
I can’t really change the organization structure too drastically, because then the junior programmers who are less versed in the design of the solution will be entirely lost once they can no longer use the more familiar Python code base as a reference. But let me give a very vague example demonstrating how the structure came to be.
Suppose you are trying to solve a problem. Let’s say part of the problem involves finding out how much an insurance company will pay for a particular claim. There are different types of claims. Regardless of the specific type of claim they all must follow a flow of procedures. However the specifics of the procedures can be different. Yet every procedure still shares certain similarity.
This leads to the following natural nested structure:
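Sketched as nested Julia modules (names hypothetical), that structure might look like:

```julia
module Claim
    abstract type AbstractClaim end

    # Procedure lives under Claim: it covers procedures for claims and nothing else
    module Procedure
        abstract type AbstractProcedure end
    end
end

Claim.Procedure   # the nested path mirrors the logical structure of the problem
```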
Note that it makes sense for Procedure to live under Claim because it is about procedures specifically related to the Claim and nothing else. Also, it makes it much easier for the person reasoning about the solution to organize the code because it has a close correspondence to the logical structure of the problem.
I actually did try FromFile already. I ran into an issue when trying out some weird imports from an upper level of the filesystem hierarchy. I will see if I can reproduce the problem and figure out whether it’s me or an actual issue.
Is the approach of loading things from files as opposed to packages and modules possibly going in the wrong direction? How do we know we can trust stuff that can be pulled out of files? The file may be a script for all we know.
If we’re trying to access a file’s contents, we should already know whether or not it is a script. (And I’d note that the same is already true of the current approach when accessing a file’s contents via include.)
Yeah rest assured I am not advocating for the ability to import from any arbitrary files, just user source files under the project root folder. If the user can’t figure out which file he should/shouldn’t import despite having direct access or even having written the code himself then I’d say he has no business fooling with the code in the first place.
Also, I am pretty sure FromFile does not deny the user the ability to manually specify modules as usual (and subsequently import said module using @from). That is probably the proper way to use it.