Just see the earlier discussion:
The most insidious is that it would affect I/O at loading.
I could get behind @Skoffer’s idea regarding the keyword depends. I would favor the word depends over import, since import is already used to mean that you are importing objects from a module.
@patrick-kidger My apologies if the depends idea is essentially what you had in PatModules.jl… perhaps I didn’t look closely enough at the details of that package.
The advantages of depends, from my point of view: dependencies would be stated explicitly at the point of use, rather than being implied by include statements.
However, the example that @Skoffer provided is not the best example, because you can define your methods of foo in whatever order you want. There’s no sense in which the A.jl and B.jl files depend on the utils.jl file. (function foo end is not more primal than any other method.) If we ignore modules that run imperative code, the only thing that matters is that types are defined before they are referred to.
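A tiny sketch of that point (not from the thread, just for illustration): the empty function foo end declaration can even come after a method, as long as the types exist before they are used in a signature.
struct S end
struct T end

foo(::T) = 2        # methods can come in any order...
function foo end    # ...and the empty declaration is not more primal than any method
foo(::S) = 1

foo(S()), foo(T())  # (1, 2)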
Let me elaborate a bit more on why I don’t like turning every file into a module. Suppose I have a generic function foo, with various methods, that gets used throughout a package DemoPackage.jl. With Julia as it is now, foo “belongs” to the DemoPackage module. But if every file now has to be a module, I have to arbitrarily pick a submodule to “own” foo, and then I have to import foo from my arbitrarily chosen submodule every time that I want to extend it.
Below is a runnable example of what this would look like, using the current module system with nested modules. I think it would look basically the same if the nested modules in the example were separate files with the file-module correspondence enforced.
module DemoPackage
module A
struct S end
foo(::S) = 1
end
module B
import ..A: foo
struct T end
foo(::T) = 2
end
module C
import ..A: foo
struct U end
foo(::U) = 3
end
import .A: S
import .B: T
import .C: foo, U
export foo, S, T, U
end
julia> using .DemoPackage
julia> methods(foo)
# 3 methods for generic function "foo":
[1] foo(::S) in Main.DemoPackage.A at REPL[1]:8
[2] foo(::T) in Main.DemoPackage.B at REPL[1]:14
[3] foo(::U) in Main.DemoPackage.C at REPL[1]:20
Note how I arbitrarily picked module A to “own” foo. But in reality, none of the modules really owns foo—it is a generic function whose full definition spans multiple modules. There are other arbitrary choices I had to make here:
- In module C, I could have done import ..B: foo instead of import ..A: foo, because now B “owns” foo just as much as A does!
- To bring foo into scope elsewhere, I could have written import ..A: foo, import ..B: foo, or import ..C: foo. It’s an arbitrary choice, because they all refer to the same generic function.
To make matters worse, I’ve actually introduced an artificial code dependency that wouldn’t have existed otherwise. Look what happens if I transpose the definition of module B above the definition of module A:
module DemoPackage
module B
import ..A: foo
struct T end
foo(::T) = 2
end
module A
struct S end
foo(::S) = 1
end
module C
import ..A: foo
struct U end
foo(::U) = 3
end
import .A: S
import .B: T
import .C: foo, U
export foo, S, T, U
end
If I run the new DemoPackage with the order of A and B flipped, I get ERROR: UndefVarError: A not defined. You might say, “That’s exactly how it’s supposed to work. The code dependency has been enforced.” But the point is, there should not be a code dependency here, because I can define methods in any order I want! I could have done the following, where I can flip the order of defining foo(::S), foo(::T), and foo(::U) to my heart’s content:
module DemoPackage
struct U end
struct T end
struct S end
foo(::U) = 3
foo(::T) = 2
foo(::S) = 1
export foo, S, T, U
end
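For contrast, in this flat version every method reports the package itself as its home; the REPL output would look approximately like this (line numbers and ordering are illustrative):
julia> using .DemoPackage

julia> methods(foo)
# 3 methods for generic function "foo":
[1] foo(::U) in Main.DemoPackage at REPL[1]:5
[2] foo(::T) in Main.DemoPackage at REPL[1]:6
[3] foo(::S) in Main.DemoPackage at REPL[1]:7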
To summarize: forcing every file to be its own module means arbitrarily choosing an “owner” for each generic function, and it introduces artificial ordering dependencies between files that wouldn’t otherwise exist.
Here’s another example that demonstrates the vacuity of a system like FromFile.jl. Suppose we start with this:
# A.jl
function foo end
# B.jl
@from "A.jl" import foo
bar(::Int) = 1
foo(x) = bar(x)
Ok, great, now we know that file B.jl only depends on file A.jl. But suppose we add another file like this:
# C.jl
@from "B.jl" import bar
bar(::Float64) = 2
The new package with the addition of the C.jl file will load and run just fine. But now the behavior of foo is different, because we’ve overloaded bar. File B.jl asserts that its contents, including foo, only depend on file A.jl, but that is wrong! In fact, the behavior of foo now also depends on file C.jl, even though there is no @from "C.jl" statement at the top of B.jl.
To summarize: the @from dependencies listed at the top of a file are incomplete.
How does ownership of the function mesh with what Julia reports as the parent module for foo?
julia> parentmodule(foo)
Main.A
julia> parentmodule(B.foo)
Main.A
Even though foo gets imported in B and C, isn’t that merely bringing in a function from a different module as an alias in the current module? Sure, each module owns a method of foo, but I would actually say that neither B nor C owns the function (although they clearly influence its definition).
Edit: ah, just noticed that parentmodule() can take a second parameter specifying a particular type signature for a function. When it is left off, it reports the module of the first definition of that function.
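For example, assuming the same setup as above with a type T and a method foo(::T) defined in B, the method-specific query reports B rather than A:
julia> parentmodule(foo, Tuple{B.T})   # module of the first method matching this signature
Main.B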
I think you misunderstand. I can’t speak for others, but loading code with using etc. from the file system I consider an elegant feature, which would be nice to keep.
It is implicitly assumed that the difficulty of internal organization of packages increases with the amount of code in a package.
IMO maintaining complex and large packages is key to this discussion. After all, for a small package everything can go in one file, or maybe a few files with a couple of includes, and the issue becomes less pressing. It is no coincidence that a lot of package authors have a difficult time understanding the motivation for the proposal.
So @Skoffer raised a similar point before. I agree this seems like a weakness. I would note that the current approach doesn’t really change this, though. It just hid it from you, by “implicitly importing” everything when you use include. (So to speak.)
FWIW, in practice I find most of my uses of multiple dispatch do involve a natural “owner” of the function, whilst the others are merely seen to be extending it with additional methods.
This at least shouldn’t be an issue. Part of the current discussion is that this kind of topological sorting no longer becomes something needing manual resolution: the statement import A will trigger the loading of A.
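If I understand FromFile.jl correctly, that is also how it behaves today: a file is evaluated at most once, no matter how many files reference it, so there is no manual include ordering to maintain. A sketch with hypothetical file and function names:
# setup.jl
using FromFile
@from "beliefs.jl" import update_beliefs   # loads beliefs.jl the first time it is referenced

# model.jl
using FromFile
@from "beliefs.jl" import update_beliefs   # reuses the already-loaded beliefs.jl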
Yep, I’m aware of this example. I would respond by noting that you at least still know which file foo comes from (if not which method). This is more than you could say before. Meanwhile, the current include-based approach permits scenarios like this:
# main.jl
include("A.jl")
include("B.jl")
...
include("Z.jl")

# A.jl
foobar(...) = ...

# Z.jl
foobar(...) = ...
in which A.jl creates a function with a very generic name (foobar), and then Z.jl, instead of creating a new function, actually creates a new method, potentially changing the behaviour in the almost completely unrelated A.jl. Very easy to imagine happening in larger projects.
The key point is that – to use SciML as an example – it makes sense to split things up into multiple packages. The components are large and decoupled enough that this is sensible. So “what if we made SciML a monorepo” isn’t necessarily a good example.
Probably the best examples would be commercial software products (based on my experience with other languages): projects which are large and complex, which your latest junior dev can contribute to whilst only needing to understand a subset of the codebase, and which the language largely isolates from making changes that affect the rest of the codebase. My interest in this feature is really about making this kind of commercial-scale stuff easier to manage.
I haven’t read the full series of posts, but for a large, complex codebase that went the monorepo route:
It uses a lot of nested submodules that can import from each other via using .. (relative imports). For example, the MathOptInterface.FileFormats.MPS sub(sub)module can use things defined in the MathOptInterface.Utilities submodule.
If you want to be sharing some utils.jl file between other files, make it a submodule and import that.
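A minimal sketch of that pattern (hypothetical names, not the actual MathOptInterface source):
module MyPackage

module Utils
export clamp01
clamp01(x) = clamp(x, 0, 1)     # shared helper that several submodules need
end # Utils

module Algorithms
using ..Utils: clamp01          # the dependency on Utils is stated explicitly here
run_step(x) = clamp01(x) + 1
end # Algorithms

end # MyPackage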
Maybe I need to read the full series of posts…
Sure, but that’s not the question here. The relevant point is that SciML, JuMP, JuliaImages, JuliaSmoothOptimizers and many similar projects are already large and complex enough that their principal architects are in a good position to speak about the benefits of a proposal like this one from their own perspective, and it would be interesting to hear their opinion.
Sure, we can move the goalposts, but Pumas is a great example for this. It’s already quite a “commercial” success (commercial in America California?): it is already one of the FDA eCTD standards for clinical trials, and remember “Pumas has emerged as our “go-to” tool for most of our analyses in recent months” from Husain, the head of clinical pharmacology at Moderna, in 2020. It similarly includes work with Pfizer’s QSP team, United Therapeutics, etc., so it’s full enterprise and something that’s really out there now, with a relatively good-sized team of full-time engineers.
It’s built the same way as SciML with separate repos for separate functionality, though there is a slightly larger core in Pumas.jl. It’s interesting to see why it’s done this way though. In the Pumas case, it’s pretty necessary because the alternative repos are related to differently licensed products, such as Lyv, OptimalDesign, or the coming Pumas-QSP. So the split is not just for development reasons, but also for licensing and access rights. It also helps with the marketing as well, because then specific areas can be branded, talked about in workshops, and focused in videos with clear naming schemes.
JuliaSim is the new modeling and simulation product coming out of Julia Computing, with a quick overview here and a deeper technical discussion to occur at the 2021 Modelica Conference and at JuliaCon 2021 (register everyone!). It again has this same development split, this time not because of licensing (because instead the coming model is a cloud license to the full suite), but because there is a major separation of concerns between HVAC models and electrical circuits.
If you look at similar simulation-based products in the area, such as Modelon’s model libraries and Ansys products, they similarly have this form of modularization, this time for licensing reasons again, but also for marketing. If you look at the marketing materials that are online, they are per module, and almost certainly the portions relating to a given industry/customer are just picked up to quickly piece together new demonstrations. Many times you can then have the sales team divide into specific specialties, which makes sense as they won’t have the technical knowledge behind a lot of the work, so focusing on one area allows a bit more depth (and builds a better network).
Thus while I know Google famously uses a monorepo, FAANG companies are rather outliers, and recognizing how and why they are different is important to business success. Monorepos do have the advantage of immediately surfacing the downstream effects of a PR, and when people are employees you’re much more able to say “you have to fix that by Friday, pull in X from the deployment team if you don’t know how”. And for a purely cloud-based product where you don’t have to license things separately, it does make the deployment of a single sign-on much easier. But then again, it does add developer burden, which Google is able to handle by paying famously high salaries and hiring some of the best engineers out there, so if your PR in ML packages breaks an RSS feed in the webserver, you’re supposedly smart enough to be able to cross domains and fix it.
Google also writes a large fraction of its own dependencies, and so not requiring a modular system can make sense. Contrast that with these Julia-based organizations, which require a form of modularity from the start because they are already based on open source software (such as the SciML organization), and so even if there were a monorepo there would still be dependency management required. That said, monorepos do seem pretty enticing for cloud products if you’re mostly focusing on the deployment side.
I’m also interested in these questions from an industry point of view. I won’t claim that we have a huge amount of Julia code, nor are we a large company, but we have enough code that scaling issues need to be considered.
In my experience include order is a small irritation. Not a non-issue but also not a significant one. I have never once seen a double include issue and in fact have trouble understanding why people are even talking about it (no need to rehash that, it’s there somewhere in this thread). I have some sympathy for the viewpoint that it should be easy to predict or find where a function is defined but not to the point that I want to see a forced mapping to filenames.
To some extent this can be influenced by the fact that I’ve found submodules to be rather unergonomic and much prefer to split code into packages. However, even if Julia packages are rather lightweight, it doesn’t scale to split all your code into a large number of micro-packages. There’s a bit too much ceremony around creating a package, the registration overhead is not negligible (even more so in General than in a private registry), and it’s a mental burden to keep track of hundreds of small packages and their versions and compat. With subdir packages you can also avoid having hundreds of small repositories, but that’s only a minor gain.
For a concrete example we have an internal file format implemented in a Julia package. This consists of a write function and a read function. The writer is slow, requires a lot of dependencies and is basically run once for a given set of data, offline. The reader is fast, used online and needs no dependencies. The fact that users of the reader pull in lots of irrelevant dependencies could be solved by splitting out the reader into its own package but it’s not really attractive. The writer and reader are inherently coupled, need to be tested together and if the file format is changed in some way they need to be updated in tandem.
What I would like is to have the power of packages available at the submodule granularity, with less overhead than full packages. I.e. a subpackage concept that could have the following characteristics:
- Subpackages relate to the directory structure of a package. When you import P.A.B, Julia will load the file <path to P>/src/A/B/B.jl and import the module B, which must be defined in that file. Loading of subpackages should not require the full package to be loaded as well.
- Subpackages share UUID, version number, and Project.toml with the main package.
- Subpackages can have different sets of dependencies, and other packages can depend on specific subpackages. compat is shared with the main package and can only be specified at the main package level.
- Only the main package is registered, and subpackage information is stored in the registry files for the main package.
And yes, I’m aware that this might be breaking and/or fundamentally difficult to implement. There are many details that would need to be worked out and tested. But I would find it a lot more useful than other ideas that have been discussed in this thread.
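For what it’s worth, here is a sketch of what usage might look like under such a (hypothetical, not currently supported) scheme, with made-up names:
# Hypothetical directory layout:
#   P/Project.toml        <- shared UUID, version, and compat
#   P/src/P.jl
#   P/src/A/B/B.jl        <- defines `module B ... end`

import P.A.B                # would load only src/A/B/B.jl, not all of P
B.read_record("data.bin")   # made-up function defined in the subpackage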
The current subpackage infrastructure is close. JuliaImages used to do this but went back, IIRC. KernelAbstractions.jl is a good example that uses this for the CUDAKernels and ROCKernels subpackages, which live in the same repo but are different registered packages with different dependencies, so each has a separate Project.toml. Indeed I think this structure could be improved with shared Project.tomls when dependencies are the same, though I wonder what kinds of extra complexity having such a system in the package manager could cause.
Not really. The usual terminology is subdir packages and there’s nothing “subpackage” about them. They are fully independent packages which just happen to be co-located within a common repository.
(The second use of subdir packages is when you have one Julia package which only is a part of a repository, e.g. a Julia wrapper package for a C library in the same repository. This use case works great and was the reason I helped push subdir packages over the finish line for Julia 1.5.)
I just mean close in terms of implementation distance. To go from this to something that can opt out of having different dependencies and ship with the main package doesn’t seem too far off, though it could add some odd complexity.
Yes, I understand that. The point was that the dependency is artificial. The methods of a generic function can be defined in any order, so B doesn’t really depend on A, it’s just forced to import foo from A to ensure that B overloads the same generic foo.
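A minimal sketch of why the import is needed at all (hypothetical modules): without it, the two foos are unrelated generic functions rather than methods of one function.
module A
foo(::Int) = 1
end

module B
foo(::Float64) = 2    # a brand-new function B.foo, not an extension of A.foo
end

A.foo(1)      # 1
A.foo(1.0)    # ERROR: MethodError -- B's method never became part of A.foo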
If I’m reading a file and I see the line baz(x) = foo(x), now I know that this file depends on the function foo. And because all functions are generic with global method tables, the actual behavior of foo depends on any number of foo method definitions, which can occur anywhere before or anywhere after the line baz(x) = foo(x) (even in a different module!). An import statement like from A import foo implies that everything I need to know about foo can be found in A, but it’s just not true—not even close to being true.
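A small sketch of what I mean (hypothetical modules and names):
module A
function foo end
baz(x) = foo(x)       # baz depends on the generic function foo
end

module C              # nothing in A refers to C...
import ..A: foo
foo(x::String) = uppercase(x)   # ...yet this changes what A.baz("hi") does
end

A.baz("hi")   # "HI" -- the behavior came from module C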
(Luckily the mental burden of understanding foo(x) is not that bad, because generic functions are given generic docstring definitions that are intended to apply to each method. In other words, foo(x) has approximately the same meaning regardless of the data type of the input.)
If I have all my foo definitions in the same, flat, package namespace, I don’t think there’s any “implicit importing” going on. It’s exactly the opposite, in fact. It’s all the same namespace, so of course I don’t have to do any importing (implicit or not).
I want to add a slightly different perspective. I think solving this issue can also be beneficial for small projects. I myself work on agent-based models of social systems with code sizes that are minuscule compared to many of the projects mentioned here.
However, it would still be nice to be able to split my code into independent units without having to jump through hoops. As an example:
In a given project I might, besides the main model, have some code that generates an artificial geography, code responsible for setup, a belief dynamics sub-model, some bits that are responsible for the gui, etc. All of these parts cover well-described and separate functionality, so they should live in their own modules, which I would like to keep in their own separate (local) files/directories. So far so good; if I want to use any of these modules I just do the not pretty but workable include("M1.jl"); using .M1.
The problem is, there might be cross-dependencies between these modules. For example, the belief dynamics might be used by the main model as well as the setup. If I keep doing include + using at the point of use I might start running into double include issues. The canonical solution for that at the moment seems to be to have some sort of main.jl file that includes all of the single module files and then do using to my heart’s content wherever necessary.
In my opinion that’s not a very satisfactory solution, however. It makes it impossible to track dependencies locally, i.e. at the point where they are used. If I look at my setup code, for example, I want to see immediately that it depends on Beliefs. I could have the using locally, but then I would have the dependency split over two completely separate places (with the include still living in main.jl)!
This becomes even more aggravating if I have different configurations of the model that use different configurations of modules. I might for example have a run_alt.jl that uses a different setup that does not depend on my geography module. Now I need a different main.jl for a different combination of module files. I find having to keep track of these dependencies globally quite unintuitive and error-prone.
My workaround so far has been to globally (i.e. somewhere at the beginning of the main script) add the current directory to LOAD_PATH and then just use using wherever necessary and/or convenient. It works, but it would be nice if the language offered a more canonical solution.
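Concretely, the workaround looks something like this (Beliefs.jl being a hypothetical module file in the project directory):
# At the top of the main script:
push!(LOAD_PATH, @__DIR__)   # make module files in this directory findable by `using`

# Then, wherever it is needed, without any include:
using Beliefs                # loads Beliefs.jl from the directory added above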
P.S.: This is my attempt at a solution.
Can you explain the “so they should live in their own modules” part? What are the benefits of putting everything in separate modules? I mean, it is reasonable to put everything in separate files - it helps with readability - but what is the purpose of wrapping everything in modules? The only reason I can think of is if you have functions that have a different meaning but overlapping signatures; then yes, modules help you to distinguish them. But other than that, it seems that putting everything in modules wouldn’t do any good, quite the contrary: it adds artificial restrictions, lots of redundant using/import, and interferes with multiple dispatch.
Hmm, to be honest putting self-contained functionality into modules seems so self-evidently reasonable to me that I haven’t really thought about it explicitly so far.
I think it’s basically the same reasoning as that behind writing separate classes for separate things in an OOP language or having as few global variables as possible. You want to minimise the potential of unintended interactions. That avoids bugs but it also serves as a cognitive help. When I am working on some self-contained code that is inside a module (as opposed to an “open” file) I know that the only interaction it will have with the outside world is via local imports and the things the module itself exports (not strictly true of course, but if people start messing with module internals they are on their own), making it easier to understand. The other effect is that it nudges me towards increased discipline. A module is a very strong declaration of intent to keep the functionality limited to one specific thing and thus counteracts the temptation to “just put that bit there” because it’s more convenient.
Of course, as usual, none of this would be necessary if we assumed a perfect genius programmer, but then that perfect genius programmer wouldn’t need Julia but could just write in assembly. In a way most of what happened in applied computer science since the 50s or so is developing cognitive crutches for imperfect humans.
I think your considerations are reasonable. But with the possibilities that Julia offers right now, a good way of thinking is:
Your modules only exist/are relevant to your own package. Thus, it might not be a big inconvenience that your main module starts with:
module MyPackage
include("./mysubmodule1.jl")
include("./mysubmodule2.jl")
...
end
where, for example, mysubmodule2.jl has
module MySubmodule2
using ..MySubModule1
...
end
without repeated includes. Maybe not ideal, but since your submodules are only relevant in the context of your package, that is not too bad. (Or use FromFile.jl, mentioned here many times.)
If, alternatively, your modules might be useful outside the context of that specific package, then make new packages of them.
Or just set LOAD_PATH to include the current directory, or use the macros shown here.