[ANN] PatModules.jl: a better module system for Julia

patrick-kidger · December 22, 2020, 5:40pm

Fair comment!

I don’t think I agree. For example I’m writing some code for a variety of neural network models. In common.jl I want to factor out MLP, and then use that as a building block in both neural_ode.jl and neural_sde.jl. The solution to this so far has been to rely on both neural_ode.jl and neural_sde.jl being included in just one place, and have that place include common.jl on their behalf… which is the kind of lack-of-dependency-tracking that I’m not a fan of.

Haha, thank you!

ToucheSir · December 22, 2020, 6:00pm

I think it’s worth unpacking the nuance behind that question. Namely, why #include is such a pain in C/C++, why Julia seems to have includes despite being a newer language, and whether the aforementioned pain points apply.

From a modules/namespacing perspective, the C compilation model has 4 main quirks:

There is only one global namespace. C++ adds lexical namespacing, and as you can imagine that helps immensely.
Declaration and implementation are split between header and c/cpp files. As I’ll address in a second, this is a minefield and practically no modern language retains this approach.
Includes are not syntax aware and can paste code anywhere. Think using an include to add the signature for a function!
The default object-based linking model necessitates inclusion of the same header in multiple locations. This dramatically increases the chance of multiple inclusions per object and thus header guards.

Now to Julia. Julia’s module system is, as far as I can tell, a near copy of Ruby’s. Contrasting to the C/C++ system:

Namespaces are pervasive and not purely lexical (modules are first class “objects”).
Declaration is almost always implementation and thus source files are included directly.
include is syntax-aware and must only introduce fully-formed constructs (types, variables and functions for Julia). No literal copying and pasting of partial functions.
Most importantly, code is extracted into external files and then included only as a space-saving measure. This is in contrast to something like PHP where common includes were/are used all the time. Conceptually, this means that reversing all includes and inlining everything into one .jl file is semantically equivalent. It also means that duplicate definitions are essentially non-existent unless someone purposefully writes them out multiple times.

Conceptually, you can think of this as a “unity build” in C/C++ projects, where source files are included directly and only limited headers are required for external dependencies. Incidentally, unity builds do not suffer from many of the #include-related pitfalls that normal C/C++ projects do, but face cultural aversion and a lack of tooling support. Neither of these concerns is present in Julia.

One point I do agree upon is tooling and discoverability. Here though, I’m not sure include is primarily to blame, but using. As an example, compare browsing through C# and Java projects on GitHub. The former only has using and thus makes finding what comes from where difficult, whereas the latter primarily uses import.

However, I’m going to pin this on the tooling and not the language. For example, looking through Python projects was a pain before the new go-to-definition functionality because of varying PYTHONPATHs. GitHub’s search functionality is also a dumpster fire on the best of days.

Thankfully you don’t have to. JuliaHub is an invaluable resource (I would say essential infrastructure at this point) even when using an IDE. For local code, there is an LSP plugin for pretty much every mainstream text editor out there.

Philosophical tangent/rant: Despite being a big fan of Vim, I believe the “UNIX is my IDE and text is all you need” crowd has set us back at least 20 years in PLT and PL tooling design. It’s nice to see a) C losing mindshare and b) fewer languages catering to that crowd.

PS: Neural CDEs are great

Roger-luo · December 22, 2020, 6:18pm

I think people have been mixing up the concept of a file and a module - they are two different things:

when one calls include, that means including the file; it should allow including the same file multiple times, because this is what include means
when one loads a module, one should be able to load the module multiple times without creating multiple definitions

The concept of module/namespace has been mixed up by people in many different languages, and for languages like Python, it is mixed up intentionally by the designers. However, they are not exactly the same thing.

I 100% agree that a better practice is not to use include at all. It is something that by definition requires the programmer, rather than the compiler, to manage files and code dependencies, which can very easily go wrong; e.g. the order of includes needs to be carefully handled, which creates an extra burden that shouldn’t be there.

Splitting the code and wrapping it into a package is a workaround, not a solution, because this feature is not implemented yet.

Quite a few of these issues have been addressed in issue 4600; this feature should be about importing, not including. I feel we are repeating in this post what has already been discussed around 4600. Maybe people want to read that discussion, and the other issues linked to it, first.

StefanKarpinski · December 23, 2020, 4:20am

The splitting of Pkg into modules is very annoying and something I should really undo at some point. They all just import each other and it adds no useful structure.

aplavin · December 23, 2020, 9:59am

Interesting… I found Pkg source reasonably understandable to navigate, and the distinction between e.g. API.test() and Operations.test() makes sense.

StefanKarpinski · December 23, 2020, 11:20pm

The API module does make sense, but the separate modules for Types, Operations, etc are just a mess. Every file has a huge stack of imports from sibling modules at the top that aren’t necessary at all because it’s not like there are name collisions. And adding new functions or types is unnecessarily annoying because the pointless module structure makes it hard to figure out a good place to put things.

Tamas_Papp · December 25, 2020, 6:27am

I think that this is a misunderstanding: modules usually include their own files, while all other code usually uses the code loading mechanism (ultimately using / import with project files).

patrick-kidger · December 25, 2020, 11:03am

I shan’t try and address all the points above, but to head off this one - there’s no misunderstanding. What you are saying is precisely the problem: include is not a good way to organise code, as it (a) enforces that the files of your module be a tree rather than a DAG; (b) demands that the parents in this tree include things on their (sub-sub-…) children’s behalf.

Tamas_Papp · December 25, 2020, 11:33am

I am not sure what you mean here, since directed trees are DAGs.

If this is a problem for you in practice, that’s usually a good sign that you should organize your code into modules, and let the loader build up the tree/DAG. This works fine.

Generally, I am afraid that you were a bit hasty to conclude that

since this is not a problem for Julia programmers in practice.

patrick-kidger · December 25, 2020, 12:11pm

Correct. But not all DAGs are directed trees.

The loader is incapable of building a general DAG. This will result in duplicate definitions.

I stand by this statement. But I can certainly tell that that’s not the prevailing sentiment here. Something which astonishes me, frankly. This isn’t even a debate in any other community.

I’ve seen the code Julia programmers write in practice. (In the various major packages.) Respectfully, it is not of the quality I would expect from a modern language that claims to have things worked out.

kevbonham · December 25, 2020, 12:23pm

I think you mean to say that it’s not organized in a way you find optimal. Surely you understand that saying the code in all the major packages you’ve looked at is low quality is not, in fact, respectful.

kevbonham · December 25, 2020, 12:54pm

If I do using Tables, and then using DataFrames, the latter of which also does using Tables, there’s no duplication of definitions, is there?

You might take this as an opportunity to evaluate some of your assumptions. Given your initial statement that you love everything about Julia except for this, I take it you recognize the care and thoughtfulness with which the language was designed. It is certainly possible that we all have blinders on and this really is a wart that needs addressing (if so, kudos for trying to address it!). But might it also be possible that there’s something you’re overlooking?

I have had countless experiences with this language running into something that seemed like a mistake, or was unintuitive, but in almost every instance, once I spent some time reading up on the subject, asking questions here or on Slack, and learning why things were the way they are, I came to appreciate the design.

I’m struck by the fact that you created your Discourse account a week ago and don’t have any other posts asking about how people organize their code, how to avoid duplicate definitions etc. I’m not everywhere on Slack and Zulip, but I don’t think I’ve ever seen you post there (we have #gripes channels, I bet you would have received good feedback there). It seems as if you looked around, judged a bunch of code you saw as low quality, then jumped on here announcing your “better way.” If nothing else, I think it’s a bit naive to assume there wouldn’t be pushback.

I think it’s great that you saw a need and tried to fill it. That’s the kind of attitude we want in this community. What I don’t think we need is someone that sees only one right way to do things, fails to solicit or listen to feedback, and insults people that work in a different way.

aplavin · December 25, 2020, 1:50pm

I’m not the OP, but one of inconveniences with the current include system is that there is literally no way to tell what are the dependencies of a specific source file. So one needs to dig through other files (GitHub search helps here, of course) in order to copy a part of the code to another project.
An obvious alternative could be to require each file to be a self-contained “module”, even without explicit top-level module declaration.

johnmyleswhite · December 25, 2020, 2:45pm

I’d suggest all participants take a break on this thread until the New Year.

StefanKarpinski · December 25, 2020, 6:25pm

Welcome, @patrick-kidger! Rather than take the opinion of StackOverflow user395760 as a given, it would be good to understand why the current design is problematic in your view. What concrete issues have you observed it causing?

cce · December 25, 2020, 6:48pm

I like that include is simply a source-code management tool (e.g. breaking a big file into a bunch of smaller related parts) that doesn’t impact system architecture. This is a feature, not a deficiency.

Tero_Frondelius · December 25, 2020, 10:32pm

This doesn’t sound like the right way of working. You should use import AwesomeModule: functiontoborrow syntax instead.

patrick-kidger · December 26, 2020, 2:57am

Okay, quite a lot to unpack here. Some quotes-with-answers deliberately out of chronological order for better presentation.

I’ll start off by apologising if I’ve come across the wrong way. I certainly don’t mean to offend anyone. Clearly I have a controversial opinion – I am trying to express disagreement without derogation.

I’ll restate that for emphasis: I absolutely don’t mean to cause offense.

Thanks for the welcome! Okay, let’s get into the meat of this.

I’m constructing a module/package/some large blob of code.
I have two files A.jl and B.jl, which depend upon some common functionality. The typical pattern is to factor this out into some other file, in my case often with an unimaginative name like utils.jl.

In order for A.jl and B.jl to see the definitions of utils.jl, they must both include("utils.jl"). This poses a problem: they cannot both perform this inclusion. Eventually both A.jl and B.jl will themselves get included somewhere, and then utils.jl has been included twice. The problem with this approach is the problem of duplication of definitions.

For example if this occurs within some module hierarchy, then we can end up with two distinct copies of the contents of utils.jl, contained within different modules. This isn’t a huge issue if utils.jl only defines pure functions, but if utils.jl defines some types, with functions dispatching based upon these types, then the copies are mutually unintelligible: you cannot dispatch to functions defined in one copy using the type defined in the other.
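To make this concrete, here is a minimal sketch rather than code from any real package. The contents of utils.jl, the Point type and the describe function are hypothetical stand-ins, and the file is written to a temporary directory only so that the snippet runs on its own.

```julia
# Hypothetical utils.jl, written to a temp dir so the sketch is self-contained.
const UTILS_PATH = joinpath(mktempdir(), "utils.jl")
write(UTILS_PATH, """
struct Point
    x::Float64
end
describe(p::Point) = "Point at x = \$(p.x)"
""")

# Two modules that each include the same file on their own.
module A
    import ..UTILS_PATH
    include(UTILS_PATH)
end

module B
    import ..UTILS_PATH
    include(UTILS_PATH)
end

# Each include created an independent copy of the type.
same_type = A.Point === B.Point           # false

# A function from one copy cannot dispatch on the type from the other.
dispatch_fails = try
    B.describe(A.Point(1.0))
    false
catch err
    err isa MethodError
end
```

Within a single copy everything still works (A.describe(A.Point(1.0)) is fine); it is only across the two copies that dispatch breaks.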

The solution is apparently to include both A.jl and B.jl in some other file, say entry_point.jl, and require that entry_point.jl will include("utils.jl") on A.jl and B.jl’s behalf. Indeed this is the standard pattern within several major projects, and I imagine the pattern that most people here are familiar with.

Unfortunately, this has its own problem: A.jl and B.jl are no longer self-contained. If A.jl wishes to use some function foobar() defined in utils.jl, then it simply uses it without qualification, trusting that it will be made available for it. This is the problem of not being self contained, which means that the dependency structure between files is not made explicit.
This implies several problems:

The code becomes harder to read, and to reason about: each file is implicitly assumed to be executed in some unspecified context.
It is harder to locate the functionality you are depending upon; as others have noted above this typically requires something like IDE support to track down.
Additional manual labour is required to ensure that entry_point.jl runs its includes in the correct order.
It becomes harder to locate old/dead code that isn’t depended upon by anything.

And moreover these issues are generally exacerbated once multiple developers are involved.
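A small sketch of the "not self-contained" problem, under stated assumptions: the file contents, the Config type and run_a are hypothetical, and anonymous modules created with Module() stand in for the context that entry_point.jl would normally provide.

```julia
# Hypothetical source files, written to a temp dir for a self-contained run.
const SRC = mktempdir()
write(joinpath(SRC, "utils.jl"), "struct Config\n    scale::Int\nend\n")
write(joinpath(SRC, "A.jl"), "run_a(c::Config) = 2 * c.scale\n")

# What entry_point.jl has to do: include utils.jl first, then A.jl.
good = Module(:GoodOrder)
Base.include(good, joinpath(SRC, "utils.jl"))
Base.include(good, joinpath(SRC, "A.jl"))
result = good.run_a(good.Config(21))      # 42

# A.jl on its own: Config is not defined in this context, so loading fails.
alone = Module(:Alone)
failed = try
    Base.include(alone, joinpath(SRC, "A.jl"))
    false
catch err
    true
end
```

Nothing inside A.jl says where Config comes from; it loads only when something else has already included utils.jl in the same module.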

I don’t think these issues are controversial – from earlier in this thread:
@oxinabox: “… It’s a fair complaint.”
@aplavin: “one of inconveniences with the current include system is that there is literally no way to tell what are the dependencies of a specific source file”
(If either of you feel I’m misrepresenting your point of view here then do please let me know and I’ll take it out.)

So whilst the limitations of this approach are to some degree manageable, they are limitations, and ones with increasing bite as project size grows. It is not overstating my position to say that I think this is the single biggest limitation to work around when using the Julia language; at least that I’m aware of.

As an explicit example, try having a look through the source code for PyTorch. The Python bits (which follow the first pattern) are generally easy to follow. The C++ bits (which follow something akin to the second pattern) are generally difficult to follow.

Do note that ultimately this is all an issue about handling files – not modules, nor packages. (Despite the title of this thread – the focus on modules has been because they can be used as a potential solution.)

So what is the solution? (Beyond just putting up with it.) As far as I can tell, until now there hasn’t been one. PatModules.jl is one (deliberately simple) approach, but not one that I’m particularly wedded to. I think if a solution to this problem made its way into the language as a whole I’d probably advocate for a different, more sophisticated option. But I shan’t get into that now – let’s focus on establishing whether there is an issue or not first.

Does that all make sense? What are your thoughts?

Correct – because both are installed as packages. (In this scenario Julia keeps a global reference of all imported packages and re-uses them if possible.) This discussion / my point is focused solely on the construction of a single package (or more generally some complicated blob of code), and ways to split code across multiple files when doing so.
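For contrast, a minimal sketch of the package case. It uses the Dates standard library purely as a convenient example of a package; the module names First and Second are hypothetical.

```julia
# Two unrelated modules load the same package independently.
module First
    import Dates
end

module Second
    import Dates
end

# The package is loaded once and shared, so both names refer to one module.
shared = First.Dates === Second.Dates     # true
```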

Quite possibly I am wrong. I haven’t been convinced otherwise yet, but I promise you I am reading every reply, and trying not to be a zealot about anything.

I spent a fair bit of time searching around looking at existing solutions to this problem, and existing thoughts on how things may be improved:

Current recommended best practice 1
Current recommended best practice 2
Current way of performing relative imports
An example of what is done in existing major packages
A comparison to C++ (a language with the same basic issue)
#4600: a potential change, but not really a fix

With the general overview being that (a) the problem exists, (b) it has already been acknowledged, but (c) there are at present no good solutions.

Phew, that was a long post. Thank you to those that read it in its entirety.

PS: And since I didn’t comment on it earlier:

Thank you! It’s very flattering to be recognised “in the wild”.

PetrKryslUCSD · December 26, 2020, 3:54am

I appreciate your thoughts. However, the whole thing is a bit abstract.

A and B depend on some functionality in utils. If that is something of interest in both, it could become a module that both should “use” (I often wish the keyword was use, not using ;-)). Including is not a very nice approach for allowing access to common functionality.
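A rough sketch of that arrangement, assuming everything lives in one file for brevity. The names Utils, Config, foobar, run_a and run_b are hypothetical.

```julia
# The shared functionality becomes a module, defined exactly once.
module Utils
    export Config, foobar
    struct Config
        scale::Int
    end
    foobar(c::Config) = 2 * c.scale
end

# A and B both "use" it via relative imports and state what they need.
module A
    using ..Utils: Config, foobar
    run_a() = foobar(Config(1)) + 1
end

module B
    using ..Utils: Config, foobar
    run_b(c::Config) = foobar(c) + 2
end

# Both modules see the same type, so values pass freely between them.
same_type = A.Config === B.Config         # true
value = B.run_b(A.Config(20))             # 2 * 20 + 2 == 42
```

In a real package Utils, A and B would typically sit in separate files that the top-level module includes once each, with Utils first.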

I think a concrete “for example” would be helpful.

I for one think that your foray into Julia esoterics is very interesting. Keep going! And, as some others already said, welcome!

ChrisRackauckas · December 26, 2020, 5:09am

I tried to talk about an example in the repo but I failed to even find the code…

github.com

pytorch/pytorch/blob/main/torch/linalg/__init__.py

import sys

import torch
from torch._C import _add_docstr, _linalg  # type: ignore[attr-defined]

LinAlgError = torch._C._LinAlgError  # type: ignore[attr-defined]

Tensor = torch.Tensor

common_notes = {
    "experimental_warning": """This function is "experimental" and it may change in a future PyTorch release.""",
    "sync_note": "When inputs are on a CUDA device, this function synchronizes that device with the CPU.",
    "sync_note_ex": r"When the inputs are on a CUDA device, this function synchronizes only when :attr:`check_errors`\ `= True`.",
    "sync_note_has_ex": ("When inputs are on a CUDA device, this function synchronizes that device with the CPU. "
                         "For a version of this function that does not synchronize, see :func:`{}`.")
}


# Note: This not only adds doc strings for functions in the linalg namespace, but
# also connects the torch.linalg Python namespace to the torch._C._linalg builtins.

This file has been truncated.

What, you go to the linalg module and there’s no source code there?

And the reason why the code is so hard to read is precisely because it uses this nonlinear go-to architecture. IMO, everything should have a clear top level so the code is linear and can be read like a book, while nonlinear reading should be helped by tools (in any language). The problem with PyTorch is it doesn’t read like a book: there is no table of contents telling you what comes after another. There is no flow. You have to already understand the code in order to understand it since new code can come in from anywhere. You pick a file and try reading it, and… follow hyperlinks until you think you understand things? Well if you go to torch.linalg you don’t even find the linear algebra so good luck! (hint: it’s all global as we will see later, violating the very module rules that were argued for in the first place).

The style in OrdinaryDiffEq is linear. Here is your table of contents:

github.com

SciML/OrdinaryDiffEq.jl/blob/master/src/OrdinaryDiffEq.jl#L1

"""
$(DocStringExtensions.README)
"""
module OrdinaryDiffEq

if isdefined(Base, :Experimental) &&
   isdefined(Base.Experimental, Symbol("@max_methods"))
    @eval Base.Experimental.@max_methods 1
end

using DocStringExtensions

(Note, this holds for every Julia package!) It tells you exactly what comes in, the chapters in what order, and it also has the exports to tell you what will come next (it should use more import instead of using, but that’s a separate matter). You can read this start to finish in its intended order and nothing you don’t know will jump out at you. And actually, by design this has to be legible or else you get an error! So no go-to style of code design, instead you have one canonical way to understand the code. You could use other tools as an appendix to jump around, sure, but if you want to understand the logic you can always go back to the story.

There will always be people who prefer coding with go-tos, with a bunch of globals, and a bunch of dynamic scopes, but I think time has told us again and again that making things simple and making things constrained always is helpful sooner or later. And programming styles which make people want to just append everything to globals and import as * are (a) hard for people and (b) hard for tools.

So in a simplified sense, I think this whole discussion is phrased incorrectly. It should be understood as, “here’s a way to make using go-tos easier, so that as code flies in from left and right you can try and make sense of the random assortment of globals!”. But the real question to ask is, “have you tried making your code read linearly and reading code linearly?”. Because you skipped the chapters on the caches and then complained that we Game of Thrones’d the ending, and then from the Cliff Notes wrote a final essay saying that the characters were undeveloped.
