State of machine learning in Julia

I think all of your comments were fair (although I don’t agree with everything; for example, you need to opt into @view because Julia tends to be safe by default, and returning a view when slicing would not be), but I don’t quite get this point:

Why is that a problem? A[1] accesses the element with index 1, and A[1, :] accesses the row with index 1. I fail to see what the problem is.

11 Likes

Thanks!

That’s safe-by-default for functionality but not safe-by-default for speed.

Pragmatically speaking, I find that taking views is very common, whilst making a copy is unusual.
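A minimal sketch of the copy-versus-view distinction being debated here, written with Python’s standard library rather than Julia or NumPy. The buffer contents and variable names are illustrative assumptions. The only point is that a plain slice allocates and copies (safe, slower), while a view aliases the original storage (cheap, but writes leak through).

```python
from array import array

# A flat buffer of eight doubles standing in for an array.
data = array("d", range(8))

# Slicing the array allocates a new array and copies the elements,
# analogous to a plain slice such as A[1, :] in Julia.
copied = data[2:6]
copied[0] = -1.0          # does not touch `data`

# Slicing a memoryview shares the buffer, so nothing is copied,
# analogous to opting in with @view in Julia.
view = memoryview(data)[2:6]
view[1] = -2.0            # writes through to data[3]

print(list(data))
print(list(copied))
```

After running this, `data[2]` is unchanged while `data[3]` has been overwritten through the view, which is exactly the aliasing that the safe-by-default argument is about.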

Special syntax for implicitly re-striding an array just seems a little odd to me. I don’t think I’ve ever done it deliberately in either Python or Julia.

6 Likes

On the contrary I always get a feeling of unease when leaving out indices in numpy, although I’ll admit that I sometimes like how the code turns out. One thing I dislike about it in numpy is that it seems kind of arbitrary why it means the same as A[1, :] rather than A[:, 1]. If the answer is that it’s natural based on the (default) strides, notice that it then would be natural to have the opposite meaning in Julia, which probably would drive everybody nuts. (Obviously it’s not possible to change this without removing the linear indexing feature from Julia.)
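The strides argument can be made concrete with a small sketch. The helper names `strides` and `contiguous_slice_axis` are hypothetical, and the code only models dense layouts: `"row"` stands for NumPy’s default (C order) and `"col"` for Julia’s column-major layout.

```python
def strides(shape, order):
    """Element strides of a dense array; order is 'row' (NumPy default) or 'col' (Julia)."""
    axes = list(range(len(shape)))
    if order == "row":
        axes.reverse()            # last axis varies fastest
    elif order != "col":
        raise ValueError("order must be 'row' or 'col'")
    out = [0] * len(shape)
    step = 1
    for axis in axes:
        out[axis] = step
        step *= shape[axis]
    return tuple(out)


def contiguous_slice_axis(shape, order):
    """Axis to fix with a single index so the remaining elements form one contiguous block."""
    s = strides(shape, order)
    return s.index(max(s))


print(strides((3, 4), "row"), contiguous_slice_axis((3, 4), "row"))
print(strides((3, 4), "col"), contiguous_slice_axis((3, 4), "col"))
```

In the row-major case the contiguous block comes from fixing the first axis (a row, like A[1, :]); in the column-major case it comes from fixing the last axis (a column, like A[:, 1]). That is the sense in which the "natural" meaning of a single index would flip between the two languages.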

9 Likes

I guess you’re not particularly eager to get stuck on this small side issue, but it seems like Julia has completely general syntax (with a very consistent rule about the dimensionality of the indices vs the output), while the Python syntax seems special and odd.

Otherwise, thanks for an interesting post.

7 Likes

For what it’s worth I’m fairly certain that that numpy syntax is internally consistent as well.

5 Likes

It is used in a fair amount of Julia code. I’ve used it. It’s handy, and it’s clear what’s happening.

But, obviously, this is the least important of your criticisms.

I am trying to find a robust way to deploy Julia inside a Python project. People have put a ton of work into Julia-Python interop in general. But the parts closer to deployment are not there yet. It would be a great place for people to lend a hand. (This has been known for a long time.) But it has to compete with the other pressing issues you mentioned.

5 Likes

@ChrisRackauckas on the topic of machine learning and E-Graphs, how do you view E-Graphs in comparison to the work of the PyTorch developers on TorchDynamo? While E-Graphs do seem to have wider-reaching goals than TorchDynamo, especially for SciML, I would be intrigued to hear how you see the two of them matching up.

Could E-Graphs fulfill a role for the Julia ML ecosystem similar to that of TorchDynamo?

2 Likes

It’s not the same as or similar to the E-graph; instead it’s similar to the interfaces the E-graphs are acting on. Maybe the easiest way to describe it is by saying what is the same or similar. The Python bytecode is like “the Julia IR”. Of course, with an optimizing compiler, there isn’t a singular IR; instead there are stages: the untyped IR, the typed IR, and the LLVM IR. Cassette and IRTools, the tools on which Zygote.jl was built (some notable others are AutoPreallocation.jl, SparsityDetection.jl, etc.), are probably the most similar to TorchDynamo in that they take untyped syntactic IR and transform it into another untyped syntactic IR.

It turns out that for Julia this was a bad idea because (a) the meaning of code can depend (heavily) on types, and (b) this is before compiler optimizations, and so mixing compiler optimizations with automatic differentiation is impossible. Thus Julia v1.7 added an AbstractInterpreter interface to Julia Base itself for acting on typed IR, which is then used by packages like EscapeAnalysis.jl and Diffractor.jl to write compiler passes on typed IR. And of course LLVM IR has standard interpretation techniques along with Enzyme.jl which is an AD written on LLVM IR.

So TorchDynamo is probably most similar to Cassette/IRTools, but you could also say it’s like AbstractInterpreter in that it’s acting on “the true IR of Python”, where the true IR of Julia is typed when it has all of its information while in Python it is not. But this story is why Zygote has its compile-time issues, higher order AD issues, and why all of the tooling is moving to not just a new AD tool but an entirely different IR target and compiler tool stack (note this doesn’t imply that will happen to TorchDynamo, unless they start rewriting their AD to be source-to-source on Python bytecode, but there’s precedent of that in tangent which didn’t find a nice home). Note that these tools aren’t just for AD. For example, there are PRs to Julia’s Base which are automatically analyzing loops and removing repeated allocations of immutable arrays where they are written using the AbstractInterpreter compiler plugin interface.

So that still doesn’t answer how the heck E-graphs come into the story, because I haven’t described how you write a compiler pass. It doesn’t matter what level of IR you’re on; it’s basically just a function IR->IR. So where in their blog post they say “just add code here”:

from typing import Callable

import torch
import torchdynamo

def custom_compiler(graph: torch.fx.GraphModule) -> Callable:
    # do cool compiler optimizations here
    return graph.forward

with torchdynamo.optimize(custom_compiler):
    # any PyTorch code
    # custom_compiler() is called to optimize extracted fragments
    # should reach a fixed point where nothing new is compiled
    ...

# Optionally:
with torchdynamo.run():
    # any PyTorch code
    # previously compiled artifacts are reused
    # provides a quiescence guarantee, without compiles
    ...

Well, that’s true in any of these systems, just like in macros. But if you’ve ever written a macro, you’ll know that walking expression graphs is tedious to get right. Wouldn’t it be nice if compiler optimizations for mathematical ideas could be expressed mathematically, and the associated compiler pass could be generated? It turns out that all Symbolics tooling really is, is tooling that performs rewrites on some IR. So Symbolics.jl has an IR that uses SymbolicUtils.jl’s rewriters and MetaTheory.jl’s E-graphs to transform symbolic IR → symbolic IR, but what we have done is make those rewrite tools generic to the IR, and boom, now it’s a compiler optimization pass generator.
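To make "a pass generated from mathematical rules" concrete, here is a small Python sketch. It is not the SymbolicUtils.jl or MetaTheory.jl API, and it is not an e-graph: it applies rules greedily until nothing changes, whereas equality saturation keeps all equivalent forms and extracts the best one. The tuple-based IR, the `?name` pattern variables, and the function names are all assumptions made for illustration.

```python
def match(pattern, expr, env):
    """Match expr against pattern; strings starting with '?' are pattern variables."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in env:
            return env if env[pattern] == expr else None
        return {**env, pattern: expr}
    if isinstance(pattern, tuple):
        if not isinstance(expr, tuple) or len(pattern) != len(expr):
            return None
        for p, e in zip(pattern, expr):
            env = match(p, e, env)
            if env is None:
                return None
        return env
    return env if pattern == expr else None


def substitute(template, env):
    """Fill pattern variables in template with their bindings from env."""
    if isinstance(template, tuple):
        return tuple(substitute(t, env) for t in template)
    return env.get(template, template)


# The "mathematical" part: each rule is just an equality read left to right.
RULES = [
    (("*", "?a", 1), "?a"),
    (("*", "?a", 0), 0),
    (("+", "?a", 0), "?a"),
    (("transpose", ("transpose", "?a")), "?a"),
]


def rewrite_once(expr):
    """Rewrite children first, then try every rule at this node."""
    if isinstance(expr, tuple):
        expr = tuple(rewrite_once(e) for e in expr)
    for lhs, rhs in RULES:
        env = match(lhs, expr, {})
        if env is not None:
            return substitute(rhs, env)
    return expr


def simplify(expr):
    """The generated 'pass': an IR -> IR function that runs to a fixed point."""
    while True:
        new = rewrite_once(expr)
        if new == expr:
            return expr
        expr = new


ir = ("+", ("*", ("transpose", ("transpose", "A")), 1), ("*", "B", 0))
print(simplify(ir))
```

The person adding an optimization only edits `RULES`; the traversal code stays untouched, which is the division of labour the post is describing.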

That means you can, say, define an E-graph that acts on Julia typed IR and spits out the typed IR with the desired simplifications described mathematically. This is what we mean by “democratization of writing compiler passes”: we are trying to use this to build a system so that people who want to add a new linear algebra simplification pass to the Julia typed IR do not need to learn all of the details of the AbstractInterpreter and Julia typed IR definition, and instead just write a few mathematical equalities, and boom, it generates a compiler pass which then generates the transformed IR. So think of the E-graphs as replacing this requirement that someone writes a function like def custom_compiler(graph: torch.fx.GraphModule) -> Callable: that digs through some expression graph. Instead you just write

Man, this came out longer than expected. But since it describes why Zygote is being replaced with Diffractor and Enzyme I guess it’s a useful description for many other reasons than the original question :sweat_smile:

16 Likes

Thanks for the shout-out on the CTPG paper! I whole-heartedly agree that the paper would not have been possible without Julia and DifferentialEquations.jl. It’s cool to see what can be done when you break out of the usual static-graph mode. Thanks and kudos to @ChrisRackauckas!

I work in ML, largely in Python, but I have a soft spot for Julia as well. I think @patrick-kidger’s response summarizes things very well. I’ll just chip in a few of my own experiences/thoughts:

What does ML even mean?

There are so many different types of models/problems/architectures these days that it’s worth pointing out that there’s a big difference between “conventional deep learning” – transformers, convnets, large models – and “other” more obscure models – differentiable physics, neural ODEs, implicit models, etc. So far I think Julia is doing better in the “other” category.

It’s worth noting that the requirements for “conventional” vs “other” can be drastically different. Everything from model parallelism to compute architecture to float32 vs float64.

Speed

Compilation speed is entirely irrelevant (cf. jax). What matters at the end of the day is iterations/second. Right now, JAX/XLA seem to have Julia beat in the “conventional large model” space since they have all kinds of optimizations for linear algebra, specific kernels, and TPUs. At this point just about every last drop of performance has been squeezed out of pytorch/TF/jax in the “conventional” large models space.

That being said, I am extremely bullish on the MetaTheory.jl line of work with e-graph based optimization. Ultimately I think this is a superior design to anything in the competition. But the devil will be in the details of making it production-ready, esp. on GPUs/TPUs.

Correctness

Like @patrick-kidger, I have been bitten by incorrect gradient bugs in Zygote/ReverseDiff.jl. This cost me weeks of my life and has thoroughly shaken my confidence in the entire Julia AD landscape. As a result I now avoid using Julia AD frameworks if I can. At minimum, I cross-check all of their results against JAX… at which point I might as well just use the JAX implementation. (Excited to check out Diffractor.jl when it’s ready though!)

In all my years of working with PyTorch/TF/JAX I have not once encountered an incorrect gradient bug.
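A cross-check does not have to involve a second AD framework. A minimal sketch, assuming a scalar loss over a short list of floats: compare the gradient a library claims against a central finite-difference estimate. The function names, tolerances, and the toy `loss` are illustrative assumptions, not part of any package mentioned in this thread.

```python
import math


def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at x (a list of floats)."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def check_grad(f, claimed_grad, x, rtol=1e-4, atol=1e-6):
    """True when every component of claimed_grad(x) agrees with the numerical estimate."""
    reference = numerical_grad(f, x)
    return all(
        math.isclose(c, r, rel_tol=rtol, abs_tol=atol)
        for c, r in zip(claimed_grad(x), reference)
    )


def loss(x):
    return math.sin(x[0]) * x[1] ** 2


def good_grad(x):
    # Hand-derived gradient standing in for a correct AD result.
    return [math.cos(x[0]) * x[1] ** 2, 2 * math.sin(x[0]) * x[1]]


def silent_zero_grad(x):
    # Stand-in for the failure mode where an AD returns zeros without erroring.
    return [0.0 for _ in x]


point = [0.3, 1.7]
print(check_grad(loss, good_grad, point))
print(check_grad(loss, silent_zero_grad, point))
```

Finite differences cost one pair of function evaluations per input, so this only scales to spot checks on small inputs, but it catches silently wrong gradients cheaply.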

Ecosystem and library scope

I really wish there was just something like JAX in Julia. Flux.jl is too high-level for me most of the time. Zygote is often too low-level. I like the idea of source-to-source AD though. Maybe we just need new frameworks on top of Zygote/Diffractor to spring up? I don’t know. I expect that solutions here will emerge naturally as more investment is made in ML in Julia and people bump into the limitations of existing tooling…

I’m optimistic for the future of ML in Julia. I really am. For me personally, it’s not ready for what I need it to do just yet. But I’m optimistic that this may change over time.

27 Likes

Thanks for the very thoughtful post Patrick, and nice to see you around. Some thoughts on the above:

I’d argue this applies to most non-science/numerics projects and ~20-30% of scientific python code, but there is a long, long tail of projects that use no linting or any kind of static analysis. These projects tend to make many of the same faux pas you mentioned.

I think this speaks to three things:

  1. The benefits of centralization in the Python ecosystem. A majority of users doing data-y stuff can get away with that top 20-30%.
  2. The relative size/resourcing of both ecosystems. Code quality may well be better for Julia and Python codebases of the same popularity, but if we look at relative in-ecosystem popularity then your point probably holds. I don’t have a good intuition on how much we should weight those two disparate perspectives. For example, the aforementioned exemplary Python projects have multiple magnitudes more engineering time/money/infra to work with, and trying to replicate their quality without those is rather unrealistic.
  3. A need to get more automated tooling running in the Julia ecosystem. I want to say DocumentLint has been used in CI, but it’s still primarily an IDE thing. JET.jl leaves basically every Python type checker in the dust, but its relative novelty means that adoption is still low.

This is possible at runtime with libraries like NamedDims.jl (invenia/NamedDims.jl on GitHub, for working with dimensions of arrays by name) and potentially statically with JET + named array libraries. The biggest challenge I see (one you’re likely familiar with from developing torchtyping) is adoption + standardization. Guido has been leading an effort on the Python side, so seriously exploring avenues like https://twitter.com/KenoFischer/status/1407810981338796035 could be fruitful here.

6 Likes

Thanks @Samuel_Ainsworth and @patrick-kidger for your frank thoughts! It’s really important to get this sort of feedback.

How long ago were both of you getting incorrect gradients? Were these errors on recent versions of Zygote? After the ChainRules switch?

13 Likes

If you’ll allow me to start from the end first:

This is not well documented and ought to be so (PRs welcome if anyone is interested), but Flux is really an amalgamation of different sub libraries:

  1. An AD (Zygote)
  2. A set of ML kernels (NNlib)
  3. A module system (Functors.jl)
  4. A set of layers, optimizers and training loop helpers (Flux itself)

Using just #1 and #2 is equivalent to torch.nn.functional. Using 1, 2, and 3 gets you a JAX equivalent. The plan is to move optimizers out from #4 into a separate package (see Optimisers.jl) so that you can use it just like JAX users use Optax. This kind of shared infrastructure is already being exploited in the ecosystem: Knet uses NNlib (NNlib dev is a collaboration between Knet and Flux) and offers a “lower level” interface you may be interested in, while Avalon.jl uses NNlib + Functors for a more PyTorch-esque framework.
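The "module system" piece is the least obvious of the four, so here is a rough Python sketch of the idea: a function that walks a nested parameter structure and applies an update leaf by leaf, in the spirit of `fmap` in Functors.jl or `tree_map` in JAX. The `tree_map` and `sgd_step` below are hand-written stand-ins with assumed names and a toy parameter layout, not either library’s implementation.

```python
def tree_map(fn, *trees):
    """Apply fn leaf-wise to one or more identically structured nested dicts/lists."""
    first = trees[0]
    if isinstance(first, dict):
        return {k: tree_map(fn, *(t[k] for t in trees)) for k in first}
    if isinstance(first, (list, tuple)):
        return type(first)(tree_map(fn, *parts) for parts in zip(*trees))
    return fn(*trees)


def sgd_step(params, grads, lr=0.1):
    """An optimizer written once against the tree structure, not against any layer type."""
    return tree_map(lambda p, g: p - lr * g, params, grads)


params = {"dense": {"w": [1.0, 2.0], "b": 0.5}}
grads = {"dense": {"w": [2.0, 4.0], "b": 1.0}}
new_params = sgd_step(params, grads, lr=0.5)
print(new_params)
```

Once parameters are "just a tree", the AD, the kernels, and the optimizer can live in separate packages and still compose, which is the layering described above.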

Now to the broader, more philosophical point. I also use 100% Python for my own work, and the dynamics/motivation there are very similar to what you’ve described. Though not worded particularly pleasantly, I think this HN comment summarizes the struggle well:

My impression from your comment is that you don’t care that much about “standard” ML users. As a “standard” ML user (pytorch/jax), and a potential Julia user in the future, this is not what I like to hear.

Now, there have been some very good points made here and on different forums that trying to take the Python ML juggernaut on in its own territory is at best aspirational (E: after reading Chris’ response, the original more forceful “fool’s errand” would’ve been more appropriate :stuck_out_tongue: ). What I don’t think has happened is saying the “quiet part out loud” following the logical conclusion of that. Of course the Julia community is not a monolith and there will be divergent opinions on how to approach ecosystem development, but folks like the aforementioned HN commenter are looking for a clearer statement. That is, where do we fall between the two extremes of “novel architectures/approaches are the only way to go, if they do it well then we shouldn’t bother” and “Julia ML should be #1 on everything”? And depending on the vision, what are some concrete steps that can be taken to support it?

Edit: to make sure I’m not underselling or misrepresenting things, there are some great and very clear roadmaps for parts of the ML space already. SciML and advanced AD come to mind. The question above is about the complement: what should be put into the “don’t expect anything big here unless you’re willing to help develop or fund it” bucket?

6 Likes

+1 to calling it “conventional” ML (or some other name), since there is already an important programming language called Standard ML (meta-language) that Julia packages take features from.

1 Like

Yes, one thing to mention is that the Julia community is large and not a monolith and so there are many people developing these tools, all with their own reasons and aspirations. While there are some institutions that tend to have more of the developers for AD and ML libraries (specifically the Julia Lab and Julia Computing), those entities are large and not monoliths themselves. Even at the Julia Lab, I have no control over why people work on these problems, rather I just work with the students and research software engineers to guide them towards successful projects. Many people are doing it as ML for ML’s sake, and that’s fine.

But I think everyone should just be honest and clear as to some of the technical aspects and how they relate to the higher level decisions that have developed such large labs around this topic.

trying to take the Python ML juggernaut on in its own territory is at best aspirational

No, that’s an understatement. Let’s make it absolutely clear: there is nothing in the technical approach of differentiable programming that will make “conventional ML” faster. Period. A perfect Zygote or Diffractor will not make matrix multiplication kernels faster, it will not make convolutional kernels faster, and it will not make Transformer kernels faster. For large “big data” conventional machine learning, calls to the kernels are on the order of tens to hundreds of seconds. The AD overhead of a slow AD like PyTorch or even just AutoGrad is in the milliseconds per operation. A source-to-source AD that cuts that down to close to zero is not getting even a 1% gain in those applications. Source-to-source AD is a much larger and harder project, which buys applicability to fully dynamic code and lower overhead (+ JIT compilation of all reverse paths) at the cost of a lot of added complexity. Conventional ML models like transformers do not use this dynamism. Those models do not have to worry about this overhead. The current AD work will not magically some day give you something that will be compelling enough to make conventional ML users pack up and switch from Python. If that was the purpose of those projects, then those projects would be an extremely dumb idea. Why build a brand new multi-million dollar stadium for your kid’s elementary school football team? It’s not a fit-for-purpose idea, and it will actually hold the Julia ecosystem back for a bit in this domain because of the added complexity.
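The arithmetic behind the "not even a 1% gain" claim is short enough to write out. The 10-second kernel comes from the low end of the range quoted above; the 1 ms overhead and the 1 µs "small kernel" are assumed round numbers chosen only to show the two regimes.

```python
def ad_overhead_fraction(kernel_seconds, overhead_seconds):
    """Fraction of one operation's wall time spent in AD bookkeeping rather than in the kernel."""
    return overhead_seconds / (kernel_seconds + overhead_seconds)


# Large-kernel regime: a 10 s kernel call with 1 ms of AD overhead (assumed figure).
large_kernel = ad_overhead_fraction(kernel_seconds=10.0, overhead_seconds=1e-3)

# Small-kernel regime: a 1 microsecond scalar operation with the same overhead (assumed figure).
small_kernel = ad_overhead_fraction(kernel_seconds=1e-6, overhead_seconds=1e-3)

print(f"large kernels: {large_kernel:.4%} of time is AD overhead")
print(f"small kernels: {small_kernel:.4%} of time is AD overhead")
```

Removing the overhead entirely can only recover that fraction, so under these assumptions it is worth a small fraction of a percent for large kernels and nearly everything for loopy scalar code.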

Maybe having full language support will make some ergonomic gains, like it will integrate with the profiler and debuggers better than DSLs generally do, and if someone happens to write a model in the “wrong” way it could play nicer than say something like Jax where if you write something that isn’t functional and pure :man_shrugging: incorrect gradients can occur. But we’re talking minor gains at the end of the day for those applications.

But let’s dig even deeper. Zygote’s purpose was to not unroll loops so that the AD could JIT compile loopy code with small kernels. That’s a very nice improvement for domains that need loopy code with small kernels. You can expect some pretty good performance gains, and you should choose Zygote if that’s your domain. Conventional ML is not in that domain. :man_shrugging: sorry. This whole emerging SciML domain happened to fit that domain, and that’s how it found a home there, which launched the organization and such. With that lens, it should be no surprise that in conventional ML Julia did not capture the whole audience whereas in SciML it became a big chunk of the (still rather small) field. It’s not random, and it’s not just sweat and grit; there are real technical reasons behind it that you shouldn’t just gloss over.
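What "unrolling loops" means for a tracing AD can be shown with a toy tape. This sketch only illustrates the tracing side; it is not how Zygote, PyTorch, or any other library mentioned here is implemented, and the `Tape`/`Var` classes and the single `scale` operation are assumptions made to keep it short.

```python
class Tape:
    def __init__(self):
        # One (output, input, local_derivative) entry per executed operation.
        self.ops = []


class Var:
    def __init__(self, value, tape):
        self.value, self.tape, self.grad = value, tape, 0.0

    def scale(self, c):
        """y = c * x, recorded on the tape so it can be reversed later."""
        out = Var(self.value * c, self.tape)
        self.tape.ops.append((out, self, c))
        return out


def grad_of_loop(x0, n_steps, c=1.5):
    """Differentiate y = c**n_steps * x by tracing the loop, then replaying the tape backwards."""
    tape = Tape()
    x = Var(x0, tape)
    y = x
    for _ in range(n_steps):      # the tracer never sees a loop, only its iterations
        y = y.scale(c)
    y.grad = 1.0
    for out, inp, d in reversed(tape.ops):
        inp.grad += out.grad * d
    return x.grad, len(tape.ops)


print(grad_of_loop(1.0, 3, c=2.0))
print(grad_of_loop(1.0, 1000, c=1.0))
```

The tape grows by one entry per iteration, so per-operation bookkeeping dominates when each operation is tiny. A source-to-source tool instead differentiates the loop as code, which is the case the post says Zygote was designed for.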

Diffractor.jl’s driving emphasis was a category theoretic formulation for higher order derivatives. That gives you some massive speedups if you’re calculating third or fourth derivatives. But in conventional ML, who’s doing that? People don’t take Hessians of neural networks, let alone anything higher. Yes, there will be some spillover effects for how this improves conventional ML cases because of changing the target towards typed IR (potential compile-time improvements, maintainability, etc.). But flipping the Diffractor switch won’t be a day where Flux is suddenly a whole lot better for conventional ML. The reason for this kind of tool is applications like physics-informed neural networks which routinely take 3rd order derivatives and above. That’s the kind of application that funded it (specifically for use in NeuralPDE.jl). That’s a growing field, enough so that the NVIDIA CEO keeps mentioning physics-informed neural networks, and that’s an area where this kind of tool will make a substantially noticeable difference. But that’s not NLP or image processing with convnets and transformers. For those cases, Diffractor would be a very hard project for little gain; it would make no sense. If the purpose of Diffractor was those domains, it would be a bad idea.
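For readers who have never taken a third derivative with AD, here is a naive Python sketch that does it by nesting forward-mode dual numbers. This is not Diffractor’s approach; it is the simple baseline, and the `Dual` class and `derivative` helper are assumptions written for this example. In this implementation each nesting level turns one multiplication into three at the level below, which hints at why higher orders get expensive when done naively.

```python
class Dual:
    """val + eps_coeff * e with e**2 == 0; both parts may themselves be Duals."""

    def __init__(self, val, eps):
        self.val, self.eps = val, eps

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val + other.val, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        # Product rule: (a + b e)(c + d e) = ac + (ad + bc) e
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

    __rmul__ = __mul__


def derivative(f):
    """Forward-mode derivative of a scalar function; can be applied repeatedly."""
    return lambda x: f(Dual(x, 1.0)).eps


def f(x):
    return x * x * x * x          # f(x) = x**4, so f'''(x) = 24 * x


third = derivative(derivative(derivative(f)))
print(derivative(f)(2.0), third(2.0))
```

This sketch is only safe for functions that do not mix values from different nesting levels, a known pitfall of nested duals that production tools have to handle properly.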

So let’s refocus a little. Let’s say your goal is to improve conventional ML. How would you do it? Here’s a few things that come to mind for me:

  1. You could focus a project on conventional ML researchers by making it easier to develop faster kernels. This would help people out of the “ML is stuck in a rut” problem where better ideas can be slower than worse ideas simply because of how much the standard kernels have been optimized. If you want to do this, you should develop an AD that is really good at differentiating compute kernels. Zygote and Diffractor are not the tools for this; Enzyme.jl is. See the paper for generating adjoints of GPU kernels as an example. Or you could develop tools like LoopVectorization.jl that are instead targeted to GPUs (see KernelAbstractions.jl).
  2. You could focus a project on making it easier to capture more high level kernel fusions to optimize the kernel-centric code. That’s the e-graphs projects, and that’s what the folks at Google are doing with XLA. That’s what MLIR is aiming to do.
  3. You could focus a project on making it easier to do distributed multi-GPU training. The ergonomics here are still rather difficult, even with TensorFlow/XLA. Think easy installation and running it on local compute clusters. DaggerFlux is probably the closest project we have to this other than XLA.jl.
  4. You could focus on writing faster GPU kernels for specific tasks.
  5. You could make packages with experimental APIs to improve the ergonomics of conventional training workflows. Integrate some automation in there. Automatic MLops? ML libraries without implicit global parameter references?
  6. You could, instead of waiting for Zygote and Diffractor to be “complete”, skip ahead and do ML on small DSLs. DSLs will always be easier to optimize given their constrained nature. Yota.jl is a great example of this. It uses a tracer, Ghost.jl, to get a simpler IR and does some nice things on that.

Noticeably absent from that list are the current ADs and differentiable programming work. That will do almost nothing for the conventional ML domain except maybe, just maybe, a few ergonomic improvements when everything works out. There are much better projects to work on if conventional ML was the goal. But for me and large parts of the Julia Lab, conventional ML is not the goal, which is why there is so much work and publications in differentiable programming tools. Hopefully this line of reasoning makes it as clear as daylight.

44 Likes

ML in Julia has a bright future, and is currently very strong in certain areas. I am constantly impressed by the intelligence of those working in the Julia AD space. Everything is possible in Julia. In fact everything is trivial in Julia, if you are very clever. This is the current problem:

ML in Julia requires high existing knowledge or a lot of time searching/doing trial and error.

Previous answers have discussed the lack of technical limitations to moving on par with PyTorch/Jax for general deep learning, but there are other important factors that drive adoption: thorough documentation, blogs, useful error messages, stability, and a vague feeling of “trust”. It can be tempting to think that these things follow naturally from the technical possibilities, but they are often driven by that special type of contributor who prefers taking off the rough edges to adding new ones.

Getting the Julia ML ecosystem to the scale of PyTorch requires drawing in these types of contributions, and without the luxury of big tech support or resources. As others have said, it’s not the primary focus of many Julia developers (myself included).

On a personal level I am currently in the wild woods of developing novel differentiable algorithms in Julia. I love it, but it has come with a constant stream of cryptic errors, lack of features and incorrect gradients. Sometimes I long for the warm blanket that is PyTorch.

Julia should seek to become that warm blanket. It already has for general scientific programming.

22 Likes

I would like to give my 2c about the topic.

First, I will speak from my point of view considering both typical Machine Learning, dominated by scikit-learn in Python, and Deep Learning, dominated by TensorFlow and PyTorch, and the more recent Flax.

  1. Where does ML in Julia shine today?

Well, it is a difficult question because it depends on the ecosystem you compare against (scikit-learn, TensorFlow, PyTorch, caret in R, …).

In my opinion, Julia shines in the fact that the packages/libraries do not have to implement too much themselves, which encourages compatibility between them. For instance, MLJFlux, or the fact that all of them can work not only with DataFrames but with other structures that implement the Tables.jl interface.

It does not shine in speed because, in my opinion, TensorFlow, PyTorch, Scikit-learn, and JAX are more mature libraries and their implementations use C/C++ and GPUs. However, it can be more flexible, because everything is in the same language.

Also, it shines in the simplicity of the implementations, you can read Flux, for instance, and understand a lot of it.

  2. The Julia ML ecosystem is currently inferior in features. For instance, in ML, it lacks tools for tackling imbalanced categories and advanced data transformations (like discretization, …). Also, in MLJ, packages implementing the models take more time to write than the equivalent implementation using Scikit-learn (but in that case the error messages are a lot worse). In DL, Flux is more general and has far fewer features than TF/PyTorch (reading files, segmentation, preprocessing, …). There is some work, like FastAI, but it is still in progress.

3 and 4. I do not have solid information about the performance; there are no really good benchmarks (or at least I do not know of them).

  5. I think the documentation should be improved, and detected bugs and missing features should be addressed to reach the level of other packages.

  6. Because it can be a great alternative that could become a lot better without putting a lot of resources into it.

  7. I used to use R more for data processing, and pandas in Python, but now I use Julia a lot more. For ML I have used MLJ, the Julia framework; however, for imbalance and tuning I usually use Python more. Nowadays I use TF or PyTorch for DL. I hope to be able to replicate that work in Julia, but until recently, with FastAI, a lot of preprocessing was not implemented.

4 Likes

I completely agree with what @patrick-kidger and @jgreener64 said. Julia does have huge potential for machine learning, but its current state is a little bit mixed. Personally, coming from climate science and wanting to use SciML as a tool for my research, I’m left with mixed feelings. Some developers/researchers have a super solid background in computer science, and/or can afford spending a lot of time doing dev work. For others, like me, this is only a part of the job, and we could use a little more user-friendliness. I understand that this is also a consequence of the novelty of many of these libraries and methods, but I’m often struggling to find the necessary information in the documentation, and errors are often cryptic and hard to debug.

More specifically, the main reason I’m sticking with Julia for SciML is because the DifferentialEquations.jl library is top notch. It works super well, and I haven’t found anything similar in Python. However, it’s the AD part that is becoming a true pain for my research. I recently started a similar open discussion about the state of differentiable physics in Julia, which also highlighted some of the current limitations (and strong points) of the AD ecosystem in Julia. Since I started working with Julia, I’ve had two bugs with Zygote which have slowed my work by several months. On a positive note, this has forced me to plunge into the code and learn a lot about the libraries I’m using. But I’m finding myself in a situation where this is becoming too much, and I need to spend a lot of time debugging code instead of doing climate research. Moreover, the documentation of both Zygote and Flux is pretty small, and I already found myself making a PR to add some extra comments to the Zygote documentation, because as a newcomer to the library I felt completely lost at the beginning.

I guess all this will be fixed with time, as new people join the community and the libraries become more mature. I still think Julia is the best choice for SciML, but more care should be taken into making these libraries (and their documentation) more user friendly. Otherwise I can totally understand that a large pool of potential users gets scared away. Just my two humble cents.

25 Likes

Out of curiosity, what were these bugs?

2 Likes

The first one was extremely simple, but it was super hard to debug. Basically Zygote couldn’t provide a gradient for the sqrt function, since I was applying it to a matrix with zeros in it. Here is a GitHub issue with more details. Irrespective of the technicalities, sqrt is such a common function that having a bug in it will surely impact a large number of users.
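One plausible way zeros under a sqrt can poison a gradient, shown as a hand-written reverse-mode rule in Python. Whether this is the mechanism in the linked issue is an assumption; the mathematical fact it rests on is only that the derivative of sqrt(x), 1/(2·sqrt(x)), is unbounded at x = 0. The function names are hypothetical.

```python
import math


def sqrt_pullback(x, upstream):
    """Reverse-mode rule for y = sqrt(x): returns upstream * dy/dx."""
    local = math.inf if x == 0.0 else 0.5 / math.sqrt(x)
    return upstream * local


def masked_sqrt_pullback(x, upstream):
    """Variant that returns 0 when nothing flows back, avoiding 0 * inf."""
    return 0.0 if upstream == 0.0 else sqrt_pullback(x, upstream)


print(sqrt_pullback(4.0, 1.0))          # ordinary case
print(sqrt_pullback(0.0, 1.0))          # infinite slope at zero
print(sqrt_pullback(0.0, 0.0))          # 0 * inf is NaN, even though this entry is unused
print(masked_sqrt_pullback(0.0, 0.0))
```

A single NaN like this spreads through every later multiplication and sum, which is why one zero entry in a matrix can be enough to ruin an entire gradient.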

The other (and current) one is more obscure, and still under investigation. For some reason Zygote is giving me gradients that are all zero (which should probably just error), while another AD library (ReverseDiff) is working. The problem is that I need Zygote for my model to be able to backpropagate with acceptable speeds. ReverseDiff is just too slow for my case.

These bugs are problematic for the average user because (1) Zygote is pretty hard to debug, and (2) they require very specific skills which only the library developers have. Luckily, everyone in the Julia community is extremely helpful and nice. But I really wish I could handle more things by myself just by reading a more complete documentation or by having more meaningful errors.

7 Likes

We rely heavily on Julia for our Differential Equations work. But these issues have driven us back to Python for some of the ML stuff we have started doing.

These things impact other libraries as well. Julia needs to start taking these issues seriously by providing library authors with tools that address them at the language level.

16 Likes