State of machine learning in Julia

As someone mostly busy with a day job that restricts OSS dev, I feel like I don’t have much standing to express what I do below, and it might not be well received. On the other hand, I also really love Julia and its amazing community. I’m personally invested in seeing it succeed, and would thus like to explore what I perceive as another important facet of some of the issues raised. I’m open to pushback, or even to being ignored.

Julia’s genius was that if you restrict dynamism a bit while having really clever design, you can keep the vast majority of what you like about Python, gain a lot of extra composability, and have something that’s much easier on the compiler. Almost a Pareto-optimal situation.

This balance was struck in an era before deep learning, autodiff, and research into next-generation static FP languages that both preserve more static information and improve usability and ergonomics. Dynamic typing has been stripped down to its essence, and it’s very debatable whether that essence is inherently better at all, much less worth the trade-offs it carries when trying to do whole-language differentiable programming. A time horizon of years has been thrown around for when Julia will be fast enough at general GPU + DP work that speed on prosaic use cases falls out of it, and that might be optimistic. We also see correctness issues, which is even worse.

Now the demands on a language are higher and the trade-off space is different. We have dependent typing, type inference, effect systems, and static languages with REPLs. We can have a language that encodes more static information, with a net improvement in usability for modern applications.

A language like Dex exhibits these. I’m concerned that while Julia asymptotically chases the promise of whole-language, correct, fast DP, a language with better trade-offs like Dex will get there first, while also preventing whole classes of performance and correctness bugs that Julia hasn’t even begun to address.

Here’s an excerpt from the Dex paper:

We feel that Dex gained a great deal as a language from being co-designed with its automatic differentiation system. AD is something like a very demanding user of the language—it is always trying to write programs the compiler developers did not anticipate, and always producing compelling bug reports or feature requests when those programs do not work or are slower than they should be. In this section, we discuss a few specific subtleties in the design of Dex’s AD, and the effects AD has had on the rest of the language and compiler.

This perfectly describes the situation for the last five years.

I’ll repost something I said on Slack:

Certainly the devs are spread thin, but debugging low-level IRs and fragile compiler heuristics is always going to be more difficult than relying on static performance guarantees, as Dex demonstrates for its reverse-mode gradient function in the first image.

Brian said it best:

“In fact, Julia’s broadcasting is a good example of how a simpler, more composable interface can do the job of more complex, numerous and edge-case prone specialized compiler machinery. Making your loss function run fast and not allocate a ton without plugging in something on the level of XLA is still a tall order, and $DEITY help you if said function hits a non-optimized path in XLA too. Even in Julia land, we still don’t have a stable solution for fused broadcasts in a GPU-friendly reverse mode AD.”
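
To unpack the broadcasting bit in that quote for anyone who hasn’t seen it: dotted calls in Julia fuse syntactically into a single loop, so the simple surface syntax already does the work a dedicated fusion pass would otherwise have to. A trivial sketch of my own (not from Brian’s message):

```julia
x = rand(10_000)
y = similar(x)

# The whole right-hand side fuses into one traversal of x with no temporaries;
# on a CuArray the same line becomes a single GPU kernel. That's the forward
# pass only; the reverse-mode AD story is exactly the unsolved part Brian mentions.
y .= 2 .* x .+ sin.(x)
```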

Unless some sort of static system is introduced, I fear we’re always going to be chasing down not-so-corner cases in performance (computational complexity, memory, and parallelism) and correctness (Dex can guarantee these). I know it’s against Julia’s ethos, but it seems important for ML, AD, and the accelerator world. Otherwise I fear there’s a structural constraint and this will take an inordinate amount of work, which is what you imply when you say it will take years to get arbitrary code fast.

Maybe this semi-static plan is the answer? How can we enforce the right semantics and have them propagate across functions? If we just say “well, it’s in the code, and sometimes it will all fit into place and infer, and sometimes it won’t and you’ll get `Any`”, that problem explodes once we’re talking about AD, composability, and GPU codegen. It’s like playing whack-a-mole, and it’s no longer just about inlining and unboxing: we have correctness, computational complexity, accelerator codegen, parallelism, etc. to worry about now.
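
To make the “you’ll get `Any`” failure mode concrete, here’s a toy sketch of my own (deliberately simplified, and not the actual Conv issue below):

```julia
using InteractiveUtils   # for @code_warntype

# The untyped field means nothing downstream of `l.w` can be inferred.
struct LooseLayer
    w                    # field type defaults to Any
end
apply(l::LooseLayer, x) = l.w * x .+ 1

# The concretely parameterized version infers cleanly.
struct TightLayer{T<:AbstractMatrix}
    w::T
end
apply(l::TightLayer, x) = l.w * x .+ 1

@code_warntype apply(LooseLayer(rand(2, 2)), rand(2))   # body and return type are ::Any
@code_warntype apply(TightLayer(rand(2, 2)), rand(2))   # concrete Vector{Float64}
```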

We’re in 2022 and Conv still isn’t type stable: Flux.Conv type instability · Issue #1178 · FluxML/Flux.jl · GitHub. And it’s not a trivial fix.

Dex is still a research project, and it certainly hasn’t proven that it can provide a solution to everything, but it feels like it’s going in the right direction. Even in a hypothetical situation where the e-graph passes work well, Dex expresses everything as typed index sets, effects, and loops, so it doesn’t need as many compiler heuristics.

It hasn’t yet proven itself capable of taking these inlined loops and generating fast accelerator code, but the MLIR project has plenty of people working on that, and I think it should get there with time.

What about Julia’s strength? Well, if you read Rethink overloading · Issue #671 · google-research/dex-lang · GitHub, one of the Dex devs acknowledges that Julia’s secret composability sauce is the combination of subtyping and pervasive multiple dispatch. It remains to be seen whether Dex can provide a similar effect. But if Julia never gets fast and correct enough, I don’t think the marginal composability benefit will matter, as the other ML languages aren’t slouches there either.
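
For anyone who hasn’t felt that composability first-hand, here’s a tiny made-up illustration (the `Point` type and `centroid` function are mine, not from the issue): generic code composes with a user-defined type purely through dispatch.

```julia
# Hypothetical user type; nothing here comes from the Dex issue.
struct Point{T<:Real}
    x::T
    y::T
end

# Teach Julia the two operations the type needs...
Base.:+(a::Point, b::Point) = Point(a.x + b.x, a.y + b.y)
Base.:/(a::Point, n::Real)  = Point(a.x / n, a.y / n)

# ...and generic code written with no knowledge of Point just works with it:
centroid(points) = sum(points) / length(points)

centroid([Point(0.0, 0.0), Point(2.0, 4.0)])   # Point{Float64}(1.0, 2.0)
```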

So, where does that leave Julia for the future? I think that to break out of its niche use cases, it has to somehow make all this development intrinsically more tenable, especially given that it lacks FAANG resources. It’s certainly not possible to dramatically shift the semantics of the language, but people are exploring ways to have opt-in static features (https://twitter.com/KenoFischer/status/1407810981338796035, https://news.ycombinator.com/item?id=26136212, JET.jl, and Jan Vitek’s group).
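
As a taste of what “opt-in static” could mean in practice (hedging that this is only my understanding of JET.jl’s current interface, and the `rescale` toy is mine), JET’s optimization analysis can already flag runtime dispatch before anything runs:

```julia
using JET

scale = 2.0               # non-const global: its type is free to change at any time

rescale(x) = scale * x    # `scale` reads back as Any, so the `*` becomes runtime dispatch

@report_opt rescale(3.0)  # reports the dispatch site statically, without executing rescale
```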

I wonder if this could all come together into a coherent story, so that we could have a language for ML that’s correct, ergonomic (including good error messages), and fast enough for both DP and prosaic ML, with type safety AND the composability of Julia. Why should someone prefer that over Python if they’re just stacking layers of fat linalg? Well, that’s where DL is now, but it could change. As Chris mentioned, there is arguably a Sapir-Whorf effect keeping it there.

Also, all the non-ML stuff like data handling, munging, viz, etc. is much, much more pleasant in Julia. (Pandas makes me :frowning: )

I’m rooting for Julia to become more prevalent in general ML/DS, but the situation is different now.

Not sure how hopeful I am at this point. I’ve seen very little acknowledgement of these structural technical issues. Is that a Kuhn-style ossification, which is an understandable part of normal human epistemology, or am I just totally wrong here? I’m very open to, and would prefer, the latter.

(Please forgive the somewhat ad hoc and not ideally organized/proofread response; I fired this off very quickly and the heat is broken.)
