State of machine learning in Julia

Yes, one thing to mention is that the Julia community is large and not a monolith, so there are many people developing these tools, all with their own reasons and aspirations. While some institutions tend to have more of the developers for AD and ML libraries (specifically the Julia Lab and Julia Computing), those entities are large and are not monoliths themselves. Even at the Julia Lab, I have no control over why people work on these problems; I just work with the students and research software engineers to guide them towards successful projects. Many people are doing it as ML for ML’s sake, and that’s fine.

But I think everyone should just be honest and clear as to some of the technical aspects and how they relate to the higher level decisions that have developed such large labs around this topic.

“trying to take the Python ML juggernaut on in its own territory is at best aspirational”

No, that’s an understatement. Let’s make it absolutely clear: there is nothing in the technical approach of differentiable programming that will make “conventional ML” faster. Period. A perfect Zygote or Diffractor will not make matrix multiplication kernels faster, it will not make convolutional kernels faster, and it will not make Transformer kernels faster. For large “big data” conventional machine learning, calls to the kernels are on the order of tens to hundreds of seconds. The AD overhead of a slow AD like PyTorch or even just AutoGrad is in the milliseconds per operation. A source-to-source AD that cuts that down to close to zero is not getting even a 1% gain in those applications. Source-to-source AD is a much larger and harder project, one that trades a lot of added complexity for applicability to full dynamism and lower overhead (+ JIT compilation of all reverse paths). Conventional ML models like transformers do not use this dynamism, and they do not have to worry about this overhead. The current AD work will not magically some day produce something compelling enough to make conventional ML users pack up and switch from Python. If that were the purpose of those projects, then those projects would be an extremely dumb idea. Why build a brand new multi-million dollar stadium for your kid’s elementary school football team? It’s not a fit-for-purpose idea, and it will actually hold the Julia ecosystem back for a bit in this domain because of the added complexity.
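To put rough numbers on that (illustrative arithmetic using the figures above, not a measurement): ~1 ms of AD overhead per operation against kernel calls of ~10 s is about 1/10,000 of the runtime, so even driving the overhead all the way to zero changes total training time by roughly 0.01%.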

Maybe having full language support will bring some ergonomic gains: it will integrate with the profiler and debuggers better than DSLs generally do, and if someone happens to write a model in the “wrong” way it could play nicer than, say, JAX, where writing something that isn’t functional and pure :man_shrugging: can silently give incorrect gradients. But we’re talking minor gains at the end of the day for those applications.

But let’s dig even deeper. Zygote’s purpose was to not unroll loops, so that the AD could JIT compile loopy code with small kernels. That’s a very nice improvement for domains that need loopy code with small kernels: you can expect some pretty good performance gains, and you should choose Zygote if that’s your domain. Conventional ML is not in that domain. :man_shrugging: sorry. The whole emerging SciML domain happened to fit it, and that’s how Zygote found a home there, which launched the organization and such. With that lens, it should be no surprise that in conventional ML Julia did not capture the whole audience, whereas in SciML it became a big chunk of the (still rather small) field. It’s not random, and it’s not just sweat and grit; there are real technical reasons behind it that you shouldn’t just gloss over.
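To make “loopy code with small kernels” concrete, here is a minimal hedged sketch (my own toy example, not from this discussion): a scalar explicit-Euler loop whose per-iteration work is tiny, which Zygote can differentiate without tracing it into a huge unrolled tape.

```julia
using Zygote

# Toy SciML-style loss: an explicit Euler solve of a logistic-type ODE,
# i.e. many loop iterations, each doing only a few scalar operations.
function euler_loss(p; n = 100, dt = 0.01)
    u = 1.0
    for _ in 1:n
        u += dt * (p[1] * u - p[2] * u^2)   # tiny "kernel" inside the loop
    end
    return abs2(u - 0.5)                    # distance to a target state
end

# Reverse-mode gradient with respect to the parameters p
g = Zygote.gradient(euler_loss, [1.5, 0.7])[1]
```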

Diffractor.jl’s driving emphasis was a category-theoretic formulation for higher-order derivatives. That gives you some massive speedups if you’re calculating third or fourth derivatives. But in conventional ML, who’s doing that? People don’t take Hessians of neural networks, let alone anything higher. Yes, there will be some spillover effects that improve conventional ML cases because of the change of target towards typed IR (potential compile-time improvements, maintainability, etc.). But flipping the Diffractor switch won’t be the day Flux suddenly gets a whole lot better for conventional ML. The reason for this kind of tool is applications like physics-informed neural networks, which routinely take 3rd-order derivatives and above. That’s the kind of application that funded it (specifically for use in NeuralPDE.jl). That’s a growing field, enough so that the NVIDIA CEO keeps mentioning physics-informed neural networks, and it’s an area where this kind of tool will make a substantially noticeable difference. But that’s not NLP or image processing with convnets and transformers. For those cases, Diffractor would be a very hard project for very little gain; it would make no sense. If the purpose of Diffractor were those domains, it would be a bad idea.
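For a sense of why “3rd-order derivatives and above” is a different regime, here is a hedged illustration (nested ForwardDiff, not Diffractor’s API, and `u` is just a stand-in for a trained network): a PINN-style residual needs derivatives of the model output with respect to its inputs, sometimes to third order, and each extra order multiplies the cost.

```julia
using ForwardDiff

u(x) = sin(3x) + x^2                       # stand-in for a neural network u(x)
du(x)  = ForwardDiff.derivative(u, x)      # u'(x)
d2u(x) = ForwardDiff.derivative(du, x)     # u''(x), e.g. the u_xx term in a PDE residual
d3u(x) = ForwardDiff.derivative(d2u, x)    # u'''(x): cheap here, expensive for a real network

d3u(0.5)
```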

So let’s refocus a little. Let’s say your goal is to improve conventional ML. How would you do it? Here’s a few things that come to mind for me:

  1. You could focus a project on conventional ML researchers by making it easier to develop faster kernels. This would help people out of the “ML is stuck in a rut” problem, where better ideas can be slower than worse ideas simply because of how much the standard kernels have been optimized. If you want to do this, you should develop an AD that is really good at differentiating compute kernels. Zygote and Diffractor are not the tools for this; Enzyme.jl is (see the sketch after this list). See the paper on generating adjoints of GPU kernels as an example. Or you could develop tools like LoopVectorization.jl but targeted at GPUs; KernelAbstractions.jl is in that space.
  2. You could focus a project on making it easier to capture more high level kernel fusions to optimize the kernel-centric code. That’s the e-graphs projects, and that’s what the folks at Google are doing with XLA. That’s what MLIR is aiming to do.
  3. You could focus a project on making it easier to do distributed multi-GPU training. The ergonomics here are still rather difficult, even with TensorFlow/XLA: easy installation and running on local compute clusters matter a lot. DaggerFlux.jl is probably the closest project we have to this other than XLA.jl.
  4. You could focus on writing faster GPU kernels for specific tasks.
  5. You could make packages with experimental APIs to improve the ergonomics of conventional training workflows. Integrate some automation in there. Automatic MLops? ML libraries without implicit global parameter references?
  6. You could, instead of waiting for Zygote and Diffractor to be “complete”, skip ahead and do ML on small DSLs. DSLs will always be easier to optimize given their constrained nature. Yota.jl is a great example of this. It uses a tracer, Ghost.jl, to get a simpler IR and does some nice things on that.
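As an illustration of point 1 (a hedged sketch with made-up names like `saxpy!`, not code from this thread): Enzyme.jl differentiates the kind of small, mutating, loop-heavy kernel that a kernel author actually writes, which Zygote’s non-mutating programming model rules out.

```julia
using Enzyme

# A tiny hand-written "kernel": y .+= a .* x, written as an explicit loop.
function saxpy!(y, a, x)
    @inbounds for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return nothing
end

x = rand(8);  dx = zeros(8)     # dx will receive the gradient w.r.t. x
y = rand(8);  dy = ones(8)      # dy seeds the adjoint of the output y

# Reverse-mode through the mutating kernel; the scalar `a` is held constant here.
Enzyme.autodiff(Reverse, saxpy!, Const,
                Duplicated(y, dy), Const(2.0), Duplicated(x, dx))

# Now dx[i] == a * dy[i] == 2.0, the pullback of the seed through the kernel.
```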

Noticeably absent from that list are the current ADs and the differentiable programming work. Those will do almost nothing for the conventional ML domain except maybe, just maybe, a few ergonomic improvements when everything works out. There are much better projects to work on if conventional ML were the goal. But for me and large parts of the Julia Lab, conventional ML is not the goal, which is why there is so much work and so many publications on differentiable programming tools. Hopefully this line of reasoning makes it as clear as daylight.
