State of machine learning in Julia

If you’ll allow me to start from the end first:

This is not well documented and ought to be so (PRs welcome if anyone is interested), but Flux is really an amalgamation of different sub libraries:

  1. An AD (Zygote)
  2. A set of ML kernels (NNlib)
  3. A module system (Functors.jl)
  4. A set of layers, optimizers and training loop helpers (Flux itself)

Using just #1 and #2 is equivalent to torch.nn.functional. Using 1, 2, and 3 gets you a JAX equivalent. The plan is to move optimizers out of #4 into a separate package (see Optimisers.jl) so that you can use them just like JAX users use Optax. This kind of shared infrastructure is already being exploited in the ecosystem: Knet uses NNlib (NNlib dev is a collaboration between Knet and Flux) and offers a “lower level” interface you may be interested in, while Avalon.jl uses NNlib + Functors for a more PyTorch-esque framework.
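To make that concrete, here’s a rough sketch of what using just #1 and #2 looks like, torch.nn.functional-style (the shapes and the toy loss are made up for illustration):

```julia
using NNlib, Zygote

# A "functional" model: no layer objects, just kernels (relu from NNlib)
# and plain arrays, differentiated by Zygote.
W1, b1 = randn(Float32, 32, 10), zeros(Float32, 32)
W2, b2 = randn(Float32, 1, 32), zeros(Float32, 1)

model(x, W1, b1, W2, b2) = W2 * relu.(W1 * x .+ b1) .+ b2

x = randn(Float32, 10, 16)   # batch of 16
y = randn(Float32, 1, 16)

loss(W1, b1, W2, b2) = sum(abs2, model(x, W1, b1, W2, b2) .- y)
grads = Zygote.gradient(loss, W1, b1, W2, b2)
```

Add Functors.jl (#3) on top and you can treat nested parameter structures as trees, which is roughly the JAX analogy above.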

Now to the broader, more philosophical point. I also use 100% Python for my own work, and the dynamics/motivation there are very similar to what you’ve described. Though not worded particularly pleasantly, I think this HN comment summarizes the struggle well:

My impression from your comment is that you don’t care that much about “standard” ML users. As a “standard” ML user (pytorch/jax), and a potential Julia user in the future, this is not what I like to hear.

Now, there have been some very good points made here and on different forums that trying to take the Python ML juggernaut on in its own territory is at best aspirational (E: after reading Chris’ response, the original, more forceful “fool’s errand” would’ve been more appropriate :stuck_out_tongue: ). What I don’t think has happened is saying the “quiet part out loud” and following that to its logical conclusion. Of course the Julia community is not a monolith and there will be divergent opinions on how to approach ecosystem development, but folks like the aforementioned HN commenter are looking for a clearer statement. That is, where do we fall between the two extremes of “novel architectures/approaches are the only way to go; if they do it well then we shouldn’t bother” and “Julia ML should be #1 at everything”? And depending on the vision, what are some concrete steps that can be taken to support it?

Edit: to make sure I’m not underselling or misrepresenting things, there are some great and very clear roadmaps for parts of the ML space already. SciML and advanced AD come to mind. The question above is about the complement: what should be put into the “don’t expect anything big here unless you’re willing to help develop or fund it” bucket?

5 Likes

+1 to calling it “conventional” ML (or some other name), since there is already an important programming language called Standard ML (meta-language) that Julia packages take features from.

Yes, one thing to mention is that the Julia community is large and not a monolith, so there are many people developing these tools, all with their own reasons and aspirations. While some institutions tend to have more of the developers for AD and ML libraries (specifically the Julia Lab and Julia Computing), those entities are large and not monoliths themselves. Even at the Julia Lab, I have no control over why people work on these problems; rather, I just work with the students and research software engineers to guide them towards successful projects. Many people are doing it as ML for ML’s sake, and that’s fine.

But I think everyone should just be honest and clear as to some of the technical aspects and how they relate to the higher level decisions that have developed such large labs around this topic.

trying to take the Python ML juggernaut on in its own territory is at best aspirational

No, that’s an understatement. Let’s make it absolutely clear: there is nothing in the technical approach of differentiable programming that will make “conventional ML” faster. Period. A perfect Zygote or Diffractor will not make matrix multiplication kernels faster, it will not make convolutional kernels faster, and it will not make Transformer kernels faster. For large “big data” conventional machine learning, calls to the kernels are on the order of tens to hundreds of seconds. The AD overhead of a slow AD like PyTorch’s, or even just AutoGrad, is in the milliseconds per operation. A source-to-source AD that cuts that down to close to zero is not getting even a 1% gain in those applications. Source-to-source AD is a much larger and harder project which trades applicability to full dynamism and lower overhead (+ JIT compilation of all reverse paths) for a lot of added complexity. Conventional ML models like transformers do not use this dynamism, so those models do not have to worry about this overhead. The current AD work will not magically some day give you something so compelling that conventional ML users pack up and switch from Python. If that were the purpose of those projects, then those projects would be an extremely dumb idea. Why build a brand new multi-million dollar stadium for your kid’s elementary school football team? It’s not a fit-for-purpose idea, and it will actually hold the Julia ecosystem back for a bit in this domain because of the added complexity.
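To see why, here’s a back-of-the-envelope check anyone can run (the sizes are arbitrary; the point is only the ratio of kernel time to AD bookkeeping):

```julia
using BenchmarkTools, Zygote

A = randn(Float32, 4096, 4096)
B = randn(Float32, 4096, 4096)

# Time dominated by the BLAS kernel itself:
@btime $A * $B

# Reverse mode around the same kernel: the pullback is itself two more
# matmuls, so the AD bookkeeping is a rounding error at these sizes.
@btime Zygote.gradient((A, B) -> sum(A * B), $A, $B)
```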

Maybe having full language support will bring some ergonomic gains: it will integrate with the profiler and debuggers better than DSLs generally do, and if someone happens to write a model in the “wrong” way it could play nicer than, say, JAX, where if you write something that isn’t functional and pure :man_shrugging: incorrect gradients can occur. But we’re talking minor gains at the end of the day for those applications.

But let’s dig even deeper. Zygote’s purpose was to not unroll loops, so that the AD could JIT compile loopy code with small kernels. That’s a very nice improvement for domains that need loopy code with small kernels: you can expect some pretty good performance gains, and you should choose Zygote if that’s your domain. Conventional ML is not in that domain. :man_shrugging: sorry. The whole emerging SciML domain happened to fit, and that’s how Zygote found a home there, which launched the organization and such. With that lens, it should be no surprise that in conventional ML Julia did not capture the whole audience, whereas in SciML it became a big chunk of the (still rather small) field. It’s not random, and it’s not just sweat and grit; there are real technical reasons behind it that you shouldn’t just gloss over.
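(For a concrete picture of “loopy code with small kernels”, here is a contrived toy. A tracing AD would unroll all 10,000 iterations into its tape; a source-to-source AD can differentiate the loop as a loop:)

```julia
using Zygote

function loopy(x)
    acc = zero(x)
    for i in 1:10_000
        acc += sin(x * i) / i   # many tiny scalar "kernels"
    end
    return acc
end

Zygote.gradient(loopy, 0.5)
```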

Diffractor.jl’s driving emphasis was a category-theoretic formulation for higher-order derivatives. That gives you some massive speedups if you’re calculating third or fourth derivatives. But in conventional ML, who’s doing that? People don’t take Hessians of neural networks, let alone anything higher. Yes, there will be some spillover effects that improve conventional ML cases because of changing the target towards typed IR (potential compile-time improvements, maintainability, etc.). But flipping the Diffractor switch won’t be the day Flux is suddenly a whole lot better for conventional ML. The reason for this kind of tool is applications like physics-informed neural networks, which routinely take 3rd-order derivatives and above. That’s the kind of application that funded it (specifically for use in NeuralPDE.jl). That’s a growing field, enough so that the NVIDIA CEO keeps mentioning physics-informed neural networks, and that’s an area where this kind of tool will make a substantially noticeable difference. But that’s not NLP or image processing with convnets and transformers. For those cases, Diffractor would be a very hard project for little gain; it would make no sense. If the purpose of Diffractor were those domains, it would be a bad idea.
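(For the unfamiliar, “third or fourth derivatives” means nesting AD. A toy sketch, shown here with ForwardDiff since nesting works there out of the box; each level of nesting multiplies the cost, which is exactly what Diffractor’s formulation targets:)

```julia
using ForwardDiff

f(x)  = sin(x)
d1(x) = ForwardDiff.derivative(f, x)    # first derivative
d2(x) = ForwardDiff.derivative(d1, x)   # second: duals of duals
d3(x) = ForwardDiff.derivative(d2, x)   # third: the PINN regime

d3(1.0)   # ≈ -cos(1.0)
```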

So let’s refocus a little. Let’s say your goal is to improve conventional ML. How would you do it? Here are a few things that come to mind for me:

  1. You could focus a project on conventional ML researchers by making it easier to develop faster kernels. This would help people out of the “ML is stuck in a rut” problem, where better ideas can be slower than worse ideas simply because of how much the standard kernels have been optimized. If you want to do this, you should develop an AD that is really good at differentiating compute kernels. Zygote and Diffractor are not the tools for this; Enzyme.jl is (see the sketch after this list). See the paper on generating adjoints of GPU kernels as an example. Or you could develop tools in the spirit of LoopVectorization.jl that instead target GPUs, like KernelAbstractions.jl.
  2. You could focus a project on making it easier to capture more high-level kernel fusions to optimize kernel-centric code. That’s the e-graphs projects, that’s what the folks at Google are doing with XLA, and that’s what MLIR is aiming to do.
  3. You could focus a project on making it easier to do distributed multi-GPU training. The ergonomics here are still rather difficult, even with TensorFlow/XLA; easy installation and running on local compute clusters would go a long way. DaggerFlux is probably the closest project we have to this, other than XLA.jl.
  4. You could focus on writing faster GPU kernels for specific tasks.
  5. You could make packages with experimental APIs to improve the ergonomics of conventional training workflows. Integrate some automation in there. Automatic MLOps? ML libraries without implicit global parameter references?
  6. You could, instead of waiting for Zygote and Diffractor to be “complete”, skip ahead and do ML on small DSLs. DSLs will always be easier to optimize given their constrained nature. Yota.jl is a great example of this. It uses a tracer, Ghost.jl, to get a simpler IR and does some nice things on that.
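As a flavor of item 1, here is a minimal Enzyme.jl sketch on a toy scalar kernel (the GPU-kernel case from the paper follows the same Duplicated shadow-buffer pattern):

```julia
using Enzyme

# A hand-written "kernel": sum of squares as a raw loop.
function sumsq(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]^2
    end
    return s
end

x  = rand(10)
dx = zero(x)   # shadow buffer that receives the gradient

# Reverse-mode AD of the kernel; the gradient accumulates into dx.
Enzyme.autodiff(Reverse, sumsq, Active, Duplicated(x, dx))
dx ≈ 2 .* x    # true
```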

Noticeably absent from that list are the current ADs and the differentiable programming work. Those will do almost nothing for the conventional ML domain except maybe, just maybe, a few ergonomic improvements when everything works out. There are much better projects to work on if conventional ML were the goal. But for me and large parts of the Julia Lab, conventional ML is not the goal, which is why there is so much work and so many publications on differentiable programming tools. Hopefully this line of reasoning makes it as clear as daylight.

38 Likes

ML in Julia has a bright future, and is currently very strong in certain areas. I am constantly impressed by the intelligence of those working in the Julia AD space. Everything is possible in Julia. In fact everything is trivial in Julia, if you are very clever. This is the current problem:

ML in Julia requires high existing knowledge or a lot of time searching/doing trial and error.

Previous answers have argued that there are no technical limitations preventing Julia from reaching parity with PyTorch/JAX for general deep learning, but there are other important factors that drive adoption: thorough documentation, blogs, useful error messages, stability, and a vague feeling of “trust”. It can be tempting to think that these things follow naturally from the technical possibilities, but they are often driven by that special type of contributor who prefers taking off rough edges to adding new ones.

Getting the Julia ML ecosystem to the scale of PyTorch requires drawing in these types of contributions, and without the luxury of big-tech support or resources. As others have said, it’s not the primary focus of many Julia developers (myself included).

On a personal level, I am currently in the wild woods of developing novel differentiable algorithms in Julia. I love it, but it has come with a constant stream of cryptic errors, missing features, and incorrect gradients. Sometimes I long for the warm blanket that is PyTorch.

Julia should seek to become that warm blanket. It already has for general scientific programming.

22 Likes

I would like to give my 2c on the topic.

First, I will speak from my point of view considering both typical Machine Learning, dominated by scikit-learn in Python, and Deep Learning, dominated by TensorFlow and PyTorch, and the more recent Flax.

  1. Where does ML in Julia shine today?

Well, it is a difficult question because it depends on the ecosystem being compared (scikit-learn, TensorFlow, PyTorch, caret in R, …).

In my opinion, Julia shines in the fact that packages/libraries do not have to implement too much themselves, which enforces compatibility between them. For instance, MLJFlux, or the fact that all of them can work not only with DataFrames but with any structure that implements the Tables.jl interface.
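For example, a hypothetical sketch (it assumes MLJ plus the DecisionTree interface package are installed):

```julia
using MLJ

# A NamedTuple of vectors already satisfies the Tables.jl interface,
# so MLJ models accept it just like a DataFrame.
X = (x1 = rand(100), x2 = rand(100))
y = 2 .* X.x1 .+ X.x2 .+ 0.1 .* randn(100)

Tree = @load DecisionTreeRegressor pkg=DecisionTree
mach = machine(Tree(), X, y)
fit!(mach)
predict(mach, X)
```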

It does not shine in speed because, in my opinion, TensorFlow, PyTorch, scikit-learn, and JAX are more mature libraries and their implementations use C/C++ and GPUs. However, Julia can be more flexible, because everything is in the same language.

Also, it shines in the simplicity of the implementations: you can read Flux’s source, for instance, and understand a lot of it.

  2. The Julia ML ecosystem is currently inferior in features. For instance, in ML, for tackling imbalanced classes or advanced data transformations (like discretization, …). Also, in MLJ, the packages implementing the models take more time than the scikit-learn implementations (though in that case the error messages are a lot worse). In DL, Flux is more general but has a lot fewer features than TF/PyTorch (reading files, segmentation, preprocessing, …). There is some work, like FastAI.jl, but it is still in progress.

3 and 4. I do not have solid information about performance; there are no really good benchmarks (or at least I do not know of them).

  5. I think the documentation should be improved, and known bugs and missing features should be addressed to reach the level of other packages.

  6. Because it can be a great alternative, and it could become a lot better without putting a lot of resources into it.

  7. I used to use R for data processing, and pandas in Python, but now I use Julia a lot more. For ML I have used the MLJ framework in Julia; however, for imbalanced data and tuning I still usually use Python. Nowadays I use TF or PyTorch for DL, but I hope to be able to replicate that work in Julia; until recently, even with FastAI.jl, a lot of preprocessing was not implemented.

4 Likes

I completely agree with what @patrick-kidger and @jgreener64 said. Julia indeed has huge potential for machine learning, but its current state is a little bit mixed. Personally, coming from climate science and wanting to use SciML as a tool for my research, I’m left with mixed feelings. Some developers/researchers have a super solid background in computer science, and/or can afford to spend a lot of time doing dev work. For others, like me, this is only a part of my job, and we could use a little more user-friendliness. I understand that this is also a consequence of the novelty of many of these libraries and methods, but I’m often struggling to find the necessary information in the documentation, and errors are often cryptic and hard to debug.

More specifically, the main reason I’m sticking with Julia for SciML is because the DifferentialEquations.jl library is top notch. It works super well, and I haven’t found anything similar in Python. However, it’s the AD part that is becoming a true pain for my research. I recently started a similar open discussion about the state of differentiable physics in Julia, which also highlighted some of the current limitations (and strong points) of the AD ecosystem in Julia. Since I started working with Julia, I’ve had two bugs with Zygote which have slowed my work by several months. On a positive note, this has forced me to plunge into the code and learn a lot about the libraries I’m using. But I’m finding myself in a situation where this is becoming too much, and I need to spend a lot of time debugging code instead of doing climate research. Moreover, the documentation of both Zygote and Flux is pretty small, and I already found myself making a PR to add some extra comments to the Zygote documentation, because as a newcomer to the library I felt completely lost at the beginning.

I guess all this will be fixed with time, as new people join the community and the libraries become more mature. I still think Julia is the best choice for SciML, but more care should be taken in making these libraries (and their documentation) more user-friendly. Otherwise I can totally understand that a large pool of potential users gets scared away. Just my two humble cents.

19 Likes

Out of curiosity, what were these bugs?

1 Like

The first one was extremely simple, but it was super hard to debug. Basically, Zygote couldn’t provide a gradient for the sqrt function when I applied it to a matrix with zeros in it. Here is a GitHub issue with more details. Irrespective of the technicalities, sqrt is such a common function that a bug in it will surely impact a large number of users.
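For reference, the general failure mode is easy to reproduce (this is a generic illustration of differentiating through sqrt at zero, not the exact code from the issue):

```julia
using Zygote

# d/dx sqrt(x) = 1 / (2 * sqrt(x)), which is Inf at x = 0, so a single
# zero entry poisons the whole gradient.
A = [1.0 0.0; 4.0 9.0]
Zygote.gradient(A -> sum(sqrt.(A)), A)[1]   # contains Inf at the zero entry
```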

The other (and current) one is more obscure, and still under investigation. For some reason Zygote is giving me gradients that are all zero (which should probably just be an error), while another AD library (ReverseDiff) works. The problem is that I need Zygote for my model to backpropagate at acceptable speeds; ReverseDiff is just too slow for my case.
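(In case it helps anyone in the same spot: cross-checking two independent ADs on a toy function is a cheap first diagnostic; `f` here is just a stand-in for the real model.)

```julia
using Zygote, ReverseDiff

f(x) = sum(abs2, x .- 1)
x = rand(5)

g_zy = Zygote.gradient(f, x)[1]
g_rd = ReverseDiff.gradient(f, x)
g_zy ≈ g_rd   # a silent `false` (or an all-zero g_zy) flags a bug
```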

These bugs are problematic for the average user because (1) Zygote is pretty hard to debug, and (2) they require very specific skills which only the library developers have. Luckily, everyone in the Julia community is extremely helpful and nice. But I really wish I could handle more things by myself, just by reading more complete documentation or by getting more meaningful errors.

4 Likes

We rely heavily on Julia for our Differential Equations work. But these issues have driven us back to Python for some of the ML stuff we have started doing.

These things impact other libraries as well. Julia needs to start taking these issues seriously by providing library authors with tools that address them at the language level.

11 Likes

As someone mostly busy with a day job that restricts OSS dev I feel like I don’t have much standing to express what I do below, and it might not be well received. On the other hand, I also really love Julia and its amazing community. I am personally invested in seeing it succeed and thus would like to explore what I perceive as another important facet to some of the issues raised. I’m open to pushback or even being ignored.

Julia’s genius was that if you restrict dynamism a bit while having really clever design, you can keep the vast majority of what you like about Python, get a lot of extra composability, and have something that’s much easier on the compiler. Almost a Pareto-optimal situation.

This balance was struck in an era before deep learning, autodiff, and research into next-generation static FP programming languages, which both preserve more static information and improve usability and ergonomics. Dynamic typing has been stripped down to its essence, and it’s very debatable whether it’s inherently better at all, much less with the trade-offs it brings when trying to do full-language differentiable programming. A time horizon of years has been thrown around for when Julia will be fast enough in general at GPU + DP that speed on prosaic use cases falls out of it. That might be optimistic, and we also see that there are correctness issues, which is even worse.

Now, the demands on a language are higher and the trade-off space is different. We have dependent typing, type inference, effect systems, and static languages with REPLs. We can have a language that encodes more static information, with a net improvement in usability for modern applications.

A language like Dex exhibits these. I’m concerned that while Julia chases the asymptotic promise of full-language, correct, fast DP, a language with better trade-offs like Dex will get there first, while preventing whole classes of performance and correctness bugs that Julia hasn’t even begun to address.

Here’s an excerpt from the Dex paper:

We feel that Dex gained a great deal as a language from being co-designed with its automatic differentiation system. AD is something like a very demanding user of the language—it is always trying to write programs the compiler developers did not anticipate, and always producing compelling bug reports or feature requests when those programs do not work or are slower than they should be. In this section, we discuss a few specific subtleties in the design of Dex’s AD, and the effects AD has had on the rest of the language and compiler.

This perfectly describes the situation for the last five years.

I’ll repost something I said on slack:

Certainly the devs are spread thin, but debugging low-level IRs and fragile compiler heuristics is always going to be more difficult than relying on static performance guarantees, as Dex demonstrates for its reverse-mode gradient function in the first image.

Brian said it best:

“In fact, Julia’s broadcasting is a good example of how a simpler, more composable interface can do the job of more complex, numerous and edge-case prone specialized compiler machinery. Making your loss function run fast and not allocate a ton without plugging in something on the level of XLA is still a tall order, and $DEITY help you if said function hits a non-optimized path in XLA too. Even in Julia land, we still don’t have a stable solution for fused broadcasts in a GPU-friendly reverse mode AD.”

Unless some sort of static system is introduced, I fear we’re always going to be chasing down not-so-corner cases in performance (computational complexity, memory, and parallelism) and correctness (Dex can guarantee these). I know it’s against Julia’s ethos, but it seems important for ML, AD, and the accelerator world. Otherwise I fear there’s a structural constraint/this will take an inordinate amount of work, which is what you imply when you say it will take years to get arbitrary code fast.

Maybe this semi-static plan is the answer? How can we enforce the right semantics and have them propagate across functions? If we just say “well, it’s in the code and sometimes it will all just fit into place/infer and sometimes it won’t and you’ll get `Any`”, that problem explodes when we’re talking about AD, composability, and GPU codegen. It’s like playing whack-a-mole. It’s no longer just about inlining and unboxing; we have correctness, computational complexity, accelerator codegen, parallelism, etc. to worry about now.

We’re in 2022 and Conv still isn’t type stable : Flux.Conv type instability · Issue #1178 · FluxML/Flux.jl · GitHub
And it’s not a trivial fix.
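(For anyone who wants to see this themselves, inspecting inference on a Conv call is straightforward; the exact output depends on your Flux version:)

```julia
using Flux, InteractiveUtils

c = Conv((3, 3), 3 => 16, relu)
x = rand(Float32, 32, 32, 3, 1)

# Red Any/Union entries in the output are the instability
# discussed in the linked issue.
@code_warntype c(x)
```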

Dex is still a research project, and it certainly hasn’t proven that it can provide a solution to everything, but it feels like it’s going in the right direction. Even in a hypothetical situation where the e-graph passes work well, in Dex everything is typed index sets, effects, and loops, so it doesn’t need as many compiler heuristics.

It hasn’t proven itself capable yet of taking these inlined loops and generating fast accelerator code, but the MLIR project has plenty of people working on that and I think it should get there with time.

What about Julia’s strengths? Well, if you read Rethink overloading · Issue #671 · google-research/dex-lang · GitHub, one of the Dex devs acknowledges that Julia’s secret composability sauce is the combination of subtyping and pervasive multiple dispatch. It remains to be seen whether Dex can provide a similar effect. But if Julia never gets fast and correct enough, I don’t think the marginal benefit will matter, as ML languages aren’t slouches there either.

So, where does that leave Julia for the future? I think to break out of its niche use case, it has to be able to somehow make all this development intrinsically more tenable, especially as it lacks FAANG resources. It’s certainly not possible to dramatically shift the semantics of the language, but people are exploring ways to have opt-in static features ( https://twitter.com/KenoFischer/status/1407810981338796035 and This package basically exposes information that the standard Julia compiler alre... | Hacker News and JET.jl, and Jan Vitek’s group).

I wonder if this could all come together into a coherent story, so we could have a language for ML that’s correct, ergonomic (including good error messages), and fast enough for both DP and prosaic ML, with type safety AND the composability of Julia. Why should someone prefer that over Python if they are just stacking layers of fat linalg? Well, that’s where DL is now, which could change. As Chris mentioned, there is arguably a Sapir-Whorf effect keeping it there.

Also, all the non-ML stuff like data handling, munging, viz, etc. is much, much more pleasant in Julia. (Pandas makes me :frowning: )

I’m rooting for Julia to become more prevalent in general ML/DS, but the situation is different now.

Not sure how hopeful I am at this point. I’ve seen very little acknowledgement of these structural technical issues. Is that a Kuhnian-style ossification, which is an understandable part of normal human epistemology, or am I just totally wrong here? I’m very open to, and would prefer, the latter.

(Please forgive the somewhat ad hoc and not ideally organized/proofread response. I fired this off very quickly and the heat is broken.)

20 Likes

I think it is important to note that we can already use explicit parameters with Flux/Zygote/Optimisers.jl. Simply put, use the model as the object to be differentiated. This returns a NamedTuple which can be consumed by the optimisers in Optimisers.jl. I have been blocked on making explicit parameters standard in Flux for some time now, but maybe we just take the jump and merge whatever needs to be merged to make it happen. I am certain that any issues with Optimisers.jl that may come up (in-place updates come to mind) can be solved with existing PRs to bridge that gap.
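Roughly, the pattern looks like this (a sketch; exact names in Optimisers.jl may still shift while it stabilizes):

```julia
using Flux, Optimisers, Zygote

model = Dense(4, 2)
x, y = randn(Float32, 4, 8), randn(Float32, 2, 8)

# Differentiate with respect to the model object itself; the gradient
# comes back as a NamedTuple mirroring the model's structure.
grads, = Zygote.gradient(m -> sum(abs2, m(x) .- y), model)

# Optimisers.jl consumes that NamedTuple directly.
state = Optimisers.setup(Optimisers.Adam(0.001), model)
state, model = Optimisers.update(state, model, grads)
```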

Re Conv: I’m pretty sure this is a regression, and something that will need to be addressed again in the future. The design hasn’t really changed in the PR either; it just goes easier on the compiler. Note that we are mostly tracking improvements in compile times there.

1 Like

I think the differences in the opinions shared here reflect a larger difference in mindset. @ChrisRackauckas reflects the scientific/research mindset and @patrick-kidger reflects the engineering mindset. I think where Julia really shines is scientific and research applications, while Python shines on the engineering side of things. Even when many say Julia doesn’t have mature packages in certain ML domains, they are overlooking the Julia APIs for Python packages and the PyCall library. Maybe we need to popularize the PyCall-based libraries a little bit more, to show people that they are not missing their favorite mature Python libraries in Julia. Moreover, as far as I understand, the Julian philosophy is not to reinvent everything that has already been developed in other languages, but to design and develop tools that don’t exist yet/tools for the future.

1 Like

Well, that’s part of the whack-a-mole dynamic I mentioned. Here it mostly affects compile times (probably because a static block of code is guaranteed in cuDNN, but I haven’t specifically checked), but that’s just an incidental benefit of this codepath, which is a best-case scenario.

My broader point is that this is not really the “fault” (I don’t like that word because it can connote a moral valence which doesn’t exist here) of the Flux devs, but is something to be expected from the language design and problem domain.

Further regarding the benefits of dynamism in Julia, I was very kindly pointed to the following excerpts from Jeff’s thesis:

2 Likes

Cannot agree more. I wasted a lot of time debugging Zygote, both when using it in my own research and after recommending it to my friends (since I feel obligated to debug code for them). I did not have a similar experience when I was using PyTorch. One of the main reasons why Zygote is so unreliable is that it uses generic rules, and with complex numbers it does not error correctly when the AD rule is wrong. Now I do not believe generic autodiff can provide reliable gradients. Autodiff rules must be concrete.
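(To illustrate one reading of “rules must be concrete”, here is a sketch with ChainRulesCore; the function and its rule are made up for illustration:)

```julia
using ChainRulesCore

mynorm(x) = sqrt(sum(abs2, x))

# Restrict the hand-written rule to the concrete real type it was
# derived for. Complex (or otherwise unexpected) inputs then miss this
# method instead of silently reusing math that does not apply to them.
function ChainRulesCore.rrule(::typeof(mynorm), x::Vector{Float64})
    y = mynorm(x)
    mynorm_pullback(ȳ) = (NoTangent(), ȳ .* x ./ y)
    return y, mynorm_pullback
end
```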

10 Likes

A lot of this discussion is enlightening to read. I don’t have much experience as a deep user of the Julia ML stack, more as an observer. I’ve used a lot more of the python stack because of external requirements. It is interesting to see how things develop here.

One thing I would recommend to anyone fretting about the Julia ML/DL space is to listen to the PyTorch Dev podcast. A good example episode is the one about meta-tensors and structured kernels. I would recommend this because:

  1. The episodes are ~15 minutes long, technical but not too complex, and relatively entertaining (Edward is good at discussing his thought process and providing color)
  2. It shows the engineering effort and considerations that have gone into PyTorch… it’s a lot. There are 56 episodes, each just as technical, gory, and fun. In the first 10 episodes Edward talks about 5 or 6 rewrites of large components of PyTorch. The show notes have links to issues and PRs as well - here’s the RFC for meta-tensors to enable sharing of some CPU and GPU code. There is also an episode about how they support their ~2200-ish kernels and have to design systems to reduce dev complexity.
  3. I’m not totally sure about this, but I think this is a possible on-ramp to enabling newer people to become more capable contributors to the internals of AD and ML systems. Julia faces a lot of the same issues that PyTorch does/did, so there’s a lot we can learn from them. There are a lot of places where the PyTorch devs went to great lengths to engineer systems to handle things that Julia handles almost trivially; there are also, seemingly, some examples in the other direction.

One take-away is that PyTorch is not simple. We hope to be able to do more with less because many of us believe that Julia is more productive and cohesive (?) than Python plus C++. We will face a lot of the same issues they had, plus some new ones of our own. I think near- and far-horizon topics like compiler extensibility and native code caching are also positives for the Julia ML scene, as they improve the dev and language-user experience. Then would it just be a question of engineering time?

23 Likes

Incidentally, there is now a parallel discussion – about this thread – on Hacker News:

https://news.ycombinator.com/item?id=29902846

Interesting reading, I think, to get the opinions of those outside of the Julia-sphere. (As per also my answer to Q4.)

7 Likes

Please do not link directly to Hacker News, as they consider that to be brigading.

1 Like

I love Julia and it’s my first choice whenever possible. However, as soon as it comes to ANNs, I’m afraid I still turn to Python/PyTorch. I would prefer to use Julia instead, but figure it might be useful to share the downsides I’ve encountered for “mainstream” machine learning models.

Examples of state-of-the-art models are not readily available

For example, consider GPT-2 from 2019, ages ago in ML land :wink: . If one searches Google for “julia gpt-2”, the first result is a blog post by someone named Julia describing a Python implementation. I mention that partly in jest, but it’s an apt summary—even when explicitly looking for Julia implementations of SotA models, you’ll probably find a Python implementation first.

If you dig a little deeper, you’d find Transformers.jl, which does indeed have a GPT-2 implementation. But now try to find a Julia implementation of Visual Transformer, Longformer, Linformer, Compressive Transformer, RoBERTa, etc. One could readily find multiple example repos in Python, typically including reference implementations from the authors or a major project.

Reference models are often broken

For example, the reference VGG implementation for Flux.jl was broken when run on Nvidia GPUs for around two years, until mid-2021. This is slightly unfair, as there were a ton of changes in Flux in this period, including the transition to Zygote. Partly, I think it’s a reflection that there’s still a lot of research and experimentation as to the best way to do AD in Julia. Hopefully, this will continue to standardize around best practices and get robust over time.

Memory usage and speed are worse

I don’t want to dwell on this too much because my impressions may be out of date. I’ve tried porting over some large ResNet-style convolutional neural networks / VAEs to Flux that operate on giant 3D movies. I haven’t been able to run equivalently sized models with Flux vs PyTorch, although perhaps this has changed recently.

Performance benchmarks are hard to come by

A number of folks have made some really nice benchmarks comparing Flux / PyTorch / TensorFlow, but I’m not aware of any that are regularly maintained. So even as someone eager to use Julia for ML, since all my other code is in Julia already, it’s hard for me to assess whether the ecosystem can meet my research needs without diving in to code up a benchmark.

Some useful recent benchmarks include:

GitHub - avik-pal/DeepLearningBenchmarks at update (Feb 2020; Flux within 0.5-1x of PyTorch for common layers)
Why is flux model slower than python? (Jan 2021; Flux 0.5x of PyTorch for VGG19)
Julia slowdown on long running programs with many allocations (June 2021; Issues with memory growth over time; solvable by calling garbage collector)
Allocation of Memory while evaluate a model (Nov 2021; Memory is allocated each time a model is run)

I think these highlight common pain points for a PyTorch developer that would consider switching to Flux.

There has been, and continues to be, major, major progress in the Julia ML ecosystem, and there is a ton of cool stuff that can practically only be done in Julia (I’m looking at you, SciML!). And it’s clear that Julia has an awesome / arguably the best skeleton for ANNs: Where we are headed and why it looks a lot like Julia (but not exactly like Julia) - compiler - PyTorch Dev Discussions. My sense is it’s largely a matter of having enough dev time. Google and Facebook have a ton of engineers working on their frameworks, and the Flux team has been disproportionately productive all things considered. I would think that companies allocating more resources to the ecosystem could really accelerate adoption.

17 Likes

Perhaps SizedArray from StaticArrays.jl would be useful here. On one hand, you are encoding the size of the array in the type, so array-size errors surface at compile time. On the other hand, this may add compilation latency due to the need to compile functions for each array size.

https://juliaarrays.github.io/StaticArrays.jl/stable/pages/api/#SizedArray:-a-decorate-size-wrapper-for-Array
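For instance, a toy sketch:

```julia
using StaticArrays

# SizedArray lifts the length into the type, so mismatched sizes are
# caught from the types alone rather than discovered deep inside a model.
x = SizedVector{3}(rand(3))
y = SizedVector{3}(rand(3))
x + y                      # fine: sizes agree at the type level

z = SizedVector{4}(rand(4))
# x + z                    # errors: incompatible static sizes
```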

This thread proves to be an interesting read so far. I’m more of a silent observer here and don’t have any insightful comments on the technical details… but I’d like to point out certain implicit assumptions which have framed the discussion so far, and add my two cents.

The discussion so far has largely assumed “ML” to be synonymous with “deep learning” (and fwiw, has not even touched on deep RL that much). I don’t know whether that was intentional, or simply an oversight. Whether this is the appropriate framing depends on the target audience/applications under consideration.

  1. Despite deep learning generating a lot of buzz, boring classical ML is 10x more common in applications (especially given that it’s far easier to understand & debug, and typically needs much less data than deep learning). There’s probably a very large audience of programmers who stay out of these discussions and silently use scikit-learn (or the like) to apply ML to small/medium problems all over the place. This kind of programmer likely seeks a combination of stable and ergonomic APIs, thorough documentation, and lots of examples (not to mention a wide variety of models they can throw at their problem).

  2. Even in the context of deep learning, the most important thing for the “average user” is likely a large compendium of pre-trained models, plus tutorials for how to use them and hit the ground running when applying them to some target problem that is a mild variation.

  3. There are many other interesting research areas in ML which are under active investigation (but with less hype) – some of which might prove to be of supreme importance a few years down the line. Unless that is a digression from the intended theme, I would love to see a broader discussion of applications and whether Julia can empower them in special ways (eg: probabilistic programming, graphical models, causal reasoning, reinforcement learning, auto-ML, etc. The examples stated are, of course, biased/limited by my background.)

  4. One magic area for Julia is how easy it is to do modeling (either with differential equations or otherwise), including things like propagating uncertainties or calibrating/tuning parameters (via optimization). While this looks very different from deep learning on the face of it, if we look past ML/DL (which are just tools to reach a more fundamental goal) towards the applications they are being used for, we can expect systems built this way to perform better along multiple dimensions, compared to NNs trained “end-to-end”. I have anecdotal experience to this effect, but more examples/demonstrations of this would help evangelize this approach to building systems, and I expect that Julia could help make this under-appreciated approach more ubiquitous.

  5. I’m out of my depth here, but I’m really curious about the potential for “non-standard interpretations”. While not technically “ML” in that no “model” is being “trained on data”, this has the potential to enable really powerful/generic code (“intelligent” programs). For a flavor, see some recent work by Tom Minka and Conal Elliott. Chris Rackauckas’ recent posts contain some nuggets in the SciML context, but I’d love to see a more cohesive perspective fleshing this out.

18 Likes