State of machine learning in Julia

I completely agree with what @patrick-kidger and @jgreener64 said. Julia indeed has huge potential for machine learning, but its current state is a little bit mixed. Personally, coming from climate science and wanting to use SciML as a tool for my research, I’m left with mixed feelings. Some developers/researchers have a super solid background in computer science, and/or can afford to spend a lot of time doing dev work. For others, like me, this is only a part of the job, and we could use a little more user-friendliness. I understand that this is also a consequence of the novelty of many of these libraries and methods, but I often struggle to find the necessary information in the documentation, and errors are often cryptic and hard to debug.

More specifically, the main reason I’m sticking with Julia for SciML is because the DifferentialEquations.jl library is top notch. It works super well, and I haven’t found anything similar in Python. However, it’s the AD part that is becoming a true pain for my research. I recently started a similar open discussion about the state of differentiable physics in Julia, which also highlighted some of the current limitations (and strong points) of the AD ecosystem in Julia. Since I started working with Julia, I’ve had two bugs with Zygote which have slowed my work by several months. On a positive note, this has forced me to plunge into the code and learn a lot about the libraries I’m using. But I’m finding myself in a situation where this is becoming too much, and I need to spend a lot of time debugging code instead of doing climate research. Moreover, the documentation of both Zygote and Flux is pretty small, and I already found myself making a PR to add some extra comments to the Zygote documentation, because as a newcomer to the library I felt completely lost at the beginning.

I guess all this will be fixed with time, as new people join the community and the libraries become more mature. I still think Julia is the best choice for SciML, but more care should be taken in making these libraries (and their documentation) more user friendly. Otherwise I can totally understand that a large pool of potential users gets scared away. Just my two humble cents.

25 Likes

Out of curiosity, what were these bugs?

2 Likes

The first one was extremely simple, but it was super hard to debug. Basically Zygote couldn’t provide a gradient for the sqrt function, since I was applying it to a matrix with zeros in it. Here is a GitHub issue with more details. Irrespective of the technicalities, sqrt is such a common function that having a bug in it will surely impact a large number of users.
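To illustrate why zeros are awkward for sqrt in reverse mode, here is a minimal sketch in plain Julia with no AD package. It is not taken from the linked issue and may not match its exact mechanism; `dsqrt`, `A` and `upstream` are names made up for this example. The derivative rule 1/(2√x) is infinite at zero, and when the chain rule multiplies that by an upstream gradient of zero, IEEE arithmetic gives NaN.

```julia
# Hand-written derivative rule for sqrt: d/dx sqrt(x) = 1 / (2 * sqrt(x)).
dsqrt(x) = 1 / (2 * sqrt(x))

# Illustrative matrix containing zeros (assumed data, not from the issue).
A = [0.0 4.0; 1.0 0.0]

# Elementwise local derivatives: Inf wherever A is zero.
local_grad = dsqrt.(A)

# Reverse-mode chain rule: upstream gradient times local derivative.
# Where the upstream gradient is zero and the local derivative is Inf,
# the product 0 * Inf is NaN.
upstream = [0.0 1.0; 1.0 1.0]
grad = upstream .* local_grad
# grad is [NaN 0.25; 0.5 Inf]
```

A single NaN or Inf like this is enough to poison every parameter update downstream, which is part of why such a small case can be so hard to track down.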

The other (and current) one is more obscure, and still under investigation. For some reason Zygote is giving me gradients that are all zero (which should probably just error), while another AD library (ReverseDiff) is working. The problem is that I need Zygote for my model to be able to backpropagate with acceptable speeds. ReverseDiff is just too slow for my case.
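One way to catch silently wrong gradients without depending on a second AD package is a finite-difference check. The sketch below uses only Base Julia; `fd_gradient` is a hypothetical helper written for this example, the toy `loss` is an assumption, and `g_ad` is a stand-in for a suspicious all-zero AD result rather than real Zygote output. Packages such as FiniteDifferences.jl offer more careful versions of the same idea.

```julia
# Hypothetical helper: central finite-difference gradient of a scalar function.
function fd_gradient(f, x::AbstractVector{<:Real}; h = 1e-6)
    g = zeros(Float64, length(x))
    for i in eachindex(x)
        step = zeros(Float64, length(x))
        step[i] = h
        g[i] = (f(x .+ step) - f(x .- step)) / (2h)
    end
    return g
end

loss(x) = sum(abs2, x)     # toy loss; its exact gradient is 2x
x = [1.0, -2.0, 3.0]

g_fd = fd_gradient(loss, x)
g_ad = zeros(3)            # stand-in for an all-zero gradient returned by AD

# A large mismatch suggests the AD result is wrong, not that the loss is flat.
suspicious = !isapprox(g_ad, g_fd; atol = 1e-4)
```

This costs two loss evaluations per parameter, so it is only practical on a small slice of the model, but that is often enough to tell which AD result to trust.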

These bugs are problematic for the average user because (1) Zygote is pretty hard to debug, and (2) they require very specific skills which only the library developers have. Luckily, everyone in the Julia community is extremely helpful and nice. But I really wish I could handle more things by myself just by reading more complete documentation or by getting more meaningful errors.

7 Likes

We rely heavily on Julia for our Differential Equations work. But these issues have driven us back to Python for some of the ML stuff we have started doing.

These things impact other libraries as well. Julia needs to start taking these issues seriously by providing library authors with tools that address these issues at the language level.

16 Likes

As someone mostly busy with a day job that restricts OSS dev I feel like I don’t have much standing to express what I do below, and it might not be well received. On the other hand, I also really love Julia and its amazing community. I am personally invested in seeing it succeed and thus would like to explore what I perceive as another important facet to some of the issues raised. I’m open to pushback or even being ignored.

Julia’s genius was that if you restrict dynamism a bit while having really clever design, you can keep the vast majority of what you like about Python, get a lot of extra composability and have something that’s much easier on the compiler. Almost a Pareto-optimal situation.

This balance was developed in an era before deep learning, autodiff and research into next generation static FP programming languages, which both preserve more static information and improve usability and ergonomics. Dynamic typing has been stripped to its essence, and it’s very debatable whether it’s inherently better at all, much less with the trade-offs it brings when attempting full-language differentiable programming. A time horizon of years has been thrown around for when Julia will be fast enough in general at GPU + DP that fast performance on prosaic use cases falls out of that. That might be optimistic, and we also see that there are correctness issues, which is even worse.

Now the demands on a language are higher and the tradeoff space is different. We have dependent typing, type inference, effect systems, static languages with REPLs. We can have a language that encodes more static information, with a net improvement in usability for modern applications.

A language like Dex exhibits these. I’m concerned that while Julia chases asymptotically approaching the promise of full language, correct, fast DP, a language with better tradeoffs like Dex will get there first, while preventing lots of bugs in performance and correctness that Julia hasn’t even begun to address.

Here’s an excerpt from the Dex paper:

We feel that Dex gained a great deal as a language from being co-designed with its automatic differentiation system. AD is something like a very demanding user of the language - it is always trying to write programs the compiler developers did not anticipate, and always producing compelling bug reports or feature requests when those programs do not work or are slower than they should be. In this section, we discuss a few specific subtleties in the design of Dex’s AD, and the effects AD has had on the rest of the language and compiler.

This perfectly describes the situation for the last five years.

I’ll repost something I said on slack:

Certainly the devs are spread thin, but debugging low level IRs and fragile compiler heuristics is always going to be more difficult than relying on static performance guarantees, as Dex demonstrates for its reverse mode gradient function in the first image.

Brian said it best:

“In fact, Julia’s broadcasting is a good example of how a simpler, more composable interface can do the job of more complex, numerous and edge-case prone specialized compiler machinery. Making your loss function run fast and not allocate a ton without plugging in something on the level of XLA is still a tall order, and $DEITY help you if said function hits a non-optimized path in XLA too. Even in Julia land, we still don’t have a stable solution for fused broadcasts in a GPU-friendly reverse mode AD.”

Unless some sort of static system is introduced, I fear we’re always going to be chasing down not-so-corner cases in performance (computational complexity, memory and parallelism) and correctness (Dex can guarantee these). I know it’s against Julia’s ethos, but it seems important for ML, AD and the accelerator world. Otherwise I fear there’s a structural constraint, and this will take an inordinate amount of work, which is what you imply when you say it will take years to get arbitrary code fast.

Maybe this semi-static plan is the answer? How can we enforce the right semantics and have them propagate across functions? If we just say “well, it’s in the code, and sometimes it will all just fit into place and infer, and sometimes it won’t and you’ll get Any”, that problem explodes when we’re talking about AD, composability and GPU codegen. It’s like playing whack-a-mole. It’s no longer just about inlining and unboxing. We have correctness, computational complexity, accelerator codegen, parallelism etc. to worry about now.

We’re in 2022 and Conv still isn’t type stable: Flux.Conv type instability · Issue #1178 · FluxML/Flux.jl · GitHub
And it’s not a trivial fix.

Dex is still a research project, and it certainly hasn’t proven that it could provide a solution to everything, but it feels like it’s going in the right direction. Even in a hypothetical situation where the e-graph passes work well, in Dex everything is typed index sets, effects and loops, so it doesn’t need as many compiler heuristics.

It hasn’t proven itself capable yet of taking these inlined loops and generating fast accelerator code, but the MLIR project has plenty of people working on that and I think it should get there with time.

What about Julia’s strength? Well, if you read Rethink overloading · Issue #671 · google-research/dex-lang · GitHub, one of the Dex devs acknowledges that Julia’s secret composability sauce is the combination of subtyping and pervasive multiple dispatch. It remains to be seen whether Dex can provide a similar effect. If Julia never gets fast and correct enough, I don’t think the marginal benefit will matter, as ML languages aren’t slouches there either.

So, where does that leave Julia for the future? I think to break out of its niche use case, it has to be able to somehow make all this development intrinsically more tenable, especially if it lacks FAANG resources. It’s certainly not possible to dramatically shift the semantics of the language, but people are exploring ways to have opt-in static features (https://twitter.com/KenoFischer/status/1407810981338796035 and https://news.ycombinator.com/item?id=26136212 and JET.jl, and Jan Vitek’s group).

I wonder if this could all come together into a coherent story so we could have a language for ML that’s correct, ergonomic (including good error messages) and fast enough for both DP and prosaic ML, with type safety AND the composability of Julia. Why should someone prefer that over Python if they are just stacking layers of fat linalg? Well, that’s where DL is now, which could change. As Chris mentioned, there is arguably a Sapir-Whorf effect keeping it there.

Also, all the non ml stuff like data handling, munging, viz etc is much much more pleasant in Julia. (Pandas makes me :frowning: )

I’m rooting for Julia to become more prevalent in general ML/DS, but the situation is different now.

Not sure how hopeful I am at this point. I’ve seen very little acknowledgement of these structural technical issues. Is that a Kuhnian-like ossification, which is an understandable part of normal human epistemology, or am I just totally wrong here? I’m very open to and would prefer the latter.

(Please forgive the somewhat ad hoc and not ideally organized/proofread response. I fired this off very quickly and the heat is broken )

25 Likes

I think it is important to note that we can already use explicit parameters with Flux/Zygote/Optimisers.jl. Simply put, use the model as an object to be differentiated. This will return a NamedTuple which can be consumed by the optimisers in Optimisers.jl. I have been blocked on making them standard in Flux for some time now, but maybe we just take the jump and merge whatever needs to be merged to make it happen. I am certain that any issues with Optimisers.jl that may come up (in-place updates come to mind) can be solved with existing PRs to bridge that gap.
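A minimal sketch of that explicit-parameter workflow, assuming a recent Flux release (with the `in => out` layer syntax) and Optimisers.jl are installed. The model, data and loss are illustrative choices for this example, not anything from Flux’s own docs or tests.

```julia
using Flux            # assumes a recent Flux release
import Optimisers     # imported, not `using`, to avoid name clashes with Flux exports

model = Dense(3 => 2)                 # small illustrative model
x = rand(Float32, 3, 8)               # made-up input batch

# Optimiser state is built from the model's structure, not from implicit Params.
state = Optimisers.setup(Optimisers.Descent(0.1), model)

# Differentiate with respect to the model itself; the gradient mirrors its fields.
grads = Flux.gradient(m -> sum(abs2, m(x)), model)[1]

# Out-of-place update: returns the new optimiser state and the new model.
state, model = Optimisers.update(state, model, grads)
```

Here `grads` is a NamedTuple whose fields follow the layer’s own fields, so nothing has to be looked up in a global parameter collection.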

Re Conv: I’m pretty sure this is a regression, and something that will need to be addressed again in the future. The design hasn’t really been changed in the PR either; it just makes things easier on the compiler. Note that we are mostly tracking improvements in compile times there.

1 Like

I think the difference in opinions shared here reflects the larger difference in mindset. @ChrisRackauckas reflects the scientific/research mindset and @patrick-kidger reflects the engineering mindset. I think where Julia really shines is in scientific and research applications, while Python shines on the engineering side of things. Even when many say Julia doesn’t have mature packages in certain ML domains, they are missing the existence of Julia APIs for Python packages and the PyCall library. Maybe we need to popularize the PyCall-based libraries a little bit more to show users that they are not missing their favorite mature Python libraries in Julia. Moreover, as far as I understand, the Julian philosophy is not to reinvent everything that has already been developed in other languages, but to design and develop tools that do not exist yet, tools for the future.

2 Likes

Well, that’s part of the whack-a-mole dynamic I mentioned. Here it affects mostly compile times (probably because a static block of code is guaranteed in cudnn, but I haven’t specifically checked), but that’s just an incidental benefit of this codepath, which is a best case scenario.

My broader point is that this is not really the “fault” (I don’t like that word because it can connote a moral valence which doesn’t exist here), of the flux devs, but is something to be expected with the language design and problem domain.

Further regarding the benefits of dynamism in Julia, I was very kindly pointed to the following excerpts from Jeff’s thesis:

2 Likes

Cannot agree more. I wasted a lot of time debugging Zygote, both when using it in my own research and after recommending it to my friends (since I feel obligated to debug code for them). I did not have a similar experience when I was using PyTorch. One of the main reasons why Zygote is so unreliable is that it uses generic rules, and with complex numbers it does not raise an error when the AD rule is wrong. I no longer believe generic autodiff can provide reliable gradients. Autodiff rules must be concrete.

14 Likes

A lot of this discussion is enlightening to read. I don’t have much experience as a deep user of the Julia ML stack, more as an observer. I’ve used a lot more of the Python stack because of external requirements. It is interesting to see how things develop here.

One thing I would recommend to anyone fretting about the Julia ML/DL space is to listen to the PyTorch Dev podcast. A good example episode is the one about meta-tensors and structured kernels. I would recommend this because:

  1. The episodes are ~15 minutes long, technical but not too complex, and relatively entertaining (Edward is good at discussing his thought process and providing color)
  2. It shows the engineering effort and considerations that have gone into PyTorch… it’s a lot. There are 56 episodes that are each as technical and gory and fun. In the first 10 episodes Edward talks about 5 or 6 rewrites of large components of PyTorch. The shownotes have links to issues and PRs as well - here’s the RFC for meta-tensors to enable sharing of some CPU and GPU code. There is also an episode about how they support their ~2200 kernels and have to design systems to reduce dev complexity.
  3. I’m not totally sure about this, but I think this is a possible on-ramp to enabling newer people to be more capable contributors to the internals of AD and ML systems. Julia faces a lot of the same issues that PyTorch does/did, so there’s a lot we can learn from them. There are a lot of places where the PyTorch devs went to great lengths to engineer systems to handle things that Julia handles almost trivially; there also seem to be some examples in the other direction.

One take-away is that PyTorch is not simple. We hope to be able to do more with less because many of us believe that Julia is more productive and cohesive (?) than Python plus C++. We will face a lot of the same issues they had and have some new ones ourselves. I think near and far horizon topics like compiler extensibility and native code-caching are also positive to the Julia ML scene, as they improve the dev and language user experience. Then would it just be a question of engineering time?

28 Likes

Incidentally, there is now a parallel discussion – about this thread – on Hacker News:

https://news.ycombinator.com/item?id=29902846

Interesting reading, I think, to get the opinions of those outside of the Julia-sphere. (As per also my answer to Q4.)

9 Likes

Please do not link directly to Hacker News, as they consider that to be brigading.

2 Likes

I love Julia and it’s my first choice whenever possible. However, as soon as it comes to ANNs, I’m afraid I still turn to Python/PyTorch. I would prefer to use Julia instead, but figure it might be useful to share the downsides I’ve encountered for “mainstream” machine learning models.

Examples of state of art models are not readily available

For example, consider GPT-2 from 2018, ages ago in ML land :wink: . If one searches on google for “julia gpt-2” the first result is a blog post by someone named Julia describing a python implementation. I mention that partly in jest, but it’s an apt summary—even when explicitly looking for Julia implementations of SotA models, you’ll probably find a Python implementation first.

If you dig a little deeper, you’d find Transformers.jl, which does indeed have a GPT-2 implementation. But now try to find a Julia implementation of Visual Transformer, Longformer, Linformer, Compressive Transformer, RoBERTa, etc. One could readily find multiple example repos in Python, typically including reference implementations from the authors or a major project.

Reference models are often broken

For example, the reference VGG implementation for Flux.jl was broken for around two years when run on Nvidia GPUs until mid-2021. This is slightly unfair as there were a ton of changes in Flux in this period, including the transition to Zygote. Partly, I think it’s a reflection that there’s still a lot of research and experimentation as to the best way to do AD in Julia. Hopefully, this will continue to standardize around best practices and get more robust over time.

Memory usage and speed are worse

I don’t want to dwell on this too much because my impressions may be out of date. I’ve tried porting over some large ResNet-style convolutional neural networks / VAEs to Flux that operate on giant 3D movies. I haven’t been able to run equivalently sized models with Flux vs PyTorch, although perhaps this has changed recently.

Performance benchmarks are hard to come by

A number of folks have made some really nice benchmarks to compare Flux / PyTorch / Tensorflow, but I’m not aware of any that are regularly maintained. So even as someone that’s eager to use Julia for ML since all my other code is in Julia already, it’s hard for me to assess if the ecosystem can meet my research needs without diving in to code up a benchmark.

Some useful recent benchmarks include:

  • https://github.com/avik-pal/DeepLearningBenchmarks/tree/update (Feb 2020; Flux within 0.5-1x of PyTorch for common layers)
  • Why is flux model slower than python? (Jan 2021; Flux 0.5x of PyTorch for VGG19)
  • Julia slowdown on long running programs with many allocations (June 2021; Issues with memory growth over time; solvable by calling garbage collector)
  • Allocation of Memory while evaluate a model (Nov 2021; Memory is allocated each time a model is run)

I think these highlight common pain points for a PyTorch developer that would consider switching to Flux.
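For readers who hit the memory-growth item above, the garbage-collector workaround can be sketched as follows. This is a minimal illustration using only Base Julia: `train_step!`, `train_loop` and the `gc_every` interval are hypothetical names and values, and a real loop would call into the model instead of filling a buffer.

```julia
# Hypothetical stand-in for one training iteration that allocates temporaries.
train_step!(buf) = (buf .= rand(length(buf)); sum(buf))

function train_loop(n_iters; gc_every = 100)
    buf = zeros(1_000)
    total = 0.0
    for i in 1:n_iters
        total += train_step!(buf)
        # Force a full collection every `gc_every` iterations (assumed interval).
        i % gc_every == 0 && GC.gc()
    end
    return total
end

total = train_loop(300)
```

Forced collections cost time, so the interval is a trade-off between peak memory and throughput and has to be tuned per workload.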

There has been, and continues to be, major, major progress in the Julia ML ecosystem, and there is a ton of cool stuff that can practically only be done in Julia (I’m looking at you SciML!). And it’s clear that Julia has an awesome / arguably the best skeleton for ANNs: Where we are headed and why it looks a lot like Julia (but not exactly like Julia) - compiler - PyTorch Dev Discussions. My sense is it’s largely a matter of having enough dev time. Google and Facebook have a ton of engineers working on their frameworks, and the Flux team has been disproportionately productive all things considered. I would think that companies allocating more resources to the ecosystem could really accelerate adoption.

23 Likes

Perhaps SizedArray from StaticArrays.jl may be useful here. On one hand, you are encoding the size of the array in the type. Thus array size errors will occur at compile time. On the other hand, using this may add compilation latency due to the need to compile functions for each array size.

This thread proves to be an interesting read so far. I’m more of a silent observer here and don’t have any insightful comments on the technical details… but I’d like to point out certain implicit assumptions which have framed the discussion so far, and add my two cents.

The discussion so far has largely assumed “ML” to be synonymous with “deep learning” (and fwiw, not even touched on deep RL that much). I don’t know whether that was intentional, or simply by oversight. Whether this is the appropriate framing depends on the target audience/applications under consideration.

  1. Despite deep learning generating a lot of buzz, boring classical ML is 10x more common in applications (especially given that it’s far easier to understand & debug, and typically needs much less data than deep learning). There’s probably a very large audience of programmers who stay out of these discussions, and silently use scikit-learn (or the like) to apply ML to small/medium problems all over the place. This kind of programmer likely seeks a combination of stable and ergonomic APIs, thorough documentation and lots of examples. (not to mention a wide variety of models they can throw at their problem)

  2. Even in the context of deep learning, the most important thing for the “average user” is likely a large compendium of pre-trained models and tutorials for how to use them and hit the ground running when applying on some target problem which is a mild variation.

  3. There are many other interesting research areas in ML which are under active investigation (but with less hype) – some of which might prove to be of supreme importance a few years down the line. Unless that is a digression from the intended theme, I would love to see a broader discussion of applications and whether Julia can empower them in special ways (eg: probabilistic programming, graphical models, causal reasoning, reinforcement learning, auto-ML, etc. The examples stated are, of course, biased/limited by my background.)

  4. One magic area for Julia is how easy it is to do modeling (either with differential equations or otherwise), including things like propagating uncertainties or calibrating/tuning parameters (via optimization). While this looks very different from deep learning on the face of it, if we look past ML/DL (which are just tools to reach a more fundamental goal) towards the applications which they are being used for – we can expect systems built this way to perform better along multiple dimensions, compared to NNs trained “end-to-end”. I have anecdotal experience to this effect, but more examples/demonstrations of this would help evangelize this approach to building systems – and I expect that Julia could help make this under-appreciated approach more ubiquitous.

  5. I’m out of my depth here, but I’m really curious about the potential for “non standard interpretations”. While not technically “ML” in that no “model” is being “trained on data”, this has the potential to enable really powerful/generic code (“intelligent” programs). For a flavor, see some recent work by Tom Minka and Conal Elliott. Chris Rackauckas’ recent posts contain some nuggets in the SciML context, but I’d love to see a more cohesive perspective fleshing this out.

19 Likes

This thread inspired me to try getting my hands dirty with ML in Julia and stream myself implementing MNIST classification from scratch in Julia as a noob. Intended as a sort of user experience report a la “Don’t Make Me Think”.

Video: shitty ML livestream 1: MNIST classification in Julia - YouTube

Of course, my (curmudgeonly) commentary was all off-the-cuff. In hindsight my one addendum would be that the logsoftmax issue makes sense to me now. I don’t really see that as being an issue.

13 Likes

There are some very interesting points brought up in this thread, and several points we are aware of as developers of the ecosystem. It is good that we are separating the concerns for “conventional” DL and SciML, since the challenges in the two fields are very different and can sometimes be at odds with each other.

For “conventional” DL, the good news is that while the ecosystem is maturing, performance issues are being rapidly addressed. The biggest missing feature is the ability to do full program analysis on the backwards pass, which projects such as AbstractInterpreter are all about. EscapeAnalysis.jl and others will let us actually optimise on the program, even if the program itself was generated with relatively low level IR. As Chris mentioned, one of the goals of Diffractor is also to make such tooling possible for us to use to produce better code, but the AD itself is not really interesting to conventional large transformers, and Diffractor is unlikely to affect training performance there. It will help us avoid unnecessary allocations etc. So in that way, it isn’t really Zygote which is to be blamed, rather a missing optimisation pass after the backwards pass is generated that we need to optimise it.

The other parts of the concern lie with documentation, available models, tutorials, benchmarks, data handling and distributed training. The story here is different and can actually be improved with many small steps. We have put together benchmarks to track performance at https://speed.fluxml.ai. Admittedly I am still trying to push an update to the benchmarks so they update regularly with changes in the ecosystem, but the systems are in place and up and running. Reports of sharp edges in the documentation, and new tutorials, are always welcome! Having said that, it can be jarring to come from PyTorch/TF and find that there aren’t as many helper utilities which seem to come up most often. In my view, more than API docs, we need to document usage patterns. This is something we should improve on.

On the data handling subject, I agree we need to do better; it’s one of the areas where I feel Julia has a lot of potential. It is true that Julia packages don’t usually act as monoliths and therefore reaching out to tiny obscure packages for loading data seems daunting, but projects such as DataSets.jl (see https://github.com/DhairyaLGandhi/ResNetImageNet.jl which combines it with Flux for distributed training alongside DaggerFlux) show how flexible this can be, especially as we keep in mind the DP cases that Flux handles. I think what we need are motivating examples and higher level functions to bring these together. This is very different from how Julia packages usually work in terms of composability, but it may be worthwhile to point users to the different patterns they can use for different needs. Popular cases would involve loading and preprocessing images and textual data, for which we can have default implementations. We had a function Metalhead.preprocess to do exactly that, so it’s likely worthwhile to bring something like it back.

The models in Metalhead have recently been updated, and we intend to host the pretrained weights again shortly too. This is delayed, but something that definitely is on the priority list. There are several people working on it to get it right. We also support loading transformers in via huggingface and Transformers.jl. We still need to write the code for more standard transformers. Any help on that front would be very dearly appreciated. The community is always forthcoming to those willing to extend the ecosystem. Having said that, there are several specialised pretrained models available including YOLO, as well as some pretrained transformers.

On the more philosophical side: is Flux interested in “conventional” DL? Absolutely. Do we see Flux being used for production cases? Yes. There are known areas of improvement in this sphere in the larger ecosystem for sure, but work is ongoing.

For SciML, Chris mentioned several cases that our tooling is working towards. This also includes explicit parameter based models, which Optimisers.jl supports. However, incorrect gradients are not great. Please make sure to open relevant issues. We have set up several instances of reverse CI to test the ecosystem better.

16 Likes

We (metalenz.com) use Julia for differentiable physics, which we take a step further by providing an engineer-friendly interface to model specifications for solving and optimizing PDEs. Think COMSOL or Simulink, with gradients for physical model parameters applied to user-defined objectives and constraints. With respect to a “traditional deep learning” approach, this is roughly equivalent to us needing to build out custom kernels and a custom way to specify models.

Where does Julia shine?

Along the spectrum of ML-like domains Chris laid out, our typical codes involve heavy numerics and PDE computations, where we also see opportunities in algorithm development (math) that remain untapped. Julia is great for this - we are able to write custom reverse passes for code which are more stable and exploit our problem structure, efficiently prototype new algorithms, and (if the algorithm permits) deploy on GPU with minimal effort.

We’ve also had to use higher-order autodiff, or written mixed-mode passes. As with any researchy problem, we did not know going into this that we would need these features - but I feel that Julia allows the flexibility for a sophisticated but small team to be massively productive, and I never have the concern that the next idea would hit a wall of impenetrable C++ code.

Where is the Julia ML ecosystem currently inferior?

As many commenters have pointed out, if you are trying to apply off-the-shelf models to a known problem, or using standard kernels, it doesn’t make sense to use Julia (and when that has been the case for us, we reached for Python).

What important experiments and benchmarks should we be tracking?

I agree that Python serves the traditional DL use-cases for the majority of users. But because we are aware of Julia’s competitive advantages, it makes sense to work towards accentuating and furthering them. At the same time, improving our performance by incorporating ideas from XLA or others will have knock-on benefits throughout the ecosystem. This is the point Chris makes.

I’ll add on that we do have some problem domains which benefit from differentiable in-place linear algebra code (his “missing middle”). I suspect we are rare in this, and though there are ways to mix the styles or provide hand-crafted reverse passes, making this seamless and correct would be fantastic.

What packages do you use and which packages do you wish existed?

Since the question of model specification came up in Chris’s reply, I’ll mention Functors.jl. Models in general can be thought of as existing in some vector space, and in our case such models may be naturally specified heterogeneously (some parts exist as scalars, others as vectors, others as custom types) though they map to a vector. It is powerful to specify the parameters you want to differentiate w.r.t in some model and have the gradients come out as the tangent to your model - all while staying faithful to how the model was represented in the first place. This builds on the excellent work of the ChainRules team, who have clearly thought hard about what’s really going on.

As for what I wish would exist - I mirror others’ sentiments that easier debugging or introspection of autodiff would be helpful. Really, Julia has made it so that the hard parts of custom autodiff are not necessarily the numerics, but knowing how to properly populate or use the data structures which constitute the way you want to reason about the problem.

25 Likes

I found a similar thread from 2019:

4 Likes

There are tons of pretrained models with PyTorch. It makes it very easy to deploy models.

I really love Julia. But it is easier to imagine using Julia for analysis in a notebook or for some simulation than for an NLP model with BERT that I may later have to deploy on mobile.

On a recent project, I just had to load a BERT from HuggingFace and add a few extra layers with Tensorflow. It is something like 30 lines of code. The tokenizer is included and everything related to BERT is loaded from HuggingFace.

I can choose among tons of pretrained BERT variants: BERT Mobile, DistilBERT, RoBERTa, Tiny BERT, BERT Large. For a project, I tested several models to find the most lightweight model that is “good enough”. It was just one line to change in the code.

The model is currently running on AWS. It is easy to deploy. We just need to save the configuration file and the vocabulary file of HuggingFace locally.

Then, this part is quite recent, but both solutions offer mobile deployment with React Native: Tensorflow.js or PyTorch Live. I tested Tensorflow.js. Neither HuggingFace nor Keras offers a JavaScript tokenizer for BERT, but it is not the most complicated part and I found an implementation on GitHub. I will also test PyTorch Live.

Here are some of the actors that offer such an “easy” process, with plenty of pretrained models to test:

  • Keras + TensorFlow Hub (TensorFlow only) maintained by Google Engineers
  • PyTorch Lightning + PyTorch Hub (PyTorch only), community oriented, with the possibility to contribute models.
  • HuggingFace (NLP with PyTorch/Tensorflow/JAX/ONNX…), a French private company that is now a worldwide leader in NLP and offers a collaborative platform. You can download any model and dataset directly from your code, and upload your models.
  • OpenMMLab (CV with PyTorch), a community essentially backed by Chinese developers. Even easier to use. You can deploy any state-of-the-art computer vision model just by editing a configuration file.

What I would really like to see in Julia is a platform of PyTorch + TensorFlow + Flux pretrained models and state-of-the-art implementations that are easy to deploy in projects.

  • Implementations of State of art models in Julia with a few lines of code
  • Easy to edit configuration files
  • Collaborative database of pre-trained models
  • Collaborative database of datasets (for good measure)

It could be HuggingFace and OpenMMLab implementing things in Julia. The issue is that both have built a full set of libraries that are exclusively in Python, so it would be a huge project for them. What would be the incentive?

It could be a similar Julia platform dedicated to deep learning. Such a platform needs to be backed by some strong organisation. I don’t know if Julia Computing is interested in pushing in the direction of deep learning.

I guess it is just a letter to Santa, but this is something that would make me switch to Julia for deep learning.

20 Likes