This is a great and pragmatic approach.
If you target production, this is the approach to take.
Being a wrapper around core primitives is one design goal of Flux. Additionally, that no longer comes with the same performance hits as before. If we do see regressions or performance concerns, we try to resolve them quickly, so Zygote shouldn’t have much issue with performance, barring cases where it’s harder for Julia to actually optimise the differentiating code. This is partly what Diffractor would address, so we should see Flux get faster still. Underneath, the two share a lot of infrastructure, as we see. You shouldn’t see too much difference for models such as YOLO, especially for production use cases.
Beyond that, optimizations around specific forward passes/ pullbacks etc are always welcome.
Let me show you an example where Yota and Zygote behave differently:
    using Zygote
    using BenchmarkTools

    foo(A) = sum([x + 1 for x in A])
    A = rand(10_000);
    @btime foo'(A);
    # ==> 106.426 μs (45 allocations: 939.03 KiB)
Zygote does a good job differentiating through array comprehensions, but that hides a performance issue: the same function can be written much more efficiently with broadcasting:
    foo2(A) = sum(A .+ 1)
    @btime foo2'(A)
    # ==> 7.989 μs (3 allocations: 78.23 KiB)
Yota intentionally doesn’t support things like array comprehensions:
    using Yota
    using BenchmarkTools

    foo(A) = sum([x + 1 for x in A])
    A = rand(10_000);
    @btime grad(foo, A)
    # ==> ERROR: MethodError: no method matching var"#1#2"()
    # ==> ...
    # ==>  foo at ./REPL:1 [inlined]
So you have to look at foo() and realize this is not what Yota expects. You go and rewrite it to foo2(), which works fine:
    foo2(A) = sum(A .+ 1)
    @btime grad(foo2, A);
    # ==> 14.151 μs (22 allocations: 157.02 KiB)

(Note that here Yota is slower than Zygote due to constant overhead, which is negligible in real ML models.)
Surely, it would be better for both libraries to show warnings or even rewrite such cases automatically, but we are not there yet.
so Zygote shouldn’t have much issue with performance, barring cases where it’s harder for Julia to actually optimise the differentiating code
Note that putting restrictions on supported code opens the door to optimizations beyond what the compiler can do. Avalon/Yota expect ML models to be pure computational graphs without side effects. Such graphs can be transformed in many different ways, e.g. by eliminating common subgraphs or replacing known primitives with their in-place versions. As far as I know, doing the same thing for pullback-based AD is way harder.
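To make "eliminating common subgraphs" a bit more concrete, here is a minimal sketch on a toy tape. The `Op` type and the `eliminate_common` function are hypothetical names invented for this illustration; they are not Yota's actual data structures or API.

```julia
# A toy "tape": each op records its function and the indices of its inputs.
# Hypothetical illustration only, not Yota's real tape representation.
struct Op
    fn::Symbol          # e.g. :input, :*, :+, :exp
    args::Vector{Int}   # indices of earlier ops on the tape (empty for inputs)
end

# Build a new tape in which ops with identical (fn, args) are computed once.
function eliminate_common(tape::Vector{Op})
    seen  = Dict{Tuple{Symbol,Vector{Int}},Int}()  # (fn, args) => index in new tape
    remap = Dict{Int,Int}()                        # old index => new index
    out   = Op[]
    for (i, op) in enumerate(tape)
        args = Int[remap[a] for a in op.args]      # rewrite inputs to new indices
        key  = (op.fn, args)
        if op.fn != :input && haskey(seen, key)
            remap[i] = seen[key]                   # reuse the earlier result
        else
            push!(out, Op(op.fn, args))
            remap[i] = length(out)
            if op.fn != :input                     # inputs are always distinct
                seen[key] = length(out)
            end
        end
    end
    return out, remap
end

# y = exp(a * b) + exp(a * b), recorded naively with the subgraph duplicated
tape = [
    Op(:input, Int[]),   # 1: a
    Op(:input, Int[]),   # 2: b
    Op(:*,   [1, 2]),    # 3: a * b
    Op(:exp, [3]),       # 4: exp(a * b)
    Op(:*,   [1, 2]),    # 5: a * b        (duplicate of 3)
    Op(:exp, [5]),       # 6: exp(a * b)   (duplicate of 4)
    Op(:+,   [4, 6]),    # 7: result
]

optimized, remap = eliminate_common(tape)
```

This works only because the tape is assumed to be pure: if any op had side effects, merging two calls into one could change the program's behavior. The sketch also does not canonicalize commutative ops, so `a * b` and `b * a` would not be merged.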
Thank you for the package and the explanation of its design ideas. I like it.
I have a question: is it possible to do transfer learning with the package? Is there an API (or a simple way) to do that?
Yes, definitely, Avalon was created with transfer learning in mind. However, whether the package is suitable for your tasks right now depends on your expectations. What you can already do is to train one model and use it as a field in another model (models are just Julia structs). Roughly speaking:
    model_a = ModelA()
    fit!(model_a, some_data)

    mutable struct ModelB
        model_a::ModelA
        linear::Linear
    end

    function (m::ModelB)(x::AbstractArray)
        y = m.model_a(x)
        y = m.linear(y)
        return y
    end
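For readers who want to run the idea without installing anything, here is a self-contained sketch of the same composition pattern in plain Julia. The `Linear` and `ModelA` definitions below are hypothetical stand-ins written for this example (they are not Avalon's layers), and the training step is omitted.

```julia
# Hypothetical stand-in for a dense layer; NOT Avalon's Linear.
struct Linear
    W::Matrix{Float64}
    b::Vector{Float64}
end
Linear(dims::Pair{Int,Int}) = Linear(0.01 .* randn(dims.second, dims.first), zeros(dims.second))
(l::Linear)(x::AbstractArray) = l.W * x .+ l.b

# Stand-in for a model that is assumed to have been trained already.
struct ModelA
    linear::Linear
end
(m::ModelA)(x::AbstractArray) = tanh.(m.linear(x))

# The new model holds the existing one as an ordinary field.
mutable struct ModelB
    model_a::ModelA
    linear::Linear
end
(m::ModelB)(x::AbstractArray) = m.linear(m.model_a(x))

model_a = ModelA(Linear(8 => 4))            # imagine this came out of training
model_b = ModelB(model_a, Linear(4 => 2))   # new head on top of the old features

x = rand(8, 3)                              # 8 features, batch of 3
y = model_b(x)                              # output has size (2, 3)
```

Because models are just structs, `model_b.model_a` refers to the very same parameter arrays as `model_a`; nothing is copied when the submodel is reused.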
But of course the real power of transfer learning comes from the large number of pretrained models. Here I am betting on ONNX import (though I haven’t implemented it yet). Once the corresponding branch is ready, I imagine an API something like this:
    mutable struct ModelB
        resnet::GenericModel    # can hold any ONNX structure
        linear::Linear
    end

    ModelB() = ModelB(load_from_onnx("/path/to/resnet50.onnx"), Linear(1000 => 10))

Right now ONNX isn’t very relevant to me, so work on it is on pause, but I accept feature requests.
I agree with the general thrust of this statement, but given the existence of https://github.com/FluxML/ML-Coordination-Tracker the implied contrapositive (Flux isn’t suitable for non-sci ML) doesn’t sit well with me. Just wanted to offer a couple of counterpoints about “production ML”:
Disclaimer: I don’t use any of these libraries for my day-to-day work. I also dislike the trend of suggesting Flux to anyone looking for a Julia DL framework without digging into their use-cases, level of experience and risk tolerance first for many of the same reasons you outline.
As a PyTorch user, I care about operator coverage as well. We’re not talking exotic ones like SVD layers, but RNNs, transposed/upsampling convolutions (e.g. for UNets), group/layer norm and dropout.
Nested gradients and higher-order autodiff are useful outside of SciML. Meta-learning is perhaps the flagship example, but I think a more relevant one would be newer optimizers like ADAHESSIAN. I could absolutely see myself using such an optimizer for fast prototyping of otherwise “boring” models.
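As a rough illustration of why such optimizers need second-order information, the sketch below estimates a Hessian diagonal with Hutchinson's randomized estimator, which is the kind of quantity ADAHESSIAN-style methods rely on. It is only a toy under stated assumptions: the objective is a fixed quadratic, the gradient is written by hand, and the Hessian-vector product uses finite differences of gradients rather than nested AD. All function names are made up for this example.

```julia
using LinearAlgebra, Random

# Toy objective f(x) = 0.5 * x' * A * x, whose Hessian is exactly A.
A = Diagonal([1.0, 4.0, 9.0])
gradf(x) = A * x     # hand-written gradient, standing in for an AD call

# Hessian-vector product H*v via a central difference of gradients.
# (A framework with nested AD would differentiate the gradient instead.)
hvp(x, v; h = 1e-4) = (gradf(x .+ h .* v) .- gradf(x .- h .* v)) ./ (2 * h)

# Hutchinson estimate: diag(H) ≈ average of z .* (H*z) over random ±1 vectors z.
function hessian_diag(x; samples = 10, rng = MersenneTwister(0))
    acc = zeros(length(x))
    for _ in 1:samples
        z = rand(rng, (-1.0, 1.0), length(x))   # Rademacher vector
        acc .+= z .* hvp(x, z)
    end
    return acc ./ samples
end

x0 = [0.3, -1.2, 2.0]
d  = hessian_diag(x0)    # close to the true diagonal of A
```

The Hessian here is diagonal, so every sample already recovers the diagonal up to rounding. With off-diagonal entries the estimate is only correct on average, which is why several samples are drawn.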
At the end of the day, I think it’s amazing that Julia library authors are willing to collaborate and willing to accommodate others in order to maximize library interop. Watching the balkanization of Python ML libraries/frameworks has been extremely frustrating, and I’d strongly advocate for the Julia ML ecosystem to push back on this “private islands” mentality wherever possible. That includes building out more NNlib-like infrastructural components so that frameworks can focus on their core competencies.
I would actually very much like to understand why Yota is more performant than Flux. Where is the secret sauce?
We train very large models in our Mill.jl / JsonGrinder.jl libs, and we have spent quite some time making them performant, including preallocation.
While I would be interested in trying it, not supporting ChainRules is a blocker for me. In my ideal world, different ADs should be as interchangeable as BLAS implementations.
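The "swap ADs like BLAS" idea can be sketched in plain Julia as one interface with several backends. Everything below (`ADBackend`, `FiniteDiff`, `ForwardDual`, `Dual`, `derivative`) is a hypothetical toy written for this illustration; it is not the ChainRules API, which shares differentiation rules between AD systems and is not itself a backend-selection interface.

```julia
# One interface, several interchangeable backends (all names hypothetical).
abstract type ADBackend end

struct FiniteDiff <: ADBackend
    h::Float64                      # step size for the central difference
end
struct ForwardDual <: ADBackend end # forward mode via dual numbers

# Backend 1: numerical central difference.
derivative(b::FiniteDiff, f, x::Real) = (f(x + b.h) - f(x - b.h)) / (2 * b.h)

# Backend 2: a minimal dual number carrying a value and its derivative.
struct Dual
    val::Float64
    der::Float64
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:*(a::Real, b::Dual) = Dual(a * b.val, a * b.der)

derivative(::ForwardDual, f, x::Real) = f(Dual(x, 1.0)).der

# User code is written once and does not care which backend runs it.
f(x) = (x * x) * x + 2 * x          # f'(x) = 3x^2 + 2

d_fd   = derivative(FiniteDiff(1e-5), f, 1.5)
d_dual = derivative(ForwardDual(), f, 1.5)   # exactly 8.75
```

The design point is that `f` is defined once and only the first argument to `derivative` changes, much as linear algebra code stays the same when the BLAS library underneath is replaced.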
Oh, I’m sorry it sounded like that! I really didn’t mean that Flux is not suitable for production, just that it focuses on other things. Large projects like PyTorch, with a huge user base and backing from multi-billion companies, can focus on hundreds of things at the same time, but both Flux and Avalon are quite tiny and thus have to choose which areas to spend most of their time on.
Consider higher-order derivatives, for example. It’s not too hard to add them to Yota, actually. But it’s not enough to add a new feature; it also has to be supported in all future versions! If I were to implement higher-order derivatives, every time I added a new diff rule I would have to check that it doesn’t break anything. That’s a huge time investment, and without clear benefits it’s most likely not worth it.
Or take a look at ONNX.jl. In the industry, ONNX is huge: it lets you export models to alternative runtimes (e.g. mobile) or import pretrained high-quality models. It was implemented for Flux years ago, but now it fails its own tests. That is a serious issue for the industry, but not so important for scientific ML.
(For the record: PyTorch can export models to ONNX, but not import them. While many people are asking for it, the entry threshold seems to be too high for casual users to go and implement it. And that’s where Julia really shines: if you really want something, you just go and do it yourself in a couple of days.)
I think the right question is why Flux is slow(er) for your use case. With all primitives like array operations and memory management on GPU already optimized by underlying frameworks, all libraries should perform approximately the same unless there’s a specific issue. It may be a double conversion to and from Float64, or one slow adjoint, or a missed optimization. I think digging into your current implementation and profiling every piece will give you more benefits in less time than switching to another AD.
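One common way an unwanted Float64 conversion sneaks into a model is ordinary type promotion. The snippet below is a small, package-free sketch; using Float32 as the model's element type is an assumption made for the example.

```julia
x = rand(Float32, 1_000)   # assume the model works in Float32

y64 = x .* 0.5             # 0.5 is a Float64 literal, so the result is promoted
y32 = x .* 0.5f0           # a Float32 literal keeps everything in Float32

eltype(y64)                # Float64
eltype(y32)                # Float32
```

Checking `eltype` on intermediate arrays while profiling is a cheap way to spot this kind of silent promotion before blaming the AD.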
Still, ChainRules is definitely on my list, so hopefully in some future version switching between AD implementations will indeed be as easy as it is for BLAS.
Avalon looks great! I’m really looking forward to trying it out!
And I agree it’s important to consider the “need for speed” both in production execution and in supporting the software development lifecycle. So I looked for and found the NewPkgEval.jl package, which helps developers maintain backward compatibility and detect “build breakers” early, before they impact multi-package integration test environments.
For your consideration: it appears NewPkgEval.jl has methods that could automate parts of your backward-compatibility tests. NewPkgEval.jl (https://github.com/JuliaCI/PkgEval.jl) automatically obtains multiple versions of Base Julia (e.g. v1.0.5, v1.4.2, v1.5) locally for you and provides “… the following commands to run the tests of a list of packages on a selection of Julia versions: …”
In other words, NewPkgEval.jl helps answer the burning question:
Why does my package fail?
If you want to debug why your package fails, it’s probably easiest to use an interactive shell:
julia> using PkgEval
P.S. And yes, always ask about the free coffee and ping-pong tables upfront, because if they say no to that, then “free massages” is probably out of the question.
Of course, and there’s something to be said for being more proactive about putting warning signs up on bitrotted projects so that people understand what is actively maintained vs obsolete or just an experiment. I don’t want to tread on any feet here, but ONNX.jl is kind of the poster child for that in FluxML right now.
That point was a bad distraction on my part, but the first one about missing dropout etc. still stands. Per your point though, that’s not an indictment of Avalon for being “unsuitable” for production, but rather an acknowledgement that it has a specific focus and that “production” (or applied research, in my case) requirements can be somewhat variable depending on the user.
I think the silver lining here is that not drowning in resources incentivizes the ecosystem to pool them instead of creating a separate stack for each framework. Hence why everyone is able to profit off NNlib, CUDA, ChainRules and the like. This even applies beyond the code level: did you know we had a lengthy GitHub thread and multiple Zulip topics about ONNX import/export?
This looks interesting, thanks! Usually I check the status of the package in all supported Julia versions in CI, but it doesn’t cover GPU stuff, so testing locally can be a good option.
Exactly. Frankly speaking, I can’t recommend either Flux or Avalon as the main deep learning library for someone in the industry (just yet), and not because dropout or transposed convolutions are missing, but because there are still too many bugs and caveats. Sometimes these issues come from third-party libraries and take a long time to get fixed (e.g. this bug in CUDA.jl); sometimes they hit corner cases and take weeks to fix (e.g. this one). But it’s part of the infrastructure maturing: when GPU stuff, web programming, API clients, big data tools, etc. are ready, we will already have the ML kitchen in a good state to finally replace Python.
What is good about existing deep learning libraries in Julia is that they are already suitable for certain tasks (e.g. I used Avalon extensively for my representation learning experiments; some of them can be found in the model zoo), and if something is missing, it’s usually not too hard to add it (e.g. I’m currently working on Transformers, which require at least an Embedding layer, so that’s my next goal).
No, I didn’t, thanks for letting me know! It will definitely influence my work on the ONNX branch.
Awesome, I could’ve sworn I saw you on the GH thread but must have confused it with Discourse. Either way, things might be ramping up on the Flux end. It would be great to get your input to see how much could be made framework agnostic!
Looks hot! I’m diving in.
Is that what you mean by “interoperability with existing DL frameworks”? You may want to spell it out in the docs. And since it’s plural, which others? ONNX? And/or PyTorch Lightning? I’m not up to speed on the latter, or on whether some Julia package corresponds to it. In general, see also:
Since you link to [vision] transformer, would your package be best to replicate BERT-models, GPT-3, or Google’s even larger Switch Transformer model? Since it’s sparse, would that be a hindrance?
Google’s 1.6 trillion parameter model:
Not yet, but both of these are indeed on the roadmap.
I think for such large-scale models with only known and well-optimized layers it doesn’t really matter what framework you use: they will all have the same size and approximately the same speed. One thing I can promise about Avalon (when it gets to its full vision) is that for any recent and more or less popular model you will be able to find an existing PyTorch implementation, translate it to Avalon in under an hour and load pretrained weights (if any) via ONNX.
Right now it’s not possible to replicate such large models in any Julia library (even Flux + Transformers.jl) because none of them support distributed training. It’s being worked on though!
Given the estimated cost of these models, adding distributed training capabilities doesn’t seem to be the biggest problem
Right, but there’s definitely a continuum between say, AlexNet and GPT-3. Also worth considering that many folks are working with multi-GPU systems where each card has a limited amount of VRAM. Even something as “mainstream” as training a ResNet on ImageNet becomes significantly more painful when you’re forced to run it with a small batch size on a single device.