Lilith.jl is now called Avalon.jl

I think the right question is why Flux is slow(er) for your use case. With all the primitives like array operations and GPU memory management already optimized by the underlying frameworks, all libraries should perform approximately the same unless there’s a specific issue. It may be a double conversion between Float32 and Float64, a single slow adjoint, or a missed optimization. I think digging into your current implementation and profiling every piece will give you more benefit in less time than switching to another AD.
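
As an aside, the Float32/Float64 issue mentioned above is easy to check in isolation. Here is a minimal, framework-free sketch in plain Julia (variable names are just for illustration) showing how a stray Float64 literal silently promotes a Float32 array:

```julia
x = rand(Float32, 1000)

y = 0.5 * x    # 0.5 is a Float64 literal, so the result is promoted to Vector{Float64}
z = 0.5f0 * x  # a Float32 literal (or eltype(x)(0.5)) keeps everything in Float32

eltype(y), eltype(z)  # (Float64, Float32)
```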

Yet, ChainRules is definitely on my list, so hopefully at some point switching between AD implementations will indeed be as easy as switching BLAS backends :slight_smile:

2 Likes

Avalon looks great! I’m really looking forward to trying it out!

And I agree it’s important to consider the “need for speed” both in production execution and in supporting the software development lifecycle. So I looked for and found the NewPkgEval.jl package, which helps developers maintain backward compatibility and catch “build breakers” early, before they impact multi-package integration test environments.

For your consideration, it appears NewPkgEval.jl (https://github.com/JuliaCI/PkgEval.jl) has methods that could automate parts of your backward-compatibility tests: it automatically obtains multiple versions of base Julia (e.g. v1.0.5, v1.4.2, v1.5) locally for you and provides “… the following commands to run the tests of a list of packages on a selection of Julia versions: …”

IOW, NewPkgEval.jl helps answer the burning question:
Why does my package fail?
If you want to debug why your package fails, it’s probably easiest to use an interactive shell:
julia> using PkgEval
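
From there, the README of that era sketches a workflow roughly along these lines. Treat this as an illustrative sketch only: the function names `PkgEval.obtain_julia` and `PkgEval.run`, and the package name "Example", are assumptions based on an older version of PkgEval and may have changed, so check the current README before relying on them.

```julia
julia> using PkgEval

# obtain a specific Julia version to test against
# (assumed API from an older PkgEval README; names may differ today)
julia> version = PkgEval.obtain_julia("1.4.2");

# run the test suites of the listed packages against that version
julia> PkgEval.run([version], ["Example"])
```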

PS: And yes, always ask about the free coffee and ping-pong tables upfront, because if they say no to *that*, then “free massages” are probably out of the question. :wink:

2 Likes

Of course, and there’s something to be said for being more proactive about putting warning signs up on bitrotted projects so that people understand what is actively maintained vs. obsolete or just an experiment. I don’t want to tread on any toes here, but ONNX.jl is kind of the poster child for that in FluxML right now.

That point was a bad distraction on my part, but the first one about missing dropout etc. still stands. Per your point though, that’s not an indictment of Avalon for being “unsuitable” for production, but rather an acknowledgement that it has a specific focus and that “production” (or applied research, in my case) requirements can vary quite a bit depending on the user.

I think the silver lining here is that not drowning in resources incentivizes the ecosystem to pool them instead of creating a separate stack for each framework. Hence why everyone is able to benefit from NNlib, CUDA, ChainRules and the like. This even applies beyond the code level: did you know we had a lengthy GitHub thread and multiple Zulip topics about ONNX import/export?

1 Like

This looks interesting, thanks! Usually I check the status of the package in all supported Julia versions in CI, but it doesn’t cover GPU stuff, so testing locally can be a good option.

1 Like

Exactly. Frankly speaking, I can’t recommend either Flux or Avalon as the main deep learning library for someone in industry (just yet), and not because dropout or transposed convolutions are missing, but because there are still too many bugs and caveats. Sometimes these issues come from third-party libraries and take a long time to get fixed (e.g. this bug in CUDA.jl), sometimes they hit corner cases and take weeks to fix (e.g. this one). But it’s part of the infrastructure maturing: when GPU stuff, web programming, API clients, big data tools, etc. are ready, we will already have the ML kitchen in good enough shape to finally replace Python.
What is good about the existing deep learning libraries in Julia is that they are already suitable for certain tasks (e.g. I used Avalon extensively for my representation learning experiments, some of which can be found in the model zoo), and if something is missing, it’s usually not too hard to add it (e.g. I’m currently working on Transformers, which require at least an Embedding layer, so that’s my next goal).
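
For a sense of how small such an addition can be, here is a minimal sketch of what an embedding layer could look like in plain Julia. This is not Avalon’s actual `Embedding` API, just an illustration of the idea: a learnable matrix indexed by token ids.

```julia
# Minimal embedding-layer sketch (illustrative only, not Avalon's API)
struct Embedding
    weight::Matrix{Float32}   # size: (embedding_dim, vocab_size)
end

# small random initialization
Embedding(vocab_size::Integer, dim::Integer) =
    Embedding(0.01f0 .* randn(Float32, dim, vocab_size))

# forward pass: look up the columns for a vector of token indices
(e::Embedding)(tokens::AbstractVector{<:Integer}) = e.weight[:, tokens]

emb = Embedding(10_000, 64)
emb([1, 42, 7])               # 64×3 Matrix{Float32}
```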

No, I didn’t, thanks for letting me know! It will definitely influence my work in the ONNX branch.

5 Likes

Awesome, I could’ve sworn I saw you on the GH thread but must have confused it with Discourse. Either way, things might be ramping up on the Flux end. It would be great to get your input to see how much could be made framework agnostic!

2 Likes

Looks hot! I’m diving in.

2 Likes

Is that what you mean by “interoperability with existing DL frameworks”? You may want to spell it out in the docs. And since it’s plural, what others? ONNX? And/or PyTorch Lightning? I’m not up to speed on the latter, or on whether some Julia package corresponds to it. In general, see also:

Since you link to the [vision] transformer, would your package be a good fit for replicating BERT models, GPT-3, or Google’s even larger Switch Transformer model? Since that one is sparse, would it be a hindrance?

Google’s 1.6 trillion parameter model:

1 Like

Not yet, but both of these are indeed on the roadmap.

I think for such large-scale models built only from well-known and well-optimized layers it doesn’t really matter which framework you use; they will all have the same size and approximately the same speed. One thing I can promise about Avalon (when it gets to its full vision) is that for any recent and more or less popular model you will be able to find an existing PyTorch implementation, translate it to Avalon in under an hour, and load pretrained weights (if any) via ONNX.

5 Likes

Right now it’s not possible to replicate such large models in any Julia library (even Flux + Transformers.jl) because none of them support distributed training. It’s being worked on though!

1 Like

Given the estimated cost of these models, adding distributed training capabilities doesn’t seem to be the biggest problem :slight_smile:

1 Like

Right, but there’s definitely a continuum between, say, AlexNet and GPT-3. It’s also worth considering that many folks are working with multi-GPU systems where each card has a limited amount of VRAM. Even something as “mainstream” as training a ResNet on ImageNet becomes significantly more painful when you’re forced to run it with a small batch size on a single device.

5 Likes