Lilith.jl is now called Avalon.jl

You know that moment when you realize you’ve made a mistake. When check-in for your flight closes in 5 minutes and you find yourself in the wrong terminal. Or it’s the first day at a new job and you discover they don’t have free coffee. Or when you create a package and then misspell its name every now and then.

That’s what happened to Lilith.jl - a deep learning library with a focus on high performance and interoperability (see the original announcement). After almost a year of misspelling and mispronouncing the name, it’s time to finally give it another one - Avalon.jl.

The old repository stays in place, but all new features, fixes and discussions will happen in the new project. Stay tuned, and ask about free coffee during the interview!


Looks like really solid work. Nice! What’s your take on Flux vs your efforts here? Different design philosophies?


Flux is really good for scientific ML, while Avalon is designed for production ML. Imagine two scenarios:

  1. You work on a scientific project, say, in physics. You already have a program that you have optimized for your needs, but now you want to tune a few parameters using differentiable programming. Flux/Zygote will really shine here, because the authors have put incredible effort into making this stack as flexible out of the box as possible.
  2. You work on a classic deep learning task like language modeling or computer vision. There’s a new state-of-the-art paper for your task, and a few people have even implemented it, but all the implementations are in Python. Of course, you can re-implement them in Flux, but due to differences in API, optimizers, initializers, etc., it may take a lot of time to replicate the results. Avalon follows the PyTorch API pretty closely, so translating a new model usually takes about half an hour (if all the relevant parts of the API are implemented, of course). In the future, it should also be possible to simply export a pretrained model from PyTorch to ONNX and then import it into Avalon.
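To make the “follows the PyTorch API pretty closely” point concrete, here is a tiny self-contained sketch of the layer/callable-struct pattern (toy definitions written for this comment, not Avalon’s actual code):

```julia
# Toy stand-ins for PyTorch-style building blocks (not Avalon's actual API).
struct Linear
    W::Matrix{Float64}
    b::Vector{Float64}
end
Linear(dims::Pair{Int,Int}) = Linear(randn(dims[2], dims[1]), zeros(dims[2]))
(l::Linear)(x) = l.W * x .+ l.b

relu(x) = max.(x, 0)

# A model is just a struct of layers that is called like a function -
# structurally the same shape as an nn.Module with a forward() in PyTorch,
# which is why line-by-line translation is usually mechanical.
struct MLP
    fc1::Linear
    fc2::Linear
end
MLP() = MLP(Linear(4 => 8), Linear(8 => 2))
(m::MLP)(x) = m.fc2(relu(m.fc1(x)))

y = MLP()(rand(4))   # a 2-element output vector
```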

Other differences include:

  1. Performance. Being very flexible, Flux will let you differentiate even through a program badly suited for AD (typically one with functions implemented using loops instead of array primitives). Avalon is stricter in this regard: if there’s no (usually optimized) differentiation rule for a primitive function, it will fail and you will have to go and add the rule. This certainly restricts you somewhat, but in return you get speed on par with the leading deep learning frameworks.
  2. Backward compatibility. I haven’t used Flux for quite some time, but my last few attempts looked like this: install the library, find a tutorial, find the list of changes between the version in the tutorial and the one on your machine, find out that half of the auxiliary libraries don’t support the latest version of Flux… go back to PyTorch. That’s OK for academia, but in industry continuously updating old code takes too much time. Thus one of the key promises of Avalon is backward compatibility, as high as I can support.
  3. Architecture. Avalon is built on top of Yota, a tape-based autodiff library. A tape is an incredibly easy structure to work with and debug: it’s easy to inspect, easy to optimize, easy to generate code from (Julia, ONNX or whatever else), etc. I’m not sure this point is actually important for anyone but me, but I’m not going to give this simplicity away :slight_smile:

(Please note again that I’m not following Flux development closely and may be wrong about any of these points.)
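To illustrate point 3: a tape is just a linear record of primitive calls. Here is a toy model of the idea (written for this comment; Yota’s actual data structures differ):

```julia
# A toy tape: a linear record of primitive calls.
struct Call
    fn::Function       # the primitive being called
    args::Vector{Int}  # indices of earlier tape entries used as inputs
end

struct Tape
    inputs::Vector{Any}
    calls::Vector{Call}
end

# Record the computation y = sum(x .+ 1) as two primitive calls:
tape = Tape([[1.0, 2.0, 3.0]], [Call(x -> x .+ 1, [1]), Call(sum, [2])])

# Because the tape is linear, replaying it (for execution, differentiation,
# or code generation) is a straightforward loop over its entries:
function play(tape::Tape)
    vals = Any[tape.inputs...]
    for c in tape.calls
        push!(vals, c.fn(map(i -> vals[i], c.args)...))
    end
    return vals[end]
end

play(tape)  # 9.0
```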


It has no documentation?

There are several tutorials and a model zoo to help you get started. The best overview of the implemented API is the export list. Docstrings are quite sparse (I swear one day I will fill them in… one day…), but since the API mostly replicates PyTorch’s, you can look there for details (adjusted for the order of array dimensions in Julia vs. Python).

Awesome. Look forward to experimenting with it. :slight_smile: Thanks for the great work.


Thanks, this looks interesting. How do you see the pros and cons of Avalon.jl relative to Knet.jl?

I haven’t yet looked deeply into Avalon.jl, but it looks really interesting. However, I can say that Flux is meant to be a thin wrapper around core capabilities such as AD and JuliaGPU. It is thus great to see more exploration and this focus on production ML.



As far as I understand from the export list, Knet still uses a custom array type (KnetArray) and operator overloading for its autodiff implementation (see AutoGrad.jl). I used a similar approach in one of my previous AD implementations, based on a TrackedArray type, but found it quite limiting:

  1. You can’t trace through code that isn’t compatible with TrackedArray. Say some 3rd-party library exports a function foo(::Array), but not foo(::TrackedArray) or foo(::AbstractArray). In this case you either need to make a PR to widen the accepted type, or make foo() a primitive and define a derivative rule for it manually. Neither is fun at all.
  2. Array wrappers become hell. Say you have at::TrackedArray which references a::Array, and then you create a LinearAlgebra.Diagonal from it. Should the chain of wrappers be Diagonal -> TrackedArray -> Array, or TrackedArray -> Diagonal -> Array? In practice you inevitably get a mix of both, and operator overloading over them becomes a very non-trivial task.

Avalon/Yota instead traces function execution without changing the types, and thus isn’t affected by these issues. On the cons side, tracing is better suited for static computation graphs, so dynamic functions (with loops and conditions outside of primitives) may be slower than in Knet.
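The wrapper-ordering problem from point 2 is easy to reproduce with a minimal tracked type (a toy written for this comment, not the actual TrackedArray from any package):

```julia
using LinearAlgebra

# A minimal tracked-array type, just enough to demonstrate wrapper ordering.
struct MyTracked{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::P
end
Base.size(t::MyTracked) = size(t.parent)
Base.getindex(t::MyTracked, i...) = t.parent[i...]

a  = [1.0, 2.0, 3.0]
at = MyTracked(a)

# Two equally plausible nestings for "a tracked diagonal matrix":
d1 = Diagonal(at)           # Diagonal -> MyTracked -> Array
d2 = MyTracked(Diagonal(a)) # MyTracked -> Diagonal -> Array

# The values are identical, but the types are not, so every operator
# overloaded for MyTracked must handle both nestings (and deeper ones).
d1 == d2                  # true
typeof(d1) == typeof(d2)  # false
```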

Note that at some point I tried to replace Knet’s AutoGrad with Yota, but all parts of the framework were so deeply integrated with each other that creating a new deep learning library turned out to be easier.

Avalon tries to keep things mostly independent. If you don’t like the API, you can take pure Yota and build your own framework on top of it. If you don’t like Yota, replace Yota.grad() with MyCoolPackage.grad() and things should work automagically. So keeping everything hackable is also part of the philosophy.


The same applies to Avalon: it uses Yota.jl for AD, CUDA.jl for GPU support, and NNlib.jl for many of its primitives. Flux uses Zygote instead of Yota, but both use IRTools to trace function execution. I wasn’t able to re-use ChainRules, though I hope these things will converge one day too.

(Isn’t it great how packages in Julia share common parts? What other language could afford it?)


It will be interesting to see whether Diffractor has a role here when it comes out. Separately, any thoughts on the Torch kernels in Torch.jl? Just file issues in the relevant packages for whatever is needed.



Currently I don’t have plans to integrate either of them. I think the future of both Diffractor and the kernels in Torch.jl heavily depends on the direction of mainstream AI research. Take Transformers, for example. They recently became the state-of-the-art tool for NLP, have applications in CV, and are even used for such advanced tasks as protein structure discovery. But if you look at their architecture, the most complex operation there is batch matrix multiplication! No complicated kernels like depthwise convolution, no higher-order derivatives or anything like that. In Avalon, I don’t try to make the package cool or suitable for all possible use cases, but instead concentrate on the core values of any deep learning library and keep things simple. Of course, this doesn’t mean Avalon is not going to change, just that it prioritizes the quality of base building blocks over the possible advantages of experimental ones.


This is a great and pragmatic approach.
If you target production, this is the approach to take.


Being a wrapper around core primitives is one design goal of Flux. Additionally, that doesn’t come with the same performance hits as before. If there are cases where we see regressions or performance concerns, we try to resolve them quickly, so Zygote shouldn’t have much issue with performance, barring cases where it’s harder for Julia to actually optimise the differentiating code. This is partly what Diffractor would address, so we should see Flux get faster still. Underneath, the two share a lot of infrastructure, as we see. You shouldn’t see too much difference for models such as YOLO or whatever, especially for production use cases.

Beyond that, optimizations around specific forward passes/ pullbacks etc are always welcome.

Let me show you an example where Yota and Zygote behave differently:

using Zygote
using BenchmarkTools

foo(A) = sum([x + 1 for x in A])
A = rand(10_000);
@btime foo'(A);
# ==> 106.426 μs (45 allocations: 939.03 KiB)

Zygote does a good job differentiating through the array comprehension, but it hides a performance issue: the same function can be written much more efficiently:

foo2(A) = sum(A .+ 1)
@btime foo2'(A)
# ==> 7.989 μs (3 allocations: 78.23 KiB)

Yota intentionally doesn’t support things like array comprehensions:

using Yota
using BenchmarkTools

foo(A) = sum([x + 1 for x in A])
A = rand(10_000);
@btime grad(foo, A)
# ==> ERROR: MethodError: no method matching var"#1#2"()
# ==> ...
# ==>  [3] foo at ./REPL[6]:1 [inlined]

So you have to look at foo() and realize this is not what Yota expects. You go and rewrite it to foo2(), which works fine:

foo2(A) = sum(A .+ 1)
@btime grad(foo2, A);
# ==> 14.151 μs (22 allocations: 157.02 KiB)

(note that here Yota is slower than Zygote due to a constant overhead which is negligible in real ML models)

Of course, it would be better for both libraries to show warnings or even rewrite such cases automatically, but we are not there yet.

so Zygote shouldn’t have much issue with performance, barring cases where it’s harder for Julia to actually optimise the differentiating code

Note that putting restrictions on supported code opens the door to optimizations beyond what the compiler can do. Avalon/Yota expects ML models to be pure computational graphs without side effects. Such graphs can be transformed in many different ways, e.g. by eliminating common subexpressions or replacing known primitives with their in-place versions. As far as I know, doing the same for pullback-based AD is much harder.
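As a toy illustration of the kind of transformation this enables (hypothetical code written for this comment, not Yota’s actual optimizer): on a side-effect-free linear trace, common subexpression elimination is a single pass with a dictionary.

```julia
# Each entry is (function, argument ids); ids refer to earlier entries
# or to symbolic inputs. A pure trace of: t1 = x + y; t2 = x + y; z = t1 * t2
trace = [(:add, (:x, :y)), (:add, (:x, :y)), (:mul, (1, 2))]

# CSE: because entries have no side effects, identical (fn, args) pairs
# can be collapsed into one, remapping later references to the survivor.
function cse(trace)
    seen  = Dict{Any,Int}()   # (fn, args) -> id of first occurrence
    remap = Dict{Int,Int}()   # old id -> new id
    out   = Any[]
    for (i, (fn, args)) in enumerate(trace)
        new_args = map(a -> a isa Int ? remap[a] : a, args)
        key = (fn, new_args)
        if haskey(seen, key)
            remap[i] = seen[key]      # duplicate: point at the survivor
        else
            push!(out, (fn, new_args))
            remap[i] = length(out)
            seen[key] = length(out)
        end
    end
    return out
end

cse(trace)  # [(:add, (:x, :y)), (:mul, (1, 1))]
```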


Thank you for the package and the explanation of the design ideas behind it. I like it.

I have a question: is it possible to do transfer learning with the package? Is there an API (or a simple way) to do that?

Yes, definitely, Avalon was created with transfer learning in mind. However, whether the package is suitable for your tasks right now depends on your expectations. What you can already do is to train one model and use it as a field in another model (models are just Julia structs). Roughly speaking:

model_a = ModelA()
fit!(model_a, some_data)

mutable struct ModelB
    model_a::ModelA
    linear::Linear
end

function (m::ModelB)(x::AbstractArray)
    y = m.model_a(x)
    y = m.linear(y)
    return y
end

But of course, the real power of transfer learning comes from a supply of pretrained models. Here I’m betting on ONNX import (not implemented yet). Once the corresponding branch is ready, I imagine an API something like this:

mutable struct ModelB
    resnet::GenericModel   # can hold any ONNX structure
    linear::Linear
end

ModelB() = ModelB(load_from_onnx("/path/to/resnet50.onnx"), Linear(1000 => 10))

Right now ONNX isn’t very relevant to me, so work on it is on pause, but I accept feature requests :wink:


I agree with the general thrust of this statement, but the implied converse (that Flux isn’t suitable for non-scientific ML) doesn’t sit well with me. Just wanted to offer a couple of counterpoints about “production ML”:

Disclaimer: I don’t use any of these libraries for my day-to-day work. I also dislike the trend of suggesting Flux to anyone looking for a Julia DL framework without digging into their use-cases, level of experience and risk tolerance first for many of the same reasons you outline.

  1. As a PyTorch user, I care about operator coverage as well. We’re not talking exotic ones like SVD layers, but RNNs, transposed/upsampling convolutions (e.g. for UNets), group/layer norm and dropout.

  2. Nested gradients and higher-order autodiff are useful outside of SciML. Meta-learning is perhaps the flagship example, but I think a more relevant one would be newer optimizers like ADAHESSIAN. I could absolutely see myself using such an optimizer for fast prototyping of otherwise “boring” models.

At the end of the day, I think it’s amazing that Julia library authors are willing to collaborate and willing to accommodate others in order to maximize library interop. Watching the balkanization of Python ML libraries/frameworks has been extremely frustrating, and I’d strongly advocate for the Julia ML ecosystem to push back on this “private islands” mentality wherever possible. That includes building out more NNlib-like infrastructural components so that frameworks can focus on their core competencies.


I would actually very much like to understand why Yota is more performant than Flux. Where is the secret sauce?
We are doing a training of very large models in our Mill.jl / JsonGrinder.jl libs and we have spent quite some time making them performant, including preallocation.
While I would be interested in trying it, not supporting ChainRules is a stopper for me. In my ideal world, different ADs should be interchangeable, just as we can swap BLAS implementations.

Oh, I’m sorry it sounded like that! I really didn’t mean that Flux is not suitable for production, just that it focuses on other things. Large projects like PyTorch, with a huge user base and backing from multi-billion-dollar companies, can focus on hundreds of things at the same time, but Flux and Avalon are both quite small and thus have to choose which areas to spend most of their time on.

Consider higher-order derivatives, for example. It’s actually not too hard to add them to Yota. But it’s not enough to add a new feature; it has to be supported in all future versions! If I implemented higher-order derivatives, every time I added a new diff rule I would have to think about whether it might break anything. That’s a huge time investment, and without clear benefits it’s most likely not worth it.

Or take a look at ONNX.jl. In industry, ONNX is huge: it lets you export models to alternative runtimes (e.g. mobile) or import pretrained high-quality models. It was implemented for Flux years ago, but now it fails its own tests - a serious issue for industry, but not so important for scientific ML.

(For the record: PyTorch can export models to ONNX, but not import them. While many people are asking for it, the entry threshold seems to be too high for casual users to go and implement it themselves. And this is where Julia really shines - if you really want something, you can just go and do it yourself in a couple of days.)