Deep learning in Julia

I’ve been wondering about the general status of the deep learning ecosystem in Julia. We obviously have the big project Flux.jl, as well as a similar but immutable, explicit-parameter alternative in Lux.jl.

I bring this up because it would be FANTASTIC to be able to mess around with bits of language models in Julia (we have some GSoC projects for this), but it just sparked the thought: what would we do now with deep learning in Julia that wasn’t possible or we didn’t have the foresight to do when Flux started?

I’m mostly just curious, but what would your ideal, fresh, greenfield deep learning framework look like in Julia? Knowing what we know about how the community, the language, and deep learning have evolved, what would be your “best case” deep learning interface/tooling/etc?

Some prompts:

  • What has Flux done right?
  • What slowed it down, if at all?
  • What architectural decisions ended up being painful?
  • Is the user experience good? What would make it better?

If you wrote a new library, from scratch, with no time or resource constraints, how would you do it, and what would be new or different?

12 Likes

Also: this discussion of Lux.jl is great and topical.

1 Like

Also related to ML more broadly:

1 Like

A very thorough analysis of nice-to-haves:

1 Like

It would be Lux. Back in 2020 it was clear that this was the direction to go: it fixed the SciML examples and got things shipping. @avikpal really did a good job with it, and now all of the SciML side cases have been deleted and it all uses Lux. Every single major hack we had around some issue in Flux was just fixed by using Lux in its standard tutorial way. And we spread it around too: in lots of cases where people had issues doing SciML-type things, I told them to just switch to Lux, and they came back a month later saying the project was done.

So if I had a magic wand, Flux’s interface would just be what Lux is today. But since I don’t have that magic wand I just tell people to use Lux. Its documentation is nice, its interface guards against a lot of the deficiencies of Flux, its pure functions are easier to understand and inspect, and it works more directly with more AD engines. To me, it’s a very clear decision.

For brevity I’ll keep this just to the deep learning frameworks and not AD, since AD has its own topics which go into detail on these points.

Flux’s way of handling type promotion is the main one. Layers are treated as having a type, i.e. a function is treated as having a type. It’s basically as if f(x)::Float32 = 2*x is the default, i.e. the ::Float32 on the output. So you write a function, and it always changes your type by default to Float32. Maybe that’s nice for some folks in deep learning, but that’s not how Julia functions work. And because Flux functions don’t act like a typical Julia function, everything that expects Chains to act like a normal function needs to be specialized, including ForwardDiff, ReverseDiff, every SciML adjoint, … . There is a nice fix though: Lux has an auto-conversion tool, and if you convert the Flux layers to Lux layers they will act like a normal Julia function and that will fix the type handling, and that’s how we ended up fixing most adjoint definitions.
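To make the analogy concrete, here is a minimal sketch in plain Julia (no Flux involved; `f_fixed` and `f_generic` are names made up for this illustration) contrasting a function with a declared Float32 return type against an ordinary generic function:

```julia
# Illustrative only: plain Julia, not Flux code.
# A return-type annotation converts every result to Float32.
f_fixed(x)::Float32 = 2 * x

# An ordinary generic function lets the input type decide the output type.
f_generic(x) = 2 * x

a = f_fixed(1.0)       # Float64 in, Float32 out
b = f_generic(1.0)     # Float64 in, Float64 out
c = f_generic(1.0f0)   # Float32 in, Float32 out

println(typeof(a), " ", typeof(b), " ", typeof(c))
```

Code written against `f_generic` can rely on the usual promotion rules; code written against `f_fixed` has to know about the fixed output type, which is the kind of special-casing described above.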

16 Likes

Okay I guess nerd snipe, let me give some detail. Here are a few PRs to look into:

“Remove Flux” being labelled as “fix tests and docs” because there were some long-standing issues that seemed like they didn’t have a fix until this change happened, so it seemed like a miracle.

As to why this is such an issue, we captured it in detail in some discussions. It really comes down to the conversion feature I discussed above, and the handling of arrays via restructure/destructure. For the latter discussions, see the following:

To highlight a piece in there, the following example points out some of the issues we have with treating Flux models as generic Julia functions:

  julia> using Optimisers  # destructure is defined in Optimisers.jl

  julia> v, re = destructure((x=[1.0, 2.0], y=(sin, [3 + 4im])))
  (ComplexF64[1.0 + 0.0im, 2.0 + 0.0im, 3.0 + 4.0im], Restructure(NamedTuple, ..., 3))
  
  julia> re([3, 5-im, 7+11im])
  (x = [3.0, 5.0], y = (sin, ComplexF64[7.0 + 11.0im]))

Notice that “translating” (x=[1.0, 2.0], y=(sin, [3 + 4im])) into a Vector gives ComplexF64[1.0 + 0.0im, 2.0 + 0.0im, 3.0 + 4.0im], which makes sense to me. But translating the vector [3, 5-im, 7+11im] back into the structure gives (x = [3.0, 5.0], y = (sin, ComplexF64[7.0 + 11.0im])). Notice the difference in the second term: 5-im becomes 5.0. The structure has a type, and things are enforced to that type. This also means that if that structure has precision Float32 (the default), and you use Float64 values in the computation, implicitly Flux will change everything internally back to Float32 without giving an explicit notification to the user.
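The rule at work can be mimicked in a few lines of plain Julia. This is only a sketch of the observable behavior: `project` is a hypothetical helper, and it is not how Optimisers.jl implements restructuring internally.

```julia
# Hypothetical helper for illustration; not the Optimisers.jl implementation.
# Incoming values are forced to the element type of the template array.
project(::AbstractArray{T}, v) where {T<:Real} = T.(real.(v))   # drops imaginary parts
project(::AbstractArray{T}, v) where {T<:Complex} = T.(v)

x_new = project([1.0, 2.0], [3, 5 - im])          # real template: 5-im becomes 5.0
y_new = project([3.0 + 4.0im], [7 + 11im])        # complex template keeps 11im
w_new = project(Float32[1, 2], [0.1, 0.2])        # Float32 template: Float64 input is narrowed

println(x_new, " ", y_new, " ", eltype(w_new))
```

The third call is the case that matters most in practice: a Float32 template quietly narrows Float64 values.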

I think there are a lot of arguments to be had about this. I understand that the reason given by the Flux devs for this is that users may not understand that Float64 is slow on GPU, and thus Flux fixes that for the user with its type handling. I understand that, but I think a warning would have been a much better solution. I had a lot of pain over the years with people expecting the Flux version of f(x)=2x to act like a generic function, only to open up issues about non-convergence etc., and it took us a long time to find out that Flux was making our gradients Float32 in some spots and therefore we had numerical instability because of these implicit conversions, which went away when we forced it to not take over the types. Yes, most people in deep learning don’t need to worry about types, but what I work in really cares about accuracy, and for that we have to worry about layers, chains, values, etc. all of it overriding defaults to work, and it leads to issues. Lux gets rid of all of that.
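As a small illustration of why a silent Float64 to Float32 round trip can matter for accuracy-sensitive work, here is a plain Julia sketch (no Flux involved; the variable names are arbitrary):

```julia
# Plain Julia illustration of precision lost in a Float64 -> Float32 round trip.
x = 1.0 + 1e-10             # the perturbation is representable in Float64
y = Float64(Float32(x))     # simulate an implicit down-conversion and back

kept_in_f64 = (x != 1.0)    # Float64 keeps the perturbation
lost_in_f32 = (y == 1.0)    # Float32 rounds it away

# Machine epsilon differs by a factor of 2^29 between the two types.
ratio = eps(Float32) / eps(Float64)

println(kept_in_f64, " ", lost_in_f32, " ", ratio)
```

Any quantity smaller than about `eps(Float32)` relative to its neighbors disappears in such a conversion, which is one way a gradient computed partly in Float32 can destabilize a solver that expects Float64 accuracy.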

This is probably the worst kind of issue. Not because it’s not motivated but because it is well-motivated by the Flux team. There is a reason Flux does this, I understand the reasons for it, it’s not stupid. It’s a rational and defendable choice. But I don’t think it’s the correct choice, I think it’s unintuitive and I have had countless days of issues from both me and a large portion of my user base who have had issues with this choice. Because it’s a rational and defendable choice, that is a reason for argument. I wish the issues with Flux were something I could say was just dumb because then that’s easier to argue, but subjective user interface questions and other bikeshedding topics are definitely the worst way to lose friends because there’s no definitive answer. It’s just a choice, and the end user matters.

So I know I have made more than a few people angry by bringing this exact topic up so many times and I’ll leave it here. There is a reason for this choice, but I personally think Flux has made the wrong one for the general community and I think we need to go forward with the solution that more closely matches standard Julia semantics. I would go as far as to say many of the “correctness” issues brought up in the state of machine learning discussions are directly related to these choices made in Flux (and some similar choices in Zygote) and it is these pieces that have held the community back. For these reasons I have been happily moving forward with Lux and Enzyme in a way that has been much more solid than before. I don’t say this to disappoint anyone, I very much admire the Flux devs and their mission, but I personally cannot look at the large list of issues I have hit which were caused by this choice and not call it a bug, and not choose the framework that has solved it.

That being said, Flux and Lux use the same layer definitions in NNlib.jl, CUDA.jl, etc., so having this different interface doesn’t actually hurt the community as much as one might imagine. The amount of reused work is extremely high, it’s just top level interfacing and documentation that is different. So there is tons of overlap, “nothing to worry about here”, etc. in a real sense. But the closer everyone works together the more arguments you will have, these are just marital spats :sweat_smile:. But yes, I would say we really need to consider telling all new users to try Lux.jl first in 2024, not Flux.jl, and I would defend that decision deeply. And while I say that, I can easily imagine someone rationally arguing in the exact other direction. Having two reasonable decisions which both reuse the same infrastructure really isn’t that bad of a place for the ecosystem, I would just change what we consider the default.

13 Likes

I would never recommend using Lux over Flux, I find Flux much simpler and more elegant. Having parameters and states as part of the model as Flux (and pytorch!) does instead of juggling them around feels much more natural.

Consider for instance the forward pass of a hypothetical TransformerBlock in Flux:

# Flux style
function (block::TransformerBlock)(x)
    x = x + block.attn(block.ln_1(x))
    x = x + block.mlp(block.ln_2(x))
    return x
end

I haven’t played much with Lux, but I think the same forward pass in Lux would look like this instead:

# Lux style
using Accessors: @reset  # assuming @reset comes from Accessors.jl

function (block::TransformerBlock)(x, ps, st)
    z, newst = block.ln_1(x, ps.ln_1, st.ln_1)
    @reset st.ln_1 = newst
    z, newst = block.attn(z, ps.attn, st.attn)
    @reset st.attn = newst
    x = x + z
    z, newst = block.ln_2(x, ps.ln_2, st.ln_2)
    @reset st.ln_2 = newst
    # Feed the ln_2 output (z) into the MLP, matching the Flux version above.
    z, newst = block.mlp(z, ps.mlp, st.mlp)
    @reset st.mlp = newst
    x = x + z
    return x, st
end

I didn’t even bother to update the state, I’m not sure how much extra work that would be. Edit: update the example with the state update syntax.

Which one do you prefer?

Maybe the explicit parameter style is more convenient in some SciML use cases, I don’t know, but we are talking about very niche applications within the deep learning world.

Regarding the promotion rules thing mentioned, yes in Flux we convert the inputs to the weights’ float type if needed (and give a warning in doing so). If you find yourself in a situation where you are mixing different float types, most likely you are doing something wrong, and this is why we introduced the conversion.

Is this worth building a whole new deep learning framework over? Maybe one could have asked a bit more strongly for the promotion to be removed? Since the problem is felt so strongly, I’m going to advocate for the removal of the promotion rules in Flux.jl.

“That being said, Flux and Lux use the same layer definitions in NNlib.jl, CUDA.jl, etc., so having this different interface doesn’t actually hurt the community as much as one might imagine. The amount of reused work is extremely high, it’s just top level interfacing and documentation that is different”

I think this is minimizing the problem. First of all, since the Julia DL ecosystem is light years behind the Python one and falling ever further behind, if only a fraction of the amount of copying / adapting / redesigning / reimplementing that has gone into Lux had gone into Flux we would be in a slightly better place.

Also, the amount of code that is not shared is not small. And while Lux uses a bunch of packages like NNlib.jl, Optimisers.jl, Zygote.jl, MLUtils.jl that are maintained by the Flux team and the community at large, I don’t see any contribution going in the reverse direction.

For instance, I see in the readme of the LuxLib.jl repo

“Think of this package as a temporary location for functionalities that will move into NNlib.jl.”

LuxLib was created one year ago and I have yet to see any contribution going into NNlib.jl.

So my impression is that there is a tendency of Lux maintainers to develop the sciml universe without giving back to the rest of the ecosystem.
I understand that part of it is because contributing to mature packages has frictions, you have to go through reviews, discussions, and approval, it is a slower and much more constrained process than what goes on in Lux where a single guy does all of the development, creates PRs and merges them on the fly. Maybe we can do something about relaxing the constraints of the contribution process?

Bottom line, in 2024 I recommend using Flux.jl over Lux.jl, and sadly recommend using pytorch or jax over all of them. If we want to change this, I think we should avoid duplicating work, and more generally encourage julians to contribute to the DL ecosystem at large, because julia has all the potential to be a great language for deep learning but a much larger developer base is needed to make it happen.

19 Likes

In my opinion, one of the key reasons why deep learning in Julia is light years behind PyTorch/JAX is the performance and convenience of Automatic Differentiation. There are so many AD packages, and each has its own tradeoff between speed and generality. I believe we should make it easy for users to pick the one that works best for them, which explains the creation of DifferentiationInterface.jl with @hill. That way, AD packages can coexist and even compete without causing confusion for downstream users. In addition, it reduces code duplication, because every ML ecosystem (Flux, Lux, Turing, SciML) has its own variant of an Enzyme/Zygote extension with gradient bindings, and we should just pool all of those.

I’ve been chatting with various power users of AD to see how they could leverage the interface. The conclusion is that it is much easier when what you want to differentiate is a vector (or ComponentVector), and not some arbitrarily complex (callable) struct like a Flux layer. So for this reason, I think Lux.jl is more suited to easy AD integration and backend switching.
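A rough sketch of why a flat parameter vector is convenient: any routine that accepts a function of a vector can be plugged in without knowing about the model structure. Everything below is plain Julia with hypothetical helper names (`flatten`, `unflatten`, `fd_gradient`), and central finite differences stand in for a real AD backend; it is not the DifferentiationInterface.jl or ComponentArrays.jl API.

```julia
# Hypothetical helpers for illustration only.
# Structured parameters for a tiny affine model y = W*x .+ b.
ps = (W = [1.0 2.0; 3.0 4.0], b = [0.5, -0.5])

# Flatten to a plain vector, and rebuild the NamedTuple from a vector.
flatten(p) = vcat(vec(p.W), p.b)
unflatten(v) = (W = reshape(v[1:4], 2, 2), b = v[5:6])

model(x, p) = p.W * x .+ p.b
loss(v, x) = sum(abs2, model(x, unflatten(v)))

# Central finite differences: works on any function of a flat vector.
function fd_gradient(f, v; h = 1e-6)
    g = similar(v)
    for i in eachindex(v)
        e = zeros(length(v))
        e[i] = h
        g[i] = (f(v .+ e) - f(v .- e)) / (2h)
    end
    return g
end

x = [1.0, 1.0]
g = fd_gradient(v -> loss(v, x), flatten(ps))
println(g)
```

Swapping `fd_gradient` for a different gradient routine would not require touching `model` or `loss`, which is the property that makes backend switching easy.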
Of course it doesn’t get us all of the way there, but to me it seems like a very important step.

7 Likes

I have been wondering whether there is some sort of feature tracker for things missing, e.g. compared to PyTorch, similar to what Rust has at Are we learning yet. This would be helpful in finding the right areas to focus work on, and would make certain things more concrete. In this thread we have opinions ranging from “Flux and PyTorch have basically the same functionality thanks to code share” to “they are light years apart” or “you can’t use GPU clusters/train large models in Flux”. Part of this might be the different times when the statements were made, but it’s hard to track the movement, if any, of the domain as a whole, and people who hit a brick wall maybe don’t keep retrying later. If anyone who has used both could give examples of models that can be easily built using PyTorch but just hit a brick wall of either problems or productivity in Flux, that would help a lot I think.

5 Likes

:100: agree. Not everything is SciML. Plus, the Julia community is doing many other things with DL outside scientific machine learning.

Thank you for saying it out loud @CarloLucibello , you are not alone.

BTW, does anyone know what happened with MikeInnes (original author of Flux.jl)? He is a super talented guy. Sad we lost him too.

4 Likes

SciML is one area in which Julia is in a dominant position. Many would agree that a new product doesn’t compete with established products by offering a worse version to a large audience, but instead by becoming the better option in a small niche and slowly expanding that niche. IMHO even if SciML is a small niche, abandoning it would not be a good idea.

1 Like

This is mixing up two potentially different ideas, surface level syntax vs semantics. Lux and Flux have very different semantics of a function. Lux’s functions act more like any average generic Julia function, while Flux’s functions have oddly different semantics. That’s as described above and the reason for the ease of getting Lux to do other things, mentioned elsewhere like:

Exposing those semantics directly then gives a more direct surface level syntax, but that does not necessarily have to be the case. Lux for example has a compact layer API, which defines a macro for simpler surface level syntax that reinterprets the code into that necessary for the Lux semantics:

I don’t think we’ve thoroughly explored disconnecting these two. Julia syntax is great but Flux is making concessions in its design that make connections to automatic differentiation and linear algebra support (on parameters) much more difficult in order to make its surface level Julia syntax clean, but I would argue that we should instead look to make the underlying mechanisms clean and simple and fix the surface level syntax with a DSL. That would have better flexibility in the long run while possibly retaining most of what is liked about the Flux syntax.
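One way to picture the separation of semantics from surface syntax is a thin wrapper rather than a full macro-based DSL. The sketch below is plain Julia with made-up types (`Affine`, `Bundled`); it is not the Lux or Flux API, only an illustration of keeping the core explicit while offering an `m(x)` call style on top.

```julia
# Hypothetical types for illustration; not the Lux or Flux API.
# Explicit semantics: the layer is a pure function of (x, ps).
struct Affine end
(::Affine)(x, ps) = ps.W * x .+ ps.b

# Surface-level convenience: bundle a layer with its parameters
# so that it can be called as m(x).
struct Bundled{L,P}
    layer::L
    ps::P
end
(m::Bundled)(x) = m.layer(x, m.ps)

ps = (W = [2.0 0.0; 0.0 2.0], b = [1.0, 1.0])
m = Bundled(Affine(), ps)

y_bundled = m([1.0, 2.0])               # convenient call style
y_explicit = Affine()([1.0, 2.0], ps)   # explicit call style, same computation
println(y_bundled == y_explicit)
```

Tools that need direct access to the parameters can work with `Affine` and `ps`, while users who prefer the compact call style use `Bundled`.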

5 Likes

I have not used either Flux or Lux, but at the end of the day this is open source. If someone wants to try out a new idea or explore a different part of the design space, they are free to do so and we should not be scolding them. Let’s not get into another argument about duplication of efforts.

16 Likes

For me the brick wall is the lack of proper GPU memory management.
It does not matter if some functionality is missing if you run out of memory on the first training iteration. And most of the building blocks already exist in Julia.

3 Likes

Unfortunately the scolding has a tendency to go both ways, from new to old packages and from old to new. Of course some of it can be expected, and I too have been guilty of that to sell my own stuff. But I think we could all use a little more patience when the package across the street doesn’t do things the way we’d like.

Outside of the development bubble, people don’t see Flux and Lux as competitors but rather as complementary options. So this is just me telling the people behind each that their efforts matter, and are recognized.

8 Likes

I will respond to the raised discussion points below, but for the record, I am disappointed that an earnest question has turned into another Flux vs. Lux back and forth. It’s perfectly possible to raise issues with a specific framework (e.g. Flux) while also stopping short of concluding that “all new users are recommended to use Lux.” We don’t need to have an antagonistic framing. For my part, I always make an effort to recommend Lux to users who will benefit from its design. I find it demoralizing to work on Flux only to have my work constantly presented as “less than.” (I should add that all contributions to Flux are unpaid labor done in our spare time).

I don’t think it is surprising that the devs for Flux and Lux respectively would build the frameworks they are currently working on. Perhaps this speaks to what’s nice about building an ML framework in Julia—we don’t need to build the foundation, and so we aren’t crippled by decisions that require refactors so large that we aren’t already doing them.

As an example, in 2021, I would have said that I would never have built Flux.params (AKA implicit parameters). And in fact, we replaced this with explicit parameters several years ago, and we are finally removing implicit parameters completely after a long deprecation period. Much of the refactor work built tooling that Lux relies on to exist at all. I would call this tooling the “core” of an ML framework.

What makes an ML framework? AD, a set of fast functional operations, optimizers, and layers. AD work is split across many teams in Julia. The remaining pieces, with the exception of the last one, are work done by the Flux team, and apparently it is done well enough that Lux can rely on it. Even for the layers, Mike was ahead of his time in pioneering Functors.jl which allows us to manipulate deeply nested structures (I say this because it is very similar to the core of how Jax works). Again, Lux’s NamedTuple mode for parameters is possible because of the tooling in Functors. So frankly, given the enormous amount of coverage in an ML framework that we get right, I just don’t think the direction this discussion has gone really speaks to what we get wrong.
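For readers unfamiliar with the idea of manipulating deeply nested structures, here is a minimal plain Julia sketch. `treemap` is a hypothetical stand-in written for this illustration; Functors.jl provides the real `fmap`, which handles far more cases.

```julia
# Hypothetical helper for illustration; Functors.jl provides the real `fmap`.
# Apply f to every array leaf of a nested NamedTuple/Tuple, keeping the structure.
treemap(f, x::AbstractArray) = f(x)
treemap(f, x::Union{NamedTuple,Tuple}) = map(y -> treemap(f, y), x)
treemap(f, x) = x   # leave non-array leaves (e.g. activation functions) untouched

ps = (layer1 = (W = [1.0 2.0], b = [0.0]),
      layer2 = (act = tanh, W = fill(3.0, 1, 1)))

# Convert every parameter array to Float32 in one pass.
ps32 = treemap(a -> Float32.(a), ps)
println(typeof(ps32.layer1.W), " ", ps32.layer2.act)
```

The same traversal pattern underlies operations such as moving a model to another device or changing its precision.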

Flux’s layers behave like Julia functions in every way except the automatic conversion to Float32 of inputs. We do this for performance reasons, and we only made this choice after responding to so many users encountering these errors. Contrary to what is suggested above, we do warn users when we do this. And it is possible to avoid this conversion by explicitly saying that you want higher precision. It’s also worth pointing out that other ML frameworks make it much harder to use higher precision floating point, so there is precedent for this choice. Also, this change was made in Jan 2023 (nearly a year after the first Lux draft), so I don’t really consider it a core design decision. It’s easily reversible or changed if enough users complain about it, but that’s not the feedback we’ve gotten. I certainly wouldn’t say it is responsible for the majority of AD bugs or that it has held ML in Julia back given that it has barely existed for a year.

This is entirely orthogonal to the destructure complaint. Let me add some historical context to destructure. The original implementation was this hacky loop written around Flux.params. It was brittle and would constantly silently error. It was added to Flux with no tests. When the current Flux team took over maintenance and began the transition to explicit parameters, we added a totally revamped version to Optimisers.jl with many more tests. We also made it twice-differentiable since it is used in SciML contexts. One of the challenges with building something like destructure is that re(p) actually has to rebuild the model struct, which has a type for various fields. A reasonable choice is to match the type of the original struct, which is a clear consistent rule. We’ve since made exceptions to this rule to fix bugs raised by SciML. The projection behavior being confusing is due to these exceptions IMO, but that’s only one aspect of what makes destructure not great.

But this new destructure arrived in April 2022 right around when Lux was finishing its first prototype. Lux’s approach completely avoids the need to rebuild the model, so it doesn’t even need to make the tricky decisions that destructure does. So, from the perspective of SciML users, this main difference is a pain point in the core design of Flux that is solved by Lux. This is why I welcome the development of Lux, and we suggest it to users who are better served by its choices. Yet, if the question is what are the pain points of deep learning libraries in Julia, this would not be something I would point to.

If I had to start from scratch, I would write explicit params Flux. Implicit params was too magical. It led to silent and hard to diagnose errors. Reasoning about a backwards pass was non-trivial even for someone who has an understanding of how Zygote works. The other thing I would do is invest all my AD effort in a low tech tool like Tracker. I agree with @gdalle that the vast majority of pain points are AD-related. So, the majority of the remaining work that Flux/Lux/anyone devs need to do is build a simple AD tool that is fast while still allowing the use cases of deep learning. Maybe Enzyme will be that tool (though it isn’t simple…but that’s not a hard requirement). Flux and Lux both work with Enzyme, and I am hopeful that DifferentiationInterface.jl makes working with Enzyme easier for all users.

Overall, most of the issues are either AD related or compiler/performance related (e.g. GPU memory consumption and GC is not great). The way Julia is set up, these are problems that lie outside of the ML core codebase.

I agree, and I don’t think I’m overstepping by saying so does the rest of the Flux team. We love that Lux exists, because it solves the problems with Flux’s design for SciML in a better way. No one expects Lux devs to contribute back to shared underlying libraries. What I object to is the constant antagonistic framing that comes up. I don’t want to get into Flux vs. Lux arguments. I want to recommend both libraries based on a user’s needs. Instead, I have to spend time correcting a new user’s misunderstanding of the situation, because other people spend time talking about our design choices as “bad for the community as a whole.” Lux’s docs still misrepresent Flux IMO.

As an example, Chris mentioned @compact earlier in the thread. This is code that was originally developed by the Flux team in Fluxperimental.jl. After we released this feature, this code was copy-pasted without attribution (I was wrong about this) into Lux.Experimental. That’s perfectly fine; this is the beauty of open source development. But it starts to become a problem when this feature is constantly brought up (not necessarily in this thread) as a Lux innovation that Flux lacks. It’s annoying when it comes up in Slack threads. It’s problematic when a prominent Flux user tries to promote ML in Julia via Flux on Twitter and Lux devs respond with simplified code using @compact under the framing that this is a feature Lux has and Flux does not. IMO that presents the community in a bad light in addition to insulting Flux devs.

To repeat, there’s no expectation that Lux doesn’t reuse Flux code or shared libraries. There is no expectation of attribution though it’s always nice. There is no expectation to contribute back to shared libraries though that’s also nice. All we ask is that less time is spent talking about Flux like it is a fundamentally flawed library responsible for holding back the ML community in Julia. In many cases, it is possible to celebrate what Lux does well without bringing up Flux at all.

34 Likes

Ummm… Blaming Lux.jl/src/contrib/compact.jl at main · LuxDL/Lux.jl · GitHub

I believe people should contribute in a way that’s enjoyable for them and moves our community forward most efficiently. If that’s duplicated effort, then so be it. But as far as Flux libraries go, we want to make the process easier so that the non-duplicated option is the easier one when possible.

To Carlo’s point though, these discussions about the state of ML tend to focus on design or technical details. But a majority of the gap is just a lack of dev time spent. So the extent to which we can discuss how to make the non-duplicated option the easy one is worthwhile.

2 Likes

To prevent this topic from getting too heated, and allow everyone to answer calmly, I’m setting a timer of 15 minutes between messages (and may increase its duration). Hopefully the discussion can remain civil and beneficial.

12 Likes

This is, I think, the kind of answer I’m looking for. I have gotten the sense from people that automatic differentiation is kind of the real bottleneck here, just as it was with Turing.jl.

Interesting too that the explicit parameterization is the move, it works better in my intuition and I’m glad people agree to some extent.

I think I’m vaguely hopeful that Enzyme ends up doing well. It’s a fantastic idea with a great team, and Julia is such a lovely testing ground for it.

Can anyone comment on ease of use and design? Like, as a person constructing a network, what pain points have you had, or what do you really like about the current interfaces?