Is it a good time for a PyTorch developer to move to Julia? If so, Flux? Knet?

I know I tend to be a little controversial, but here are my thoughts. Basically, I think you are both right and wrong, for different reasons. I hope you are all learning to see my criticism as just opinionated honesty - if not, you can’t please everybody :).

I think Julia is making its way toward its goal. It’ll be pretty slow for a while, but it’ll end up with a reasonable share of users doing ML. Julia is awesome for niche things like the ones @machineko is talking about - but that’s part of the “problem”. Those things make great blog posts, but they are so experimental that the syntax and examples outside the docs change every few weeks, to the point where really only the people developing the libraries can use them. I really feel for anyone trying to do those things in production as an outsider, or trying to learn the language. The language itself is really easy, but I’ve seen newcomers get frustrated by examples that are 2 months out of date.

I think there’s a cultural hurdle. The reality is that most users of other languages aren’t developers of that language or its low-level APIs. In my opinion, most of the developer market isn’t in the “use a language to fix it” camp; people will do that, but usually only because they have to. Julia has taken the “get more grad students” approach, which is good, it’s free labor, but at some point we need more production users. To get them, we need more of the ecosystem to be production-ready. It’ll happen as soon as more researchers realize that there’s more acclaim in production-grade tools than in arXiv papers.

I’m of the opinion that we actually have enough people to flesh out the ecosystem, make stable libraries, etc. But the focus isn’t on doing that. Someone a year or so ago was asking people to “reach 1.0”, but I think what they meant was “why is so much of the ecosystem a bed of sand outside of base Julia and a few core libraries?”.

Look, there are dozens of examples of contributors doing this. Some really solid stuff. Yes, I am guilty of the argument I am making too. Most people in industry are looking for the 80-20; they are typically risk-averse because risk correlates to $$$. Jumping into tools that aren’t stable takes some courage and careful planning. The alternative tends to be considerably easier. Rule of thumb: people are lazy unless they are bored, and something needs to be 10x better before most people will even go beyond the first page of a Google search.


Please do not generalize, I am lazy even when I am not bored.



:rofl: Good point, generalizations of humans aren’t really my cup of tea.

The “Flux – Ecosystem” page should be able to speak to some of that, but I welcome folks to talk about the places we have to work on. Having said that, mixed precision and production would look somewhat different than in PyTorch/TF, because we rely on a lot more code reuse and in most cases it is about switching the eltype. We are working on adding more tutorials about that, and I would absolutely love for folks to suggest things they struggle with and show examples of code that may not be addressed in the docs.


My biggest gripe with Flux right now (well, I last used it a couple of months ago) is that there is a disconnect in the training patterns available to end users. Flux is this awesome library that lets users do cutting-edge research, super custom topologies, you name it! Great! Then when I go to train, I am pigeonholed into a single “train” idiom, which is antithetical to this.

I’ve suggested for years that we should export the training functions and the gradient handlers so users don’t have to copy-paste core Flux code or import half a dozen specific functions they intend to modify. It’s trivial - but it matters to me a lot. That said - maybe not to anyone else :).

My second gripe is basically… gradients can die in ways that users who aren’t familiar with Zygote won’t be able to track down. Sometimes it’s simple type clashes and the like (don’t use the wrong division operator, etc.), but almost every project I’ve used it on has ended with me in a chatroom or on GitHub at some point asking for help. A lot of the time the solution is to dev the main branch, or to use some old version where the issue wasn’t present but other core features were missing too.

For new projects I estimate I spend maybe 3 hours each time doing something simple and figuring out why my gradient isn’t being tracked. I see this as a common peril for new users as well.

Tons of promise - but it’s been finicky for me. For FFNNs I usually opt to hardcode derivatives and bypass Flux and Zygote entirely. For more involved work I may lean into the toolchain because the flexibility is astounding. It’s just worth mentioning that it comes at a bit of a cost. Not a high cost - but for people who already have a tool that offers 95% of what Flux does, it makes the effort less worthwhile.
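To make the "hardcode derivatives" route concrete, here is a minimal sketch in plain Python (standard library only, no Flux, Zygote, or PyTorch calls). It assumes a single sigmoid unit with a squared-error loss, and every function name in it is hypothetical. The hand-written gradient is compared against central finite differences, which is one generic way to notice a gradient that has silently gone wrong.

```python
import math

def forward(w, b, x):
    """One sigmoid unit: y = sigmoid(w * x + b)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, x, target):
    """Squared error for a single sample."""
    return (forward(w, b, x) - target) ** 2

def loss_grad(w, b, x, target):
    """Hand-derived gradient of the loss with respect to (w, b)."""
    y = forward(w, b, x)
    dz = 2.0 * (y - target) * y * (1.0 - y)  # chain rule through the sigmoid
    return dz * x, dz

def numeric_grad(w, b, x, target, eps=1e-6):
    """Central finite differences, used only to check the hand-written gradient."""
    dw = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)
    db = (loss(w, b + eps, x, target) - loss(w, b - eps, x, target)) / (2 * eps)
    return dw, db

# Illustrative values, not taken from any real model.
analytic = loss_grad(0.5, -0.2, 1.5, 1.0)
numeric = numeric_grad(0.5, -0.2, 1.5, 1.0)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print("analytic:", analytic, "numeric:", numeric, "max gap:", max_gap)
```

If the two gradients disagree by more than a small tolerance, the hand-written derivative (or whatever produced the "analytic" one) is the first place to look.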


Could this answer be included in a guide blog for Julia?

Nice way to explain it to newcomers. Maybe a comment on (afaik) Juno being developed less now, due to the Atom project getting put on hold - maybe I misunderstood it at the time.

Kind regards

I’m of the somewhat extreme opinion that train! should be removed outright and Flux should only expose gradient/pullback. The one-function API looks nice on the surface, but it is horribly inflexible and obscures a lot of common errors that crop up in training loops. PyTorch seems to do just fine without one too, so it’s not like removing it will cause an unmitigated UX disaster.


This makes me feel much better - I thought it was just me. I do things to my gradients in like 80% of the models I build. Flipping signs, clipping, etc.

I don’t mind having the option of using convenience functions but I strongly prefer them to be extensible and not be things end users have to bend over backwards to accommodate when they leave the cookie cutter. At least, the rest of Flux doesn’t do that to you :).

It might even be considered more Julian to factor out the “training” convenience functions into another package. I do appreciate having them in the ecosystem, mostly because writing your own momentum, Adam, etc. is less than fruitful. But bringing those things forward as “ingredients” with documentation/examples is the way to go in my opinion.
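As a rough illustration of the "ingredients" idea, the sketch below treats gradient handling (clipping, sign flipping) as small functions that sit between the gradient computation and the parameter update. It is plain Python on lists of floats, not Flux or PyTorch code, and all names (`clip_by_norm`, `flip_sign`, `chain`, `sgd_step`) are hypothetical.

```python
def clip_by_norm(grads, max_norm):
    """Rescale the gradient list so its L2 norm is at most max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def flip_sign(grads):
    """Negate every gradient, e.g. to ascend instead of descend."""
    return [-g for g in grads]

def chain(*transforms):
    """Compose gradient transforms; they are applied left to right."""
    def apply(grads):
        for transform in transforms:
            grads = transform(grads)
        return grads
    return apply

def sgd_step(params, grads, lr):
    """Plain gradient descent update on a list of parameters."""
    return [p - lr * g for p, g in zip(params, grads)]

# Clip to unit norm, then flip the sign, then take an SGD step.
process = chain(lambda g: clip_by_norm(g, 1.0), flip_sign)
raw_grads = [3.0, 4.0]  # L2 norm is 5, so clipping rescales it
processed = process(raw_grads)
new_params = sgd_step([0.0, 0.0], processed, lr=0.1)
print(processed, new_params)
```

The point of the design is that the user's loop decides which transforms run and in what order, and the optimiser step stays a separate piece.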

That’s the goal of lorenzoh/FluxTraining.jl on GitHub, a flexible neural net training library. We’ve had many a Zulip discussion about training loop design :slight_smile:


So training loop design is something very close to me, and I feel that the train! interface is somewhat simplistic by design. The underappreciated (I may be biased here :slight_smile:) fact is that the Flux API is really a forward pass, backward pass, optimise loop, so that’s where the flexibility really lies. The train! sugar on top is meant for times when the simple case is all that is needed. Beyond that, the Training page of the Flux docs shows that adding logic to the training loop is far more straightforward than anything a restrictive API could generically provide. Providing prewritten callbacks and such would let us avoid writing glue code, though.
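The forward pass, backward pass, optimise structure can be sketched without any framework. The example below is plain Python fitting a one-variable linear model with a hand-written gradient, with callbacks passed in as ordinary functions. It only illustrates the loop shape and is not Flux's implementation; `train`, `mse_and_grads`, and the callback signature are all assumptions made for this sketch.

```python
def predict(w, b, x):
    """Linear model y = w * x + b."""
    return w * x + b

def mse_and_grads(w, b, data):
    """Forward and backward pass: mean squared error and its gradient."""
    n = len(data)
    loss = dw = db = 0.0
    for x, y in data:
        err = predict(w, b, x) - y
        loss += err * err / n
        dw += 2.0 * err * x / n
        db += 2.0 * err / n
    return loss, dw, db

def train(data, epochs, lr, callbacks=()):
    """Explicit loop: forward, backward, optimise, then run any callbacks."""
    w, b = 0.0, 0.0
    for epoch in range(epochs):
        loss, dw, db = mse_and_grads(w, b, data)  # forward + backward
        w, b = w - lr * dw, b - lr * db           # optimise (plain SGD)
        for callback in callbacks:                # extra logic lives here
            callback(epoch, loss)
    return w, b

history = []
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # samples of y = 2x + 1
w, b = train(data, epochs=2000, lr=0.05,
             callbacks=[lambda epoch, loss: history.append(loss)])
print(w, b, history[-1])
```

Because the loop is written out, adding logging, early stopping, or gradient tweaks is a matter of inserting a line at the right step rather than working around a fixed entry point.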


I would be happy to include it in a blog, but I don’t have one of my own so somebody else would have to offer to let me post it on theirs.

I think I might go back and edit the post to emphasize VS Code more, since this has changed since earlier in the year. I could explain my workflow with Revise a bit more, which I never got around to doing, but somebody else recently posted essentially my exact workflow (with a few more points that I’m going to steal) here. I think this is all relevant to ML, which uses scripts extensively IIUC.

There’s also DaemonMode.jl now which may help.


Why do you people use Zulip - it’s madness. Discord for life!

@dhairyagandhi96 - yeah, but what I’m saying is: just give me the ingredients, I don’t want anybody’s loop. :slight_smile:

Very much agreed on the philosophy of flexibility.

Regarding train!, I think @ChrisRackauckas put it best in another thread here: simplicity of use does not require simplicity of implementation. For example, is an order of magnitude longer (counting LOC between it and the first call to gradient) than Flux.train!, but you wouldn’t be able to tell that from looking at the interface. Both are arguably just as simple to use, but one allows for extensibility while the other shirks it in the name of a straightforward implementation.

I guess my heuristic would be this: given a high-level convenience function, how easy is it to add extra functionality, or to “desugar” it when that’s not possible? IMO one hits the first limit quite early when using train!, which is unfortunate because a lot of beginners seem uncomfortable with doing the latter and writing a custom loop. I think we’ve all seen our fair share of questions where the asker persists in using train! and contorts their code to work with it instead of writing a plain loop, just because it feels like the “officially sanctioned” way to do things.

This is but a portion of the topics under our ML Coordination Stream on Zulip. If we had to fit this into Discord’s threading and search, I’d probably just quit posting :slight_smile: