Is it a good time for a PyTorch developer to move to Julia? If so, Flux? Knet?

Julia is amazing and has the potential to be much better, easier to use, and faster than PyTorch. It can already do things that PyTorch cannot; however, I’d say the DL ecosystem is not mature enough to give you the smooth experience you’re used to. Unless you want to dive in and help work out bugs, missing kernels, memory issues, etc., I would check back in after a couple of months. Once it reaches that level of usability for the average user, I think it will take off.

However, if you are doing a lot with custom GPU kernels (possible in pure Julia), scalar operations (much faster), or neural-ODE-type work (faster, with a more vibrant ecosystem), Julia is already far ahead.
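To illustrate the scalar-operations point: an element-by-element loop, which in Python/PyTorch you would have to vectorize (or push into C++) for speed, compiles to native code in Julia. A minimal, framework-free sketch:

```julia
# A plain scalar loop; Julia compiles this to native machine code,
# so there is no need to vectorize or drop into C++.
function mysum(xs)
    s = zero(eltype(xs))
    for x in xs      # element-by-element "scalar" operations
        s += x
    end
    return s
end

mysum([1.0, 2.0, 3.0])  # == 6.0
```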


@Alon welcome!

Yes and Flux.
Be aware that some things still need to be polished.
I’ve personally found the benefits of Julia outweigh the costs.

The best way for you to answer the question is to take it for a test drive.
If I were you, I would download Julia and the Juno IDE (which has auto-completion and a debugger).
Then work through some Flux examples.

@Alon what do you currently use PyTorch for? Images?


doomphoenix-qxz amazing overview! thanks!
mkschleg Ratingulate Albert_Zevelev thanks so much for the insights!
I think that you can definitely add Amazing Community to the benefits of Julia.

Yeah, I’ve been doing Images (medical images) for the past few years. Unfortunately, I’m not in a position to contribute bugfixes and features at the moment so I’m looking at Julia from a regular user perspective. However, having all those potential performance benefits + IDE + debugger + a good start for a DL framework is more than enough to start some nice side project with Julia, and who knows where it will go from there :wink:


I have also switched from Pytorch.

Within a few years I think the strengths of Julia will place it far ahead of PyTorch and the others:

  • PyTorch requires its underlying code to be written in C++/CUDA to get the needed performance, which means roughly 10x as much code to write.

  • With Flux in particular, native data types can be used. This means you can potentially take the gradient through some existing code (say, a statistics routine) that was never intended for use with Flux. To do this with PyTorch, you would have to re-code the equivalent Python to use torch.xx data structures and calls. Because of this, the potential code base for Flux is already vastly larger than for PyTorch.

  • Metaprogramming. I think there is nothing like it in other languages — certainly not in Python, nor C++. Among other things, it allows creating domain-specific languages; JuMP and Turing are examples, I think.
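To illustrate the "native data types" point above: with Zygote (Flux's AD backend), you can differentiate through an ordinary Julia function that was never written with AD in mind. A minimal sketch — the `stddev` function here is just an illustrative stand-in for "some existing statistics routine":

```julia
using Zygote   # Flux's automatic-differentiation backend

# An "existing" routine written with plain Julia arrays and numbers,
# no special tensor types anywhere:
stddev(xs) = sqrt(sum(abs2, xs .- sum(xs) / length(xs)) / (length(xs) - 1))

# Zygote differentiates it with no code changes:
g = gradient(stddev, [1.0, 2.0, 4.0])[1]   # gradient w.r.t. the input vector
```

The equivalent in PyTorch would require rewriting `stddev` against `torch.Tensor` operations first.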

Multiple dispatch, Unicode/LaTeX variable names, and other things are also beautiful, though in my opinion they give smaller productivity increases than the 10x items mentioned above.

However, I did not find it effortless. There is a lot to learn, and Flux itself has changed rapidly: over the last year there was a transition from the older Tracker (somewhat similar to PyTorch) to Zygote (which allows plain data types, as mentioned above). Some of the examples are not up to date, and I think the same is true for a bit of the documentation. Things seem to be moving fast, however.

Also the Flux community seems (in my perception) to be mostly focused on differential equations, not so much on machine learning.

Because of the examples-and-documentation problem, several people have recommended just doing a GitHub code search (extension:jl “using Flux”, sorted by most recent) to see what fluent people are actually doing. This has been quite helpful.

Knet has a smaller community. It’s a partial tribute to Julia (as well as to the Knet and Flux authors) that these packages are potentially competitive with PyTorch, with probably 100x less person-effort. As far as I know, Knet’s autodiff is similar to PyTorch’s and does require a custom array data type, though standard operations can be used on it.
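For context on Knet's autodiff: it is built on AutoGrad.jl, where you wrap values you want gradients for in `Param` and record a computation with `@diff`. A rough sketch (API details may have shifted between versions):

```julia
using Knet   # re-exports AutoGrad's Param, @diff and grad

x = Param([1.0, 2.0, 3.0])   # mark x as something to differentiate
y = @diff sum(abs2, x)       # record the computation sum(x.^2)
grad(y, x)                   # gradient w.r.t. x, i.e. 2x
```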


Hello and welcome!
I have been using Knet for the past year or so for training conv nets, and I’m happy with it. It might not be as mature as PyTorch (for example, it was missing a skip-connection layer, but I rolled my own quite easily), but it gets the job done. One area where both packages lag behind a little is the ability to load ONNX models (don’t get me wrong, there are Julia packages for this, but you won’t always be able to load any model you want, most likely because of some funky layer that hasn't been implemented yet in Knet or Flux). In that situation you might need to write your network and load data by hand, or, if you’re not in a hurry, contribute to the packages to make them better.
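Rolling your own skip (residual) connection, as mentioned above, really is only a few lines in Julia. A hedged sketch of a generic callable wrapper — the name `Skip` is mine, not a Knet or Flux API:

```julia
# A hand-rolled skip/residual connection: output = input + layer(input).
struct Skip{L}
    layer::L
end
(s::Skip)(x) = x .+ s.layer(x)

# Works with any callable "layer":
block = Skip(x -> 2 .* x)
block([1.0, 2.0])   # == [3.0, 6.0]
```

For what it's worth, Flux has since gained a built-in `SkipConnection` layer along these lines.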
Back to Knet. Since its user base is smaller, you will find fewer answers and tutorials via Google search. I recommend the documentation and the examples on GitHub (I come from MATLAB and was used to only searching Google; now my mindset has changed and I dig inside GitHub repos: I usually find what I need and, as a bonus, get to look at the implementation). Yes, you won’t get answers to specific questions as fast, but the benefits are greater from my point of view.
About the debugger: I mostly use Juno with Atom, where I can use the graphical interface (step in, step over, etc.). There is a little tickbox to switch between interpreted mode and compiled mode (faster, but it will only stop at breakpoints in the current function). If you are a bit patient, you can “step” your way even into the backpropagation pass of Knet and see how gradients are taken and how the optimizer updates those matrices. It’s cool! It takes a bit of time to get accustomed to, until you understand which expressions to skip and which to step into. I find it nice for learning purposes, but you won’t need this in your day-to-day work, since the core is quite stable.
About what you can achieve (or rather, what I achieved in my one-year learning session on conv nets, with sporadic effort): I managed to roll my own networks for traffic-sign classification, augment data, and train on GPUs. Using already-available YOLOv2 code, I’m working on extending this to real-time detection. And yes, I managed to train that network on my local GPU with Knet.
To summarize: there will be cases where it’s not all copy-paste-run, but the community and the learning benefits outweigh these inconveniences (valid for Knet, but also for Flux).


We added a debugger in the IDE to the VS Code extension recently as well! See here.

The main underlying debugging engine in Julia right now is JuliaInterpreter.jl. We then have three different front-ends: Juno, the VS Code extension, and the REPL-based Debugger.jl. The three front-ends are independent of each other.

While it is great how much progress we have made with debugging in Julia, I do think it is important to point out that this is an area that is still very rough. If you come from Python and are used to some of the excellent debuggers there, just be warned that none of the options in Julia-land right now will give you an experience that is as smooth, polished, and fast.


I’d just like to add that in the very short term (until some tracing and compiler work gets done), you’ll often find that typical GPU-heavy models are slower, while CPU-bound ones are faster. This is mostly due to memory management.


Do you mean that:

  1. Models on GPU will be slower than on CPU, or
  2. Models on GPU in Julia will be slower than the same models on GPU in PyTorch, while models on CPU in Julia will be faster than models on CPU in PyTorch?

Also, can you elaborate on “some tracing and compiler stuff gets done”? I have some features pending improved performance of code compilation/tracing, so any news on that front is appreciated.

As someone who is also doing healthcare/medical research, these are early but exciting times in the Julia space! @dilumaluthge recently announced, so maybe something like Metalhead.jl or the Flux model zoo is in order.

WRT frameworks, you may want to check out @dhairyagandhi96’s Torch.jl. It’s not the full PyTorch API, but as I understand it, it should (eventually) allow you to use many of the kernels libtorch exposes.


Didn’t know that, but that’s excellent! Keep up the good work :smiley:

It’s not either-or; see DiffEqFlux.jl: “Neural Ordinary Differential Equation (ODE)? […] This looks similar in structure to a ResNet, one of the most successful image processing models.”

The Neural Ordinary Differential Equations paper has attracted significant attention even before it was awarded one of the Best Papers of NeurIPS 2018. […]

What do differential equations have to do with machine learning?
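The ResNet analogy in the quote can be sketched without any package: a residual block computes `x + f(x)`, which is exactly one explicit-Euler step of the ODE `dx/dt = f(x)` with step size 1 (here `f` is a stand-in for a small learned layer):

```julia
f(x) = tanh.(0.1 .* x)               # stand-in for a learned layer

resnet_block(x)  = x .+ f(x)          # residual connection
euler_step(x, h) = x .+ h .* f(x)     # explicit Euler; h = 1 recovers the ResNet block
```

A neural ODE takes this one step further and hands `f` to a proper ODE solver instead of fixing the step count and size, which is what DiffEqFlux.jl packages up.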


Yes, that paper (Neural Ordinary Differential Equations) was very important and innovative. If someone put it on a list of the 20 most innovative ideas in the history of deep learning, I would not say that was out of place. It has not had much practical impact yet, but in some cases new ideas take time to spread.

But it seems to be the one exception: among the ~20,000 (my guess) papers published each year, there are tens or perhaps even a hundred that are also important and innovative, but receive no attention here.


This Estadistika post, “Exploring High-Level APIs of Knet, Flux, and Keras”, is a good side-by-side comparison, with a favorable performance comparison.


Please note that the model used for benchmarking is quite tiny:

The model that we are going to use is a Multilayer Perceptron with the following architecture: 4 neurons for the input layer, 10 neurons for the hidden layer, and 3 neurons for the output layer.

Most likely the timing for TF consists mostly of initialization overhead.
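For reference, the benchmarked model really is tiny; in Flux it is roughly a one-liner (the activation functions here are my assumption — the post may use different ones):

```julia
using Flux

# The 4 → 10 → 3 MLP described in the quoted post:
model = Chain(Dense(4, 10, relu), Dense(10, 3), softmax)

model(rand(Float32, 4))   # length-3 probability vector
```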


Absolutely… it’s not a fair comparison. However, the review in general is awesome, and I love the triple implementation in Keras, Knet, and Flux. The only thing missing is PyTorch :wink:


@Alon PyTorch can be added.
QuantEcon has:
A numerical cheatsheet comparing: MATLAB-Python-Julia.
A statistics cheatsheet comparing: STATA-Pandas-R.
Potential users (like you) could benefit from a deep learning cheatsheet comparing:
Knet-Flux-Keras-PyTorch …
This could have a nice place in the Flux README?


If anyone knows pytorch well enough, it might be useful to have that as an additional comparison?


Wow the statistics cheatsheet is great! Once DataFrames hits 1.0 we should submit a PR for this.



In the short term,

In the medium term, and with future PRs building on it, this will allow a faster and more composable Zygote, Cassette, and other code-transform packages built on typed IR.

A side question: why does Flux.jl have a dependency on Juno.jl? Will any Flux features be missing if I use VS Code with the Julia extension instead of Atom with Juno?