Knet vs MXNet for programmer new to ML



I’m an old programmer, but new to deep learning. I’ve done a lot of reading and simple tutorials to get a handle on the basics. I was getting ready to start a deep dive on the MXNet and Gluon documentation and tutorials with the hope that I will get a practical foundation of knowledge to build on and eventually do my own projects. I was excited to see that MXNet supported Julia, which led me to find Knet. I like the idea of a Julia based framework.

What do you think about using Knet and Julia as a base to build a better understanding of the math and algorithms?

Is the Julia/Knet community active and responsive?

Are Knet/Julia’s tools mature enough that I’ll eventually be able to do real projects for clients without constantly running into limitations as compared to using something like MXNet and Python? I’m planning to ramp up over the next 5 months.

I apologize for the long post. Thanks!


The Knet and Flux communities are both active and responsive, and I would say there is a huge benefit in having one language (Julia) all the way down when learning machine learning.

I used to work with/on MXNet and I like it a lot more than, let’s say, TensorFlow, but I also ended up having to dive into the C++ part of MXNet to get my work done, and back then I wished I could have used a pure Julia environment.

Your mileage might vary, but I would encourage you to give Knet and Flux a go!


I agree that if you want to really learn how things work you should go with either Knet or Flux. TensorFlow and MXNet have really complicated code bases. One of the wonderful things about Julia is that the need for a big, complicated machine learning framework with a huge code base pretty much disappears. I’m not too familiar with how Knet works, but Flux is pretty much just normal Julia objects collected in a convenient place. It’s really a thing of beauty next to TensorFlow and MXNet, and I hope that the machine learning community comes to love its extreme simplicity in the years to come.

Despite all the hype, deep learning (and to some extent, machine learning more generally) is based on a quite simple and quite old idea: just take some function with a huge number of parameters and try to fit it to your conditional distribution. Naively it’s somewhat surprising that this is a viable approach; the central “miracle” behind it is explained here. The rest of the subject involves finding the appropriate ansatz to fit to (e.g. multi-layer perceptron, the whole zoo of convolutional nets, recurrent nets). You may be disappointed to learn that generally the people coming up with all the clever little variations don’t really seem to know why they work better or worse than any others (in many cases I would imagine this would require a deeper understanding of what generates the underlying distribution), so as a practical matter working on deep learning involves a lot of experimentation and just doing “whatever works”. For this you’ll probably be much happier with something like Flux than with a behemoth like TensorFlow, for which you’ll constantly need to dig through reams of documentation.

(Since discourse pinged me telling me that so many people clicked on the IB paper link, I wanted to make sure the interesting rebuttal paper kindly linked by @dave.f.kleinschmidt was also visible in the same place. In my assessment IB likely has a very significant role to play in explaining deep learning success, but it’s early days and it’s good to be aware of all the facts.)


Quite honestly, I’ve tried quite a few frameworks (in different languages as well) and Flux.jl code is the only code that I can confidently say I can read and understand. Everything else gives me some trouble and I have to sit down with a pen and paper to work out what the math is. So for beginners…


To me the answer depends on your primary goal.

  • If it is to understand DL beyond the basic neuron-layer abstraction then Knet is a very good choice, as it teaches you to think in terms of arrays and functions on those (instead of layers with neurons and activation functions), especially if you don’t mind putting in extra time to implement a bit of utility code yourself. It’s a fantastic library and I use it all the time.

  • If it is to get proficient with DL quickly in order to do real projects where money is involved, then I think you will currently have an easier time doing that in Python and TensorFlow. I say this because TensorFlow offers you very easy ways to scale up and out, to train on the cloud, to deploy your models sensibly, to inspect progress with TensorBoard, etc.

(I don’t know enough about MXNet or Flux to give advice there)
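To make the "arrays and functions" point concrete, here is a minimal sketch in plain Python (not Knet code, and not Julia, just to keep it dependency-free). The names `predict`, `matvec` and the hand-picked weights are my own illustrative assumptions; the idea is only that the whole model is one function of a list of parameter arrays, with no layer objects anywhere.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def add(u, v):
    """Elementwise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def relu(v):
    """Elementwise max(0, a)."""
    return [max(0.0, a) for a in v]

def predict(params, x):
    """A two-layer perceptron written as a single function of arrays.

    params is a flat list [W1, b1, W2, b2]; there are no layer objects.
    """
    W1, b1, W2, b2 = params
    hidden = relu(add(matvec(W1, x), b1))
    return add(matvec(W2, hidden), b2)

# Hypothetical hand-picked weights: 2 inputs -> 2 hidden units -> 1 output.
params = [
    [[1.0, -1.0], [0.5, 0.5]],  # W1
    [0.0, -1.0],                # b1
    [[2.0, 3.0]],               # W2
    [0.1],                      # b2
]
y = predict(params, [3.0, 1.0])
```

Once the model is just a function like this, a loss is another function of `params`, and training is whatever you do with its gradient.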


Just my $0.02 based on personal experience with Knet and Flux over the past 4 months.

A bit of background: I started learning Julia ~5 months ago. No previous expertise with deep-learning / NNs, but a lot of experience with ML in general based on R. Probably about 6 years of intensive programming experience with R (developing packages, etc). About ~3-4 months ago I started experimenting with deep learning using Julia. The choice was to either use Python or Julia (R until recently had little to offer). I chose Julia and have been going back and forth between Knet and Flux, probably 50/50. Both have their own advantages, but I think neither is production ready. With much regret I must say that I wish I had chosen Python and am now in the process of slowly transitioning to PyTorch.

Flux: While I absolutely love the beauty and simplicity of Flux, it is not a production ready library. Hopefully it can get there one day, but I think it’s years away. The main problem is optimization over GPUs: with Flux it is virtually non-existent. The moment you try to use custom losses (i.e., those not included in standard examples), you are going to run into major issues. Since most of actual deep learning requires GPUs, well, you are in trouble.
Just for fun, try using Flux on GPU with a custom loss that takes something to a power > 2.0, i.e., x^3.0 or x^5.0. Good luck with that. How about a loss that requires you to simulate from some non-uniform distribution, i.e., normal? Good luck with that as well. How about using Float32 instead of Float64 (gives a big boost in performance on GPUs)? Nope, no can do. Even though these are mostly CuArrays’ issues (except for the last one), it doesn’t really matter – the end story is that CuArrays is currently very lacking in terms of GPU support. Designing and optimizing GPU kernels, even with Julia magic, is a very labor intensive process (although I am no expert by any means); that’s why I say Flux is years away from being production ready. Having said that, I had a ton of fun hacking away at Flux (and continue to). Mike’s implementation is a thing of beauty, some of his Julia code is just on another level. For example, take a look at the Flux implementation of backprop with Adam optimizer – there is so much Julia magic there it hurts. Sometimes the source code can be very hard to understand (for a newbie), but it’s a very rewarding experience when you eventually get it.

Knet: Great library if you are trying to learn about NNs, trying to understand the design and implementation (as well as learn some Julia). Everything is low-level, down to matrices. For example, Knet’s low-level implementation of LSTMs has helped me a great deal in understanding these models. Staring at and comparing the Flux vs. Knet implementations of LSTMs is also super helpful. Knet is well optimized for GPUs, has a ton of examples (maybe too many) and is extensively benchmarked against TensorFlow. The problem with Knet is that it always feels like a bandaid and the code base is an absolute mess. For example, Knet uses its own implementation of the auto diff, which is just a port of the Python package. Then why not just use Python directly? Knet sort of defies the whole point of using Julia. Doing anything more advanced and custom with GPUs (beyond the basic examples) is going to get you in trouble as well. BTW, here Knet relies on its own implementation of GPU arrays, which provides a bit more support than CuArrays, but it’s still not enough. Try using higher-dim tensors and subsetting / slicing beyond the first index. Good luck with that. There are so many versions of Knet that it often leads to a ton of confusion. Sometimes you find support / a function that is supposed to work only to discover that it’s no longer supported. Basically I feel like Knet is in need of a massive rewrite.

So my final suggestion would be to stick with Python, but hack around with Julia anyways, as it could be a nice learning experience.


Nobody would deny that Flux (and its dependencies) are at an early stage, but I’m a little surprised to hear you describe it as “years away” from “production ready” (which I am taking to mean “reasonably performant”). I am admittedly a casual observer but nothing about CUDA strikes me as requiring “years” of work for the additional specializations needed to fix the existing performance issues. I would be interested to hear more detail about why that might be the case if you or one of the JuliaGPU people would care to chime in about it.

On a separate note, there’s something to be said for the fact that the problem has been reduced from “write generic machine learning framework” to “write generic GPU array implementation”.


Fair enough, this is just my very subjective (and uninformed) opinion. Not trying to start a war or anything. BTW, I am not even talking about performant CUDA code here, the code just working on GPUs would already be huge. Right now, the kernels that don’t exist have to be executed on CPUs, which is nuts for any substantially large deep learning job.

I’d love to give back to the community and help in any way I can, but I am totally out of my league when it comes to CUDA. Perhaps someone can point me to an accessible guide to writing CUDA code for Julia, especially as it relates to CuArrays? For example, can someone explain in a simple, comprehensible way, how one could take a Python kernel and port it into CUDA, say for CuArrays? Please don’t point me to the much circulated blog post on this topic, it’s just way way too advanced. I think there are other Julia newbies who might share this sentiment, but of course, just my opinion.



The docs are pretty straightforward.


Thanks for your reply and the link to the journal article. Pretty cool!


Thanks for sharing your experience as a recent convert to Julia’s machine learning ecosystem. This is very helpful to me.


Thanks for the perspective. Keeping the end goals in mind is always important.


Several people praised Flux, so I checked it out. I love the approach! It really seems to take advantage of Julia’s strengths. Someone needs to get this in front of a deep learning team at a big company to help support its development. Once GPU support is improved this could be Julia’s killer app!


No worries, I certainly did not mean to imply that you wanted to “start a war”, or that I did. (I try not to always let it come through that I probably have a little bit of an ax to grind with Python, even though I’ll certainly admit that it has many virtues.) Clearly the things you talked about are legitimate issues that must be addressed. Are there relevant issues open for the appropriate packages?

Anyway, I am also relatively uninformed about CUDA and CuArrays; but here is my impression of the overall scheme behind Flux (everyone please correct me where I go wrong!):

  1. LLVM is used to generate GPU machine code from pure Julia. This happens in CUDAnative.jl.

  2. Julia metaprogramming and code generation tools are used to generate a lot of the pedantic low-level CUDA code in kernels (e.g. addition can be written for arrays of arbitrary rank).

  3. GPU memory buffers are wrapped in Julia AbstractArray objects; this is what CuArrays.jl does. All the appropriate “nuclear” operations must be defined using the appropriate GPU kernels, which can usually be written conveniently and succinctly with the steps above.

  4. Implement automatic differentiation using Grassmann numbers. If the code is generic, nothing will stop you from using them with CuArrays. This is done in ForwardDiff.jl and its dependencies.

  5. Armed with AbstractArray abstractions of blocks of GPU memory and a Julia implementation of Grassmann numbers, you now have the full machinery of Julia and multiple dispatch at your disposal. Now all you need to do is write perfectly normal and generic Julia code that works on any AbstractArray type. If you want to do it on the GPU, just use CuArrays. In many cases you can just do this in the most naive way and it’ll work fine. A library of convenient functions for deep learning is gathered together in the form of Flux.jl, which now only needs to be beautiful, generic code.
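For anyone who hasn’t seen the trick in step 4: the “Grassmann numbers” are usually called dual numbers in the automatic differentiation literature, i.e. values of the form a + b·ε with ε² = 0. Below is a rough sketch of the idea in plain Python (not ForwardDiff.jl’s actual implementation; the `Dual` class and `derivative` helper are names I made up for illustration). Any generic function built from `+` and `*` gets differentiated for free, which is the same reason generic Julia code works with these number types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    """A number val + eps*e with e*e == 0; eps carries the derivative."""
    val: float
    eps: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule falls out of (a + b e)(c + d e) with e*e = 0.
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

    __rmul__ = __mul__

def _lift(x):
    """Promote a plain number to a Dual with zero derivative part."""
    return x if isinstance(x, Dual) else Dual(float(x))

def derivative(f, x):
    """Evaluate f at x + 1*e and read the derivative off the e part."""
    return f(Dual(x, 1.0)).eps

def cubic(x):
    # Generic code: works on floats and on Dual alike.
    return x * x * x + 2 * x + 1

slope = derivative(cubic, 2.0)  # d/dx (x^3 + 2x + 1) at x = 2
```

Note that `cubic` knows nothing about differentiation; it is the number type that does the work.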

So, to fix the slowness of the polynomial objective functions, the first thing I would do would be to figure out which CuArray operations are causing the problem. This would probably be the most complicated part of solving it, as you might have to pick apart everything that’s happening during backpropagation and gradient descent to find the culprit. Once it’s found, I really think that writing the appropriate kernels to fix it using CUDAnative is probably not going to be all that difficult (of course, your mileage may vary depending on your knowledge of CUDA). Perhaps I’m wrong, I’ve only ever casually messed around with it, but I really don’t think we are talking about any crazy complicated stuff here. The problem with Float32 is surprising to me, I don’t see any reason why CuArrays wouldn’t support this, so we’ll have to ask @MikeInnes or others. Clearly at some point we need to write a kernel for sampling from Gaussians on the GPU; but this seems like a pretty low priority to me since it’s pretty fast on a CPU (do you have to do it that many times? you really just have to do it when you initialize your weights, correct?).

Anyway, that’s just a rough overview of what I think I know from casually following this stuff. I’d love to offer more immediate help, but I’m having hardware availability issues at the moment (gaming machine with GTX 1080 is depressingly locked on Windows. I wish Valve would have thrown more of their weight around to make Linux gaming more viable so I can develop and game on one system… sigh…). Anyway, if anyone more knowledgeable wants to come along and correct me wherever I went wrong, I’m sure it would be very educational for all of us.


That doesn’t really seem like a necessity, but if some developers wanted to make some helpful PRs and are permitted to while they’re on the company clock, I’m sure we’d all appreciate that.


I’d love to have issues for all these things, if you have time. We know there’s plenty of work to do in many of these cases, but it always helps to have people poke these things so we can track and prioritise. Some of them may also be simple setup issues that we can fix easily; e.g. Float32 support should certainly not be a problem. At this stage, all the foundational stuff should work well and be reasonably performant on GPUs, and if it’s not that’s a bug.

Optimisers are actually the part of Flux’s interface that I’m least happy with right now (which is largely why they are not documented beyond basic usage). I’m not happy that they are magical and want to redesign something that’s simple and powerful.


Actually this happens all the time in Bayesian deep neural networks. So for me the gaussian sampling is absolutely crucial since it’s part of every forward pass. But I may be the odd bird here. I currently run all my neural network modeling in pytorch but would love to migrate to Julia since I really like the language and Flux is indeed a beauty. Just wanted to raise a use case flag here. :blush:


Ah, that would make sense. In that case, one of us should probably work on this. I think it would be doable with existing kernels using CuArrays only, but I haven’t looked at it yet.


@ChrisRackauckas and @ExpandingMan Thanks for pointing me in the right direction. This is exactly what I was looking for!

It’s not a CuArrays problem, those work fine with Float32. The problem has to do with some Flux constructors, like Dense. Basically there are currently no Float32 versions of the random weight initializers. I wrote my own very hacky version that just overwrites the Flux constructors for Dense and LSTMCell, but it’s not elegant at all. It would be nice to just tell Flux to use Float32 throughout, rather than having to redefine the defaults of all constructors. In the grand scheme of things, this is a very minor issue though.

This is a bigger issue than just weight initializers. I need Gaussian samples to define the loss. One way to think of it is, say, for each SGD update step of batch size nbatch, I need to sample a corresponding nbatch number of Gaussians. Currently this is done on CPUs. It does seem like there has been progress on this with CuArrays in the last 4 days, so I will be checking it out very soon.
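To show the shape of the problem, here is a toy sketch in plain Python (CPU only, and not my actual loss; `noisy_loss`, `mu`, `sigma` and the targets are made up for illustration). The point is just that a fresh set of nbatch standard normals is drawn inside every loss evaluation, so the sampler sits on the hot path of every SGD step rather than only at weight initialization.

```python
import random

def sample_noise(nbatch, rng):
    """Draw nbatch independent standard normal samples."""
    return [rng.gauss(0.0, 1.0) for _ in range(nbatch)]

def noisy_loss(mu, sigma, targets, rng):
    """Mean squared error of the draws mu + sigma*eps against the targets.

    One Gaussian sample is drawn per example in the batch, every call.
    """
    eps = sample_noise(len(targets), rng)
    draws = [mu + sigma * e for e in eps]
    return sum((z - t) ** 2 for z, t in zip(draws, targets)) / len(targets)

rng = random.Random(0)               # seeded so runs are repeatable
targets = [0.5, 1.0, 1.5, 2.0]       # hypothetical batch, nbatch = 4
# Three "SGD steps": each one consumes a new batch of Gaussian noise.
losses = [noisy_loss(1.0, 0.1, targets, rng) for _ in range(3)]
```

If those draws have to come from the CPU while everything else lives on the GPU, you pay for a host-to-device copy on every step.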


Thanks for replying! I’ll open issues. I think polynomial stuff is probably the most important right now. It’s definitely a bug. My workaround is to just use .^2 and .* to get the needed polynomial power :smile:
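In case it helps anyone else, the workaround amounts to building an integer power out of elementwise squaring and elementwise multiplication only. A small sketch of the idea in plain Python lists (the function name `elementwise_power` is just something I picked for illustration; in Julia the two primitives would be `.^2` and `.*`):

```python
def elementwise_power(xs, n):
    """Raise every element of xs to the integer power n >= 1,
    using only elementwise squaring and elementwise multiplication."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    result = None
    base = list(xs)
    while n > 0:
        if n & 1:
            # Multiply the current power of x into the running product.
            result = base if result is None else [r * b for r, b in zip(result, base)]
        n >>= 1
        if n:
            base = [b * b for b in base]  # square: x -> x^2 -> x^4 -> ...
    return result

xs = [-2.0, 0.5, 3.0]
cubes = elementwise_power(xs, 3)   # x^2 * x
fifths = elementwise_power(xs, 5)  # (x^2)^2 * x
```

This only covers positive integer exponents, which is all I needed for polynomial losses.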