State of deep learning in Julia

Although Julia is promoted as an excellent language for deep learning, I still don’t see any framework I could use in production or even in long-term research. Here are the options I considered in different periods of time:

MXNet.jl

MXNet.jl is a Julia interface to the core library written in C++ (and Python?). Like any wrapper, MXNet uses borrowed data structures and doesn’t feel “native”, which in practice usually means that the library doesn’t play well with other common libraries. But the biggest concern is the user base - I don’t see much interest in MXNet in the community, nor a high pace of development.

TensorFlow.jl

I remember it being quite popular a year or two ago, but just like MXNet it seems to attract little interest nowadays. Also, even though it includes the latest innovations from TF 2.0, I’m not sure how complete the Julia API is compared to the Python version.

Knet.jl

To my mind, this project is one of the best examples of good programming style and software management in general. Knet maintains very good backward compatibility, has excellent documentation and shows high performance. Knet comes with its quirks though - I still can’t get used to wrapping everything into KnetArray and writing predict(w,x) = w[1]*x .+ w[2] instead of predict(w,b,x) = w*x .+ b. Maybe one day I will stop worrying and learn to love the API, but that day hasn’t come yet.

(There’s also a more personal concern about automatic differentiation - after having designed 4 AD packages I have a pretty strong opinion on how it should be done, and AutoGrad.jl doesn’t match those criteria.)

Flux.jl

Flux is the most frequently recommended framework for deep learning in Julia; however, I don’t see it as practical, for two reasons.

Firstly, Flux (and its underlying library NNlib) has an extremely unstable API. Functions and types get removed or replaced without any deprecation or prior notice. I was the first to introduce conv2d() to NNlib (actually borrowed from Knet), but after a couple of months it was rewritten in pure Julia and renamed to just conv with a new argument list (boom! my own code that depends on NNlib suddenly stopped working). Seven months ago the API changed again - conv got a new required argument, ConvDims. By the way, ConvDims has no docs on how to properly construct it, and if you try to follow the comments in the source code, you will find that they are already out of date and that you should use DenseConvDims instead.

This instability spreads to Flux itself - with the latest Flux v0.8.3, a large portion of the Model Zoo is broken because of the changed maxpool(), which now requires a new argument of type PoolDims (and I’m still looking for the proper way to use it). I don’t want to blame anyone - after all, it is you who create the value - but please remember that keeping pace with the changes may be quite painful for someone not closely following the project.

Secondly, Flux is slow, and there’s very little activity to make it faster. For example, in one benchmark (Knet vs Flux etc.) Flux was ~3 times slower than Knet. There’s hope that Zygote - an upcoming AD engine - will fix this, but so far my experience has been the opposite (there’s also a more recent issue on a performance regression compared to the current Tracker).

My latest ResNet-based siamese network in PyTorch took 3 days of training on a cloud Tesla V100 to produce the first meaningful results. Using a framework that is 3-10 times slower is just impractical in such settings.

Please share your experience, suggestions and, if you have one, your grand plan for the development of deep learning infrastructure.

39 Likes

Just curious, is TensorFlow.jl still under maintenance?

Knet is probably the best package for “standard” deep learning. Flux is the easiest to make work for scientific machine learning, i.e. stuff like neural PDEs (and it’s probably the best out there for that right now). I tend to do the latter, so I mention Flux a lot, but I definitely agree that if you just want to make some convolutional neural networks and train them to do classification, it doesn’t do as well as Knet.

Also, I didn’t like Knet’s interface before (it was too low level too much of the time, IMO), but it’s come a long way and I am starting to like how it looks now.

8 Likes

I had gotten the impression that Flux was totally focused on Zygote. Indeed, looking at the commits, that seems to be the case. If they are fully committed to Zygote, it would make sense for them not to do any real performance work until Flux uses Zygote.

Of course, I can’t comment on the feasibility of making such a thing performant. I would be very interested to hear from any of the Flux or Zygote developers about what their plans are for performance, and whether they’ve done much preliminary testing. It was my hope that Flux would mature and become progressively more desirable, but I share the OP’s concern that I haven’t heard a peep about the bad performance situation.

3 Likes

First, a small update on Knet: both AutoGrad and Knet have evolved significantly since the days of predict(w,x) = w[1]*x .+ w[2] – an unfortunate style forced by the original API of the Python autograd (on which AutoGrad.jl was based), which collected differentiable parameters in the first argument. Starting with Knet 1.1 (released about a year ago), one can sprinkle parameters anywhere – in any argument position, object field, global variable, etc. I personally use (but do not enforce) the callable-object style, and the new tutorial provides many examples, e.g.

struct Linear; w; b; end
(m::Linear)(x) = m.w * x .+ m.b
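
To make it concrete, here is a minimal sketch (toy data, not a full training loop) of how such a callable object is differentiated using Param, @diff and grad from AutoGrad:

using Knet   # Param, @diff and grad come from AutoGrad and are reexported by Knet

m = Linear(Param(randn(2, 3)), Param(zeros(2)))   # parameters live inside the object
x, y = randn(3, 5), randn(2, 5)                   # toy data

loss(m, x, y) = sum(abs2, m(x) .- y)
tape = @diff loss(m, x, y)                        # record the differentiable computation
∇w, ∇b = grad(tape, m.w), grad(tape, m.b)         # gradients w.r.t. any Param, wherever it lives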

A consistent API and backward compatibility are important for serious use but difficult to achieve. Since I personally use Knet for research and have models written years ago, this has been a constant challenge for me. The components of a deep learning API include:

  1. GPU array API
  2. Operator set for deep learning specific functions/layers etc.
  3. Automatic differentiation API

I think CuArrays largely solves the first issue, following the well-thought-out AbstractArray interface – I plan to retire KnetArrays (which are currently still faster and more memory efficient) in favor of CuArrays once the memory management and operator performance issues are resolved.

A remaining thorny issue is the interface for deep learning functions like conv, lstm, softmax etc. Do these belong with the array library? These are not standard functions defined in Julia base for AbstractArrays, nor are there standard CPU implementations of them (like BLAS for linear algebra). But maybe there should be – given how widespread their use has become. Maybe Julia can be the Fortran of deep learning: the first language that defines the equivalent of BLAS! (One can dream…)

I think NNlib is a good start for a set of operators, but it is still incomplete (no RNNs, PackedSequences, etc.) and unstable (I don’t think the conv interface is settled yet; for example, Mike was, I think correctly, suggesting that padding should be a property of the array type, not of the operation). One possible idea for a standard set of operators is to support a (version of an) external standard (ONNX, NNEF, etc.) – although these are also moving targets. I plan to give this a go in Knet in the near future; at least you will know that if you write a model using e.g. ONNX v7 primitives, it will keep working in future versions (and can be imported/exported easily as a bonus).

We have talked with Mike and Jarrett about getting some standard interface for AD (and have so far failed). AD seems to be a very emotional topic for some reason and I don’t see a resolution in the near future :frowning:

43 Likes

Likewise, and especially for the related memory issues:

Edit: Though this may be more in the Julia GPU purview so maybe @maleadt can comment as well?

1 Like

CuArrays already has NNlib in its dependencies and implements many (all?) of its functions, so for me this question is kind of resolved. The drawback is, again, API stability - if Knet switches to the CuArrays/NNlib API, it will suffer from the same backward compatibility issues as Flux.

May I see the discussion? Although it’s not directly related to the topic, I’m curious about this suggestion since right now it sounds quite counter-intuitive.

Oh, and this discussion too, please! Is it about operator overloading vs. tracing, or static vs. dynamic graph, or…? I’m already excited! :smiley:

On memory issues: GPU memory management is challenging for several reasons:

  1. cudaMalloc is relatively expensive.
  2. The Julia GC does not feel the memory pressure on GPU.
  3. Dynamic (i.e. no static graph) deep learning has to allocate and free quite a bit of memory every training iteration.

Because of #1 we have to reuse previously allocated and garbage collected memory for performance (using a special finalizer). To give you an idea, the average cost of cudaMalloc is 0.5ms, whereas finding and reusing a preallocated block is 0.0005ms.

But the only way to reuse a preallocated block is to be sure it is no longer being used by the program, which can only be determined by the Julia GC. Because of #2 we need to call the Julia GC manually. The average cost is about 100ms, so you do not want to do this very often. I found that performance is very sensitive to how often one allows gc to happen, and the optimal value for the minimum interval is around 2-500ms for the large models I have tested.

Periodically, even gc will not be enough – especially with models that keep using differently sized arrays (e.g. neural machine translation, because of varying sentence lengths) you find that you are out of GPU memory, and even if the GC gives you some reusable pointers, they are all of the wrong size. At this point I throw in the towel, call cudaFree on every unused pointer and start over (probably suboptimal). This is the most costly option, about 250ms by itself but more in practice because of the opportunity cost of throwing away all these pointers that could have been reused and having to cudaMalloc them again.
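
To make the three-step strategy above concrete, here is a toy sketch of the decision logic (illustrative only - this is not the actual KnetArray allocator, and device_malloc / device_free_all! are stand-ins for the real CUDA calls):

const POOL = Dict{Int,Vector{Any}}()        # block size => reusable, finalizer-returned blocks
const LAST_GC = Ref(0.0)

function pool_alloc(nbytes; device_malloc, device_free_all!, gc_interval = 0.2)
    free = get!(POOL, nbytes, Any[])
    isempty(free) || return pop!(free)       # 1. reuse a pooled block (~0.0005 ms)
    if time() - LAST_GC[] > gc_interval      # 2. rate-limited manual GC (~100 ms per call)
        GC.gc(); LAST_GC[] = time()
        free = get!(POOL, nbytes, Any[])
        isempty(free) || return pop!(free)
    end
    device_free_all!(POOL)                   # 3. last resort: return everything to the driver...
    return device_malloc(nbytes)             # ...and pay the ~0.5 ms cudaMalloc again
end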

People trying to implement large(ish) models such as the MAC-Network forced me to re-optimize the KnetArray allocator several times. It is still not as good as PyTorch/Tensorflow (in terms of total memory use) but is fast and fairly robust to the types of problems cited above for Flux/CuArrays. I have not looked carefully at the CuArrays allocator but I am sure it can be made more robust using similar tricks and performance testing on memory-hungry models.

14 Likes

I am afraid most of these were live discussions without a written record :frowning:

But we can certainly start new threads for them. @maleadt @MikeInnes and @jrevels could point to existing write-ups or help write up the relevant points. I will just elaborate on the array type issue below:

Julia has often used the strategy of defining new array types (UpperTriangular, Diagonal, LinearAlgebra.I instead of eye(), etc.) to e.g. make LinearAlgebra operations more efficient. Given the option of keeping some meta information with the datatype itself vs passing it as an argument to an operation, I think it makes sense to do the former. From a mathematical point of view it is conceptually clearer. From a software engineering point of view it allows you to take full advantage of Julia’s multiple dispatch, and (as I learned from SICP decades ago) if you came across a random pointer (or, in this case, a raw array) lying on the street, how would you know what to do with it?

So it is with deep learning: being padded is conceptually a property of the array, not of the operation (although cuDNN may make you think otherwise because of its clumsy C interface). The same goes for minibatched sequences for RNNs stored as PackedSequence vs PaddedSequence vs MaskedSequence etc.
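
As a toy illustration of the idea (PaddedArray and myconv are made-up names here, not NNlib or Knet API):

struct PaddedArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
    pad::NTuple{N,Int}
end
Base.size(a::PaddedArray) = size(a.data)
Base.getindex(a::PaddedArray, i::Int...) = a.data[i...]

myconv(x::AbstractArray, w) = "conv, no padding"          # generic fallback
myconv(x::PaddedArray, w)  = "conv, padding $(x.pad)"     # dispatch reads the padding off the type

x = rand(8, 8)
myconv(x, ones(3, 3))                          # "conv, no padding"
myconv(PaddedArray(x, (1, 1)), ones(3, 3))     # "conv, padding (1, 1)"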

12 Likes

It’s not entirely true that Flux’s API changes without warning: in Flux itself we do tend to deprecate things properly and give people time to upgrade. NNlib is a little different because it was originally designed with library use in mind, but if people are using it directly we can and should commit to more API stability. Communication is key here: if we know what issues people are running into we’ll fix those first, or at least help you figure out new APIs, add docs or deprecation warnings.

On performance: firstly, this kind of thing is really benchmark sensitive. For every microbenchmark that shows X there’s one that shows !X, and I could point to blog posts etc that find Flux much faster than TF or PyTorch for their use cases. On average all the tracing ADs have fairly similar performance IME (~1us overhead).

Obviously a big driver for Zygote is reducing AD overhead across the board. In my tests Zygote has ~10x less overhead than tracing ADs on a series of benchmarks I have (including convolutions, MLPs and RNNs). There are still performance bugs for sure, but if it were working perfectly in all cases we’d be releasing 1.0 rather than continuing the huge effort to develop it. In any case, turning one benchmark into a blanket statement and making out that there’s no effort going into these issues is pretty unreasonable.

39 Likes

Thanks for your reply!

But Flux reexports NNlib. The way I learned this was by trying to run the CNN example from the model zoo, which failed because NNlib.maxpool() had changed its signature.

Again, I don’t want to criticize anyone (actually I’d love to help with code!), but every time I try to use Flux or NNlib I find that something has been broken again, and so all my spare time is spent on fixing things instead of adding value.

Are these benchmarks available online? I remember several benchmarks dealing with several hundreds of parameters where Flux was indeed much faster. But I’m not really worried about training times of several minutes; the problems I (and basically all my colleagues in industry) usually deal with involve millions of parameters and hours or days of training. So it’s possible we are just talking about different kinds of benchmarks here.

Also, in my experience performance is not something you can add later - it should be designed in from the very beginning and tracked all the time. The bottlenecks (and thus the optimization strategies) also differ between tasks. For example, when you do hundreds of large matrix multiplications, 98% of the time is spent in BLAS/cuBLAS, while things like type inference or compile-time dispatch, or even most of the optimizations a compiler usually does for general-purpose code, have very little influence on the performance of the neural network.
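
As a quick (machine-dependent) sanity check of this claim, the extra dynamic dispatch is lost in the noise next to the gemm itself:

using LinearAlgebra

W, x = rand(Float32, 2048, 2048), rand(Float32, 2048, 512)
direct(W, x) = W * x
indirect(fs, W, x) = fs[end](W, x)       # Any-typed container forces dynamic dispatch
fs = Any[direct]

direct(W, x); indirect(fs, W, x)         # warm up (compile) both paths first
t1 = @elapsed direct(W, x)
t2 = @elapsed indirect(fs, W, x)
println("plain gemm: $(t1)s, gemm behind dynamic dispatch: $(t2)s")   # nearly identical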

As an example, when working on my very first AD package I managed to make it ~100x faster using a set of optimizations including:

  1. Rewriting the heaviest operations as in-place versions (speedup of 5-10x)
  2. Using buffers for everything (another 3-4x)
  3. Common subexpression elimination (1.5-2x)
  4. Broadcasting/kernel fusion (1.2-1.5x)

All of these are easy to do in tape-based, algebraic systems (single assignment, no mutation, static dependencies), but Zygote doesn’t build an explicit tape, so I’m not sure how many of these optimizations can be applied (I guess 4 is already implemented, maybe 1 too, but 2 and 3 seem hard to do in a general setting).
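
For illustration, optimizations 1 and 2 from the list above boil down to rewrites like the following toy example (hand-written here; the real transformations were applied to tape operations, not user code):

using LinearAlgebra

n = 1024
W1, W2, x = rand(n, n), rand(n, n), rand(n, n)

allocating(W1, W2, x) = W2 * (W1 * x)        # two fresh n×n allocations per call

function buffered!(buf1, buf2, W1, W2, x)    # zero allocations per call
    mul!(buf1, W1, x)                        # in-place: buf1 = W1 * x
    mul!(buf2, W2, buf1)                     # in-place: buf2 = W2 * buf1
    return buf2
end

buf1, buf2 = similar(x), similar(x)
allocating(W1, W2, x) ≈ buffered!(buf1, buf2, W1, W2, x)   # true - same result, no garbage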

Which brings me to the following questions:

  1. In what use cases does Flux have the best performance in its class?
  2. How does Flux’s performance compare to that of other frameworks? Yes, my single benchmark can hardly be generalized to the whole library, but then what should I use to actually measure performance?
  3. What should I expect in the future? If, as suggested by Chris, Flux is going to be the best for scientific machine learning, I’m perfectly fine with that and wish it all the best. But for “standard” deep learning I’ll have to switch to something else.

(I hope none of this sounds offensive - I don’t mean to attack any of the mentioned projects or people, but I’d very much appreciate it if you could point me to the realistic options I have, because right now I can’t honestly promote Julia for ML to any of my coworkers.)

12 Likes

There are two major areas of deep learning framework performance (bearing in mind that most kernels are shared between frameworks, and almost no framework does really meaningful optimisation):

  1. Autodiff (essentially a constant overhead per operation)
  2. Memory management (overheads when allocating memory).

These tend to show up in different contexts. In scalar code, or code with small tensors, AD overhead will dominate (this is what I was referring to as being typically 1us for tracing systems); for scalar code this can make your model 1000x slower than it needs to be. This is the part that Julia frameworks, and Zygote in particular, tend to be really good at (taking overhead down to about 50ns) – so in these contexts Flux will do really well. That’s common in scientific ML, as Chris points out, but not just that; NLP and RL are important areas for this as well.

For larger ResNets, for example (where convolutions cost >>1us), the dominating factor will be how well you can manage memory, and modulo future compiler optimisations we expect to match but not beat other frameworks here. Python-based frameworks tend to have an advantage since refcounting allows you to free memory really aggressively. We know that this makes it harder to run some of the larger CV models in Flux when GPU space is limited (though this could also be due to Tracker holding on to more memory than it needs to; Zygote makes it easier to debug this). In principle, as long as a model fits into memory performance should be similar (although GC thrashing can ruin performance, so that’s something you’ll want to check for). Knet, Flux, PyTorch etc. all also have different memory pooling strategies that can lead to different performance characteristics in different contexts, even within the “large convnet” space. And as if that weren’t enough, each framework arbitrarily chooses some CUDA settings that can vary performance across matrix sizes etc. Any of these factors might show up at once in a given model.

If you want to trace code and pre-allocate memory, fine, but that’s totally orthogonal to AD. You can trace through Tracker, Zygote, or Knet just fine if you want to, and use that to avoid any allocations during training. (Personally though, I think it’d be much better to use an allocation strategy that can still handle control flow, like the bump allocator in Alloc.jl)

10 Likes

I wouldn’t say they are actually orthogonal. Neural networks deal with computational graphs, and how complex these graphs are influences both automatic differentiation and memory optimization.

Classic DL frameworks usually build a directed acyclic graph. Static or dynamic, source-to-source or based on operator overloading - in the end it’s usually a DAG. These DAGs come with a number of restrictions: often you can’t use mutation, you need special operators for control flow, and in general you work only with a set of predefined operators. However, this representation has its advantages. For example, if you see:

y = bar(x)
z = baz(x)
result = sum(y) + sum(z)

you know for sure that these functions have no side effects and thus can be swapped or moved anywhere else to reduce the memory footprint. E.g. the code generator can rewrite it to:

buf .= bar(x)
s1 = sum(buf)
buf .= baz(x)
s2 = sum(buf)
result = s1 + s2

or, if bar and baz have in-place versions, the framework can further optimize it to:

bar!(buf, x)
s1 = sum(buf)
baz!(buf, x)
s2 = sum(buf)
result = s1 + s2

or, if bar and baz aren’t primitives and both call some function foo(x), the graph compiler can apply CSE and calculate foo(x) only once. And all of these things can be done at compile time, without any calls to the allocator at runtime.

On the other hand, a general-purpose compiler for an imperative language can’t do any of this: it cannot reorder bar() and baz() since they may have side effects, it generally doesn’t know about in-place versions of the functions, and it cannot use CSE since bar() and baz() are compiled at different times (in Julia, usually at the first call) and never end up in the same computational graph. So I don’t really see how the compiler of a general-purpose language can optimize restricted NN computational graphs to the same degree as a dedicated engine.

But maybe I’m wrong, and maybe one day it will indeed happen. The real question is: when, and what should we do in the meantime? If framework A is more powerful and will be as fast as B in 3 years, but right now is several times slower, in practice I’ll still have to go with B.

(Note that it doesn’t imply that frameworks A or B are better for everyone, and I highly encourage any potential users to evaluate on their own all available options for their specific task.)

10 Likes

FWIW, I will be looking at memory allocator performance soon, which accounts for a large part of the performance issues that we’re seeing with larger Flux models, and will hopefully make it possible for Knet to use CuArrays as well.

20 Likes

Just because current frameworks conflate these things doesn’t mean they aren’t orthogonal. My point is that in Julia you can happily write a tracer / graph-builder that works on generic Julia code and does all the optimisations you want, then combine that with an AD (any of the ones we’ve discussed), and get the same performance as if you’d written a combined AD+optimiser like Yota.

6 Likes

My personal experience with Julia is that it seems to perform really well when you do standard deep learning, but I unfortunately found it completely unusable when I tried making new experimental networks that included a graph Laplacian. Flux in this case used about 10-100 times the amount of memory that PyTorch did.

Other than that I really like Flux, and hope that it will be possible for me to run my Laplacian networks once Zygote is implemented.

5 Likes

Have you tried Knet.jl for that?
It seems all the spotlight is on Flux.jl, but Knet.jl appears to be decent, and then some, without getting enough credit.
I wish it were simply a PyTorch written in Julia (or at least heavily inspired by it).

3 Likes

I already have both a Flux and a PyTorch implementation of my code, and I’m not really looking to make another one. Whenever I encounter something that Flux cannot do, I implement it in PyTorch instead (which is a shame because I was hoping to completely move away from it, but hopefully that will come one day).

So as long as Flux and Julia seem to be actively developed, I will likely keep my code in Flux.

This seems like a trend to note and keep in mind for future Julia ML feature development / roadmap etc. >>

38 results for ERROR: Out of gpu memory here >> Search results for 'ERROR: Out of gpu memory ' - JuliaLang

Also relevant: @denizyuret’s work on large neural machine translation models based on RNNs and Transformers, described here >> CuArray allocation issue: How often is it a problem?, and memory allocation errors like ERROR: Out of gpu memory for function vgg_m(x0,weights), described here >> Gpu out of memory.

Also note that one human genome comes to approximately 100GB of raw data and 250GB of analyzed files. I think we are starting to see useful large machine learning data sets that are either completely intractable or at best cost-prohibitive using anything other than high-performance CPUs and relatively inexpensive NVMe SSD technologies with filesystem compression.

For large-memory models I think it is very helpful that Knet has its own mechanism for file IO, so please continue to develop the Julia KnetArray memory allocator so that we can implement machine learning models using CPUs with NVMe SSDs and filesystem compression for data sets larger than the 11GB GPU memory limitation.

3 Likes