State of deep learning in Julia

Although Julia is promoted as an excellent language for deep learning, I still don’t see any framework I could use in production or even in long-term research. Here are the options I considered in different periods of time:

MXNet.jl

MXNet.jl is a Julia interface to the core library written in C++ (and Python?). Like any wrapper, MXNet uses borrowed data structures and doesn’t feel “native”, which in practice usually means that the library doesn’t play well with other common libraries. But the biggest concern is the user base - I don’t see much interest in MXNet in the community, nor a high pace of development.

TensorFlow.jl

I remember it being quite popular a year or two ago, but just like MXNet it seems to attract little interest nowadays. Also, even though it includes the latest innovations from TF 2.0, I’m not sure how complete the Julia API is compared to the Python version.

Knet.jl

To my mind, this project is one of the best examples of good programming style and software management in general. Knet maintains very good backward compatibility, has excellent documentation and shows high performance. Knet comes with its quirks though - I still can’t get used to wrapping everything into KnetArray and writing predict(w,x) = w[1]*x .+ w[2] instead of predict(w,b,x) = w*x .+ b. Maybe one day I will stop worrying and learn to love the API, but that day hasn’t come yet.

(There’s also a more personal concern about automatic differentiation - after having designed 4 AD packages I have a pretty strong opinion on how it should be done, and AutoGrad.jl doesn’t match those criteria.)

Flux.jl

Flux is the most frequently recommended framework for deep learning in Julia; however, I don’t see it as practical, for two reasons.

Firstly, Flux (and its underlying library NNlib) has an extremely unstable API. Functions and types get removed or replaced without any deprecation or prior notice. I was the first to introduce conv2d() to NNlib (actually borrowed from Knet), but after a couple of months it was rewritten in pure Julia and renamed to just conv with a new argument list (boom! my own code that depends on NNlib suddenly stopped working). Seven months ago the API changed again - conv got a new required argument, ConvDims. By the way, ConvDims has no docs on how to properly construct it, and if you try to follow the comments in the source code, you will find that they are already out of date and that you should use DenseConvDims instead.

This instability spreads to Flux itself - with the latest Flux v0.8.3, a large portion of the Model Zoo is broken because of the changed maxpool(), which now requires a new argument of type PoolDims (and I’m still looking for the proper way to use it). I don’t want to blame anyone - after all, it is you who create the value - but please remember that keeping pace with the changes may be quite painful for someone not closely following the project.

Secondly, Flux is slow, and there’s very little activity to make it faster. For example, in one benchmark (Knet vs Flux etc.) Flux was ~3 times slower than Knet. There’s hope that Zygote - an upcoming AD engine - will fix this, but so far my experience has been the opposite (there’s also a more recent issue on a performance regression compared to the current Tracker).

My latest ResNet-based siamese network in PyTorch took 3 days of training on a cloud Tesla V100 to produce the first meaningful results. Using a framework that is 3-10 times slower is just impractical in such settings.

Please share your experience, suggestions and, if you have one, your grand plan for the development of deep learning infrastructure.

39 Likes

Just curious, is TensorFlow.jl still under maintenance?

Knet is probably the best package for “standard” deep learning. Flux is the easiest to make work for scientific machine learning, i.e. stuff like neural PDEs (and it’s probably the best out there for that right now). I tend to do the latter, so I mention Flux a lot, but I definitely agree that if you just want to make some convolutional neural networks and train them to do classification, it doesn’t do as well as Knet.

Also, I didn’t like Knet’s interface before (it was too low level too much of the time, IMO), but it’s come a long way and I am starting to like how it looks now.

8 Likes

I had gotten the impression that Flux was totally focused on Zygote. Indeed, looking at the commits, that seems to be the case. If they are fully committed to Zygote, it would make sense for them not to do any real performance work until Flux uses Zygote.

Of course, I can’t comment on the feasibility of making such a thing performant. I would be very interested to hear from any of the Flux or Zygote developers about what their plans are for performance, and whether they’ve done much preliminary testing. It was my hope that Flux would mature and become progressively more desirable, but I share the OP’s concern that I haven’t heard a peep about the bad performance situation.

3 Likes

First, a small update on Knet: both AutoGrad and Knet have evolved significantly since the days of predict(w,x) = w[1]*x .+ w[2] – an unfortunate style forced by the original API of the Python autograd (on which AutoGrad.jl was based), which collected differentiable parameters in the first argument. Starting with Knet 1.1 (released about a year ago), one can sprinkle parameters anywhere – in any argument position, object field, global variable, etc. I personally use (but do not enforce) the callable-object style, and the new tutorial provides many examples, e.g.

struct Linear; w; b; end
(m::Linear)(x) = m.w * x .+ m.b
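
To make it concrete, here is a minimal sketch (toy data, not a full training loop) of how such a callable object is differentiated using Param, @diff and grad from AutoGrad:

using Knet   # Param, @diff and grad come from AutoGrad and are reexported by Knet

m = Linear(Param(randn(2, 3)), Param(zeros(2)))   # parameters live inside the object
x, y = randn(3, 5), randn(2, 5)                   # toy data

loss(m, x, y) = sum(abs2, m(x) .- y)
tape = @diff loss(m, x, y)                        # record the differentiable computation
∇w, ∇b = grad(tape, m.w), grad(tape, m.b)         # gradients w.r.t. any Param, wherever it lives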

A consistent API and backward compatibility are important for serious use but difficult to achieve. Since I personally use Knet for research and have models written years ago, this has been a constant challenge for me. The components of a deep learning API include:

  1. GPU array API
  2. Operator set for deep learning specific functions/layers etc.
  3. Automatic differentiation API

I think CuArrays largely solves the first issue, following the well-thought-out AbstractArray interface – I plan to retire KnetArrays (which are currently still faster and more memory efficient) in favor of CuArrays once the memory management and operator performance issues are resolved.

A remaining thorny issue is the interface for deep learning functions like conv, lstm, softmax etc. Do these belong with the array library? These are not standard functions defined in Julia base for AbstractArrays, nor are there standard CPU implementations of them (like BLAS for linear algebra). But maybe there should be – given how widespread their use has become. Maybe Julia can be the Fortran of deep learning: the first language that defines the equivalent of BLAS! (One can dream…)

I think NNlib is a good start for a set of operators, but it is still incomplete (no RNNs, PackedSequences, etc.) and unstable (I don’t think the conv interface is settled yet; for example, Mike was, I think correctly, suggesting that padding should be a property of the array type, not of the operation). One possible idea for a standard set of operators is to support a (version of an) external standard (ONNX, NNEF, etc.) – although these are also moving targets. I plan to give this a go in Knet in the near future; at least you will know that if you write a model using e.g. ONNX v7 primitives, it will keep working in future versions (and can be imported/exported easily as a bonus).

We have talked with Mike and Jarrett about getting some standard interface for AD (and have so far failed). AD seems to be a very emotional topic for some reason and I don’t see a resolution in the near future :frowning:

43 Likes

Likewise, and especially for the related memory issues:

Edit: Though this may be more in the Julia GPU purview so maybe @maleadt can comment as well?

1 Like

CuArrays already has NNlib in its dependencies and implements many (all?) of its functions, so for me this question is kind of resolved. The drawback is, again, API stability - if Knet switches to the CuArrays/NNlib API, it will suffer from the same backward compatibility issues as Flux.

May I see the discussion? Although it’s not directly related to the topic, I’m curious about this suggestion since right now it sounds quite counter-intuitive.

Oh, and this discussion too, please! Is it about operator overloading vs. tracing, or static vs. dynamic graph, or…? I’m already excited! :smiley:

On memory issues: GPU memory management is challenging for several reasons:

  1. cudaMalloc is relatively expensive.
  2. The Julia GC does not feel the memory pressure on GPU.
  3. Dynamic (i.e. no static graph) deep learning has to allocate and free quite a bit of memory every training iteration.

Because of #1 we have to reuse previously allocated and garbage collected memory for performance (using a special finalizer). To give you an idea, the average cost of cudaMalloc is 0.5ms, whereas finding and reusing a preallocated block is 0.0005ms.

But the only way to reuse a preallocated block is to be sure it is no longer being used by the program, which can only be determined by the Julia GC. Because of #2 we need to call the Julia GC manually. The average cost is about 100ms, so you do not want to do this very often. I found that performance is very sensitive to how often one allows gc to happen, and the optimal value for the minimum interval is around 2-500ms for the large models I have tested.

Periodically, even gc will not be enough – especially with models that keep using differently sized arrays (e.g. neural machine translation, because of varying sentence lengths) you find that you are out of GPU memory, and even if the GC gives you some reusable pointers, they are all of the wrong size. At this point I throw in the towel, call cudaFree on every unused pointer and start over (probably suboptimal). This is the most costly option, about 250ms by itself but more in practice because of the opportunity cost of throwing away all these pointers that could have been reused and having to cudaMalloc them again.
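
To make the three-step strategy above concrete, here is a toy sketch of the decision logic (illustrative only - this is not the actual KnetArray allocator, and device_malloc / device_free_all! are stand-ins for the real CUDA calls):

const POOL = Dict{Int,Vector{Any}}()        # block size => reusable, finalizer-returned blocks
const LAST_GC = Ref(0.0)

function pool_alloc(nbytes; device_malloc, device_free_all!, gc_interval = 0.2)
    free = get!(POOL, nbytes, Any[])
    isempty(free) || return pop!(free)       # 1. reuse a pooled block (~0.0005 ms)
    if time() - LAST_GC[] > gc_interval      # 2. rate-limited manual GC (~100 ms per call)
        GC.gc(); LAST_GC[] = time()
        free = get!(POOL, nbytes, Any[])
        isempty(free) || return pop!(free)
    end
    device_free_all!(POOL)                   # 3. last resort: return everything to the driver...
    return device_malloc(nbytes)             # ...and pay the ~0.5 ms cudaMalloc again
end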

People trying to implement large(ish) models such as the MAC-Network forced me to re-optimize the KnetArray allocator several times. It is still not as good as PyTorch/Tensorflow (in terms of total memory use) but is fast and fairly robust to the types of problems cited above for Flux/CuArrays. I have not looked carefully at the CuArrays allocator but I am sure it can be made more robust using similar tricks and performance testing on memory-hungry models.

14 Likes

I am afraid most of these were live discussions without a written record :frowning:

But we can certainly start new threads for them. @maleadt @MikeInnes and @jrevels could point to existing write-ups or help write up the relevant points. I will just elaborate on the array type issue below:

Julia has often used the strategy of defining new array types (UpperTriangular, Diagonal, LinearAlgebra.I instead of eye(), etc.) to e.g. make LinearAlgebra operations more efficient. Given the option of keeping some meta information with the datatype itself vs passing it as an argument to an operation, I think it makes sense to do the former. From a mathematical point of view it is conceptually clearer. From a software engineering point of view it allows you to take full advantage of Julia’s multiple dispatch, and (as I learned from SICP decades ago) if you came across a random pointer (or, in this case, a raw array) lying on the street, how would you know what to do with it?

So it is with deep learning: being padded is conceptually a property of the array, not of the operation (although cuDNN may make you think otherwise because of its clumsy C interface). The same goes for minibatched sequences for RNNs stored as PackedSequence vs PaddedSequence vs MaskedSequence etc.
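
As a toy illustration of the idea (PaddedArray and myconv are made-up names here, not NNlib or Knet API):

struct PaddedArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
    pad::NTuple{N,Int}
end
Base.size(a::PaddedArray) = size(a.data)
Base.getindex(a::PaddedArray, i::Int...) = a.data[i...]

myconv(x::AbstractArray, w) = "conv, no padding"          # generic fallback
myconv(x::PaddedArray, w)  = "conv, padding $(x.pad)"     # dispatch reads the padding off the type

x = rand(8, 8)
myconv(x, ones(3, 3))                          # "conv, no padding"
myconv(PaddedArray(x, (1, 1)), ones(3, 3))     # "conv, padding (1, 1)"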

12 Likes

It’s not entirely true that Flux’s API changes without warning: in Flux itself we do tend to deprecate things properly and give people time to upgrade. NNlib is a little different because it was originally designed with library use in mind, but if people are using it directly we can and should commit to more API stability. Communication is key here: if we know what issues people are running into we’ll fix those first, or at least help you figure out new APIs, add docs or deprecation warnings.

On performance: firstly, this kind of thing is really benchmark sensitive. For every microbenchmark that shows X there’s one that shows !X, and I could point to blog posts etc that find Flux much faster than TF or PyTorch for their use cases. On average all the tracing ADs have fairly similar performance IME (~1us overhead).

Obviously a big driver for Zygote is reducing AD overhead across the board. In my tests Zygote has ~10x less overhead than tracing ADs on a series of benchmarks I have (including convolutions, MLPs and RNNs). There are still performance bugs for sure, but if it were working perfectly in all cases we’d be releasing 1.0 rather than continuing the huge effort to develop it. In any case, turning one benchmark into a blanket statement and making out that there’s no effort going into these issues is pretty unreasonable.

39 Likes

Thanks for your reply!

But Flux reexports NNlib. The way I learned this was by trying to run the CNN example from the model zoo, which failed because NNlib.maxpool() had changed its signature.

Again, I don’t want to criticize anyone (actually I’d love to help with code!), but every time I try to use Flux or NNlib I find that something has been broken again, and so all my spare time is spent on fixing things instead of adding value.

Are these benchmarks available online? I remember several benchmarks dealing with several hundreds of parameters where Flux was indeed much faster. But I’m not really worried about training times of several minutes; the problems I (and basically all my colleagues in industry) usually deal with involve millions of parameters and hours or days of training. So it’s possible we are just talking about different kinds of benchmarks here.

Also, in my experience performance is not something you can add later - it should be designed in from the very beginning and tracked all the time. The bottlenecks (and thus the optimization strategies) also differ between tasks. For example, when you do hundreds of large matrix multiplications, 98% of the time is spent in BLAS/cuBLAS, while things like type inference or compile-time dispatch, or even most of the optimizations a compiler usually does for general-purpose code, have very little influence on the performance of the neural network.
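
As a quick (machine-dependent) sanity check of this claim, the extra dynamic dispatch is lost in the noise next to the gemm itself:

using LinearAlgebra

W, x = rand(Float32, 2048, 2048), rand(Float32, 2048, 512)
direct(W, x) = W * x
indirect(fs, W, x) = fs[end](W, x)       # Any-typed container forces dynamic dispatch
fs = Any[direct]

direct(W, x); indirect(fs, W, x)         # warm up (compile) both paths first
t1 = @elapsed direct(W, x)
t2 = @elapsed indirect(fs, W, x)
println("plain gemm: $(t1)s, gemm behind dynamic dispatch: $(t2)s")   # nearly identical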

As an example, when working on my very first AD package I managed to make it ~100x faster using a set of optimizations including:

  1. Rewriting the heaviest operations as in-place versions (speedup of 5-10x)
  2. Using buffers for everything (another 3-4x)
  3. Common subexpression elimination (1.5-2x)
  4. Broadcasting/kernel fusion (1.2-1.5x)

All of these are easy to do in tape-based, algebraic systems (single assignment, no mutation, static dependencies), but Zygote doesn’t build an explicit tape, so I’m not sure how many of these optimizations can be applied (I guess 4 is already implemented, maybe 1 too, but 2 and 3 seem hard to do in a general setting).
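
For illustration, optimizations 1 and 2 from the list above boil down to rewrites like the following toy example (hand-written here; the real transformations were applied to tape operations, not user code):

using LinearAlgebra

n = 1024
W1, W2, x = rand(n, n), rand(n, n), rand(n, n)

allocating(W1, W2, x) = W2 * (W1 * x)        # two fresh n×n allocations per call

function buffered!(buf1, buf2, W1, W2, x)    # zero allocations per call
    mul!(buf1, W1, x)                        # in-place: buf1 = W1 * x
    mul!(buf2, W2, buf1)                     # in-place: buf2 = W2 * buf1
    return buf2
end

buf1, buf2 = similar(x), similar(x)
allocating(W1, W2, x) ≈ buffered!(buf1, buf2, W1, W2, x)   # true - same result, no garbage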

Which brings me to the following questions:

  1. In what use cases does Flux have the best performance in its class?
  2. How does Flux’s performance compare to that of other frameworks? Yes, my single benchmark can hardly be generalized to the whole library, but then what should I use to actually measure performance?
  3. What should I expect in the future? If, as suggested by Chris, Flux is going to be the best for scientific machine learning, I’m perfectly fine with that and wish it all the best. But for “standard” deep learning I’ll have to switch to something else.

(I hope none of this sounds offensive - I don’t mean to attack any of the mentioned projects or people, but I’d very much appreciate it if you could point me to the realistic options I have, because right now I can’t honestly promote Julia for ML to any of my coworkers.)

12 Likes

There are two major areas of deep learning framework performance (bearing in mind that most kernels are shared between frameworks, and almost no framework does really meaningful optimisation):

  1. Autodiff (essentially a constant overhead per operation)
  2. Memory management (overheads when allocating memory).

These tend to show up in different contexts. In scalar code, or code with small tensors, AD overhead will dominate (this is what I was referring to as being typically 1us for tracing systems); for scalar code this can make your model 1000x slower than it needs to be. This is the part that Julia frameworks, and Zygote in particular, tend to be really good at (taking overhead down to about 50ns) – so in these contexts Flux will do really well. That’s common in scientific ML, as Chris points out, but not just that; NLP and RL are important areas for this as well.

For larger ResNets, for example (where convolutions cost >>1us), the dominating factor will be how well you can manage memory, and modulo future compiler optimisations we expect to match but not beat other frameworks here. Python-based frameworks tend to have an advantage since refcounting allows you to free memory really aggressively. We know that this makes it harder to run some of the larger CV models in Flux when GPU space is limited (though this could also be due to Tracker holding on to more memory than it needs to; Zygote makes it easier to debug this). In principle, as long as a model fits into memory performance should be similar (although GC thrashing can ruin performance, so that’s something you’ll want to check for). Knet, Flux, PyTorch etc. all also have different memory pooling strategies that can lead to different performance characteristics in different contexts, even within the “large convnet” space. And as if that weren’t enough, each framework arbitrarily chooses some CUDA settings that can vary performance across matrix sizes etc. Any of these factors might show up at once in a given model.

If you want to trace code and pre-allocate memory, fine, but that’s totally orthogonal to AD. You can trace through Tracker, Zygote, or Knet just fine if you want to, and use that to avoid any allocations during training. (Personally though, I think it’d be much better to use an allocation strategy that can still handle control flow, like the bump allocator in Alloc.jl)

10 Likes

I wouldn’t say they are actually orthogonal. Neural networks deal with computational graphs, and how complex these graphs are influences both automatic differentiation and memory optimization.

Classic DL frameworks usually build a directed acyclic graph. Static or dynamic, source-to-source or based on operator overloading - in the end it’s usually a DAG. These DAGs come with a number of restrictions: often you can’t use mutation, you need special operators for control flow, and in general you work only with a set of predefined operators. However, this representation has its advantages. For example, if you see:

y = bar(x)
z = baz(x)
result = sum(y) + sum(z)

you know for sure that these functions have no side effects and thus can be swapped or moved anywhere else to reduce the memory footprint. E.g. the code generator can rewrite it to:

buf .= bar(x)
s1 = sum(buf)
buf .= baz(x)
s2 = sum(buf)
result = s1 + s2

or, if bar and baz have in-place versions, the framework can further optimize it to:

bar!(buf, x)
s1 = sum(buf)
baz!(buf, x)
s2 = sum(buf)
result = s1 + s2

or, if bar and baz aren’t primitives and both call some function foo(x), the graph compiler can apply CSE and calculate foo(x) only once. And all of these things can be done at compile time, without any calls to the allocator at runtime.

On the other hand, a general-purpose compiler for an imperative language can’t do any of this: it cannot reorder bar() and baz() since they may have side effects, it generally doesn’t know about in-place versions of the functions, and it cannot use CSE since bar() and baz() are compiled at different times (in Julia, usually at the first call) and never end up in the same computational graph. So I don’t really see how the compiler of a general-purpose language can optimize restricted NN computational graphs to the same degree as a dedicated engine.

But maybe I’m wrong, and maybe one day it will indeed happen. The real question is: when, and what should we do in the meantime? If framework A is more powerful and will be as fast as B in 3 years, but right now is several times slower, in practice I’ll still have to go with B.

(Note that it doesn’t imply that frameworks A or B are better for everyone, and I highly encourage any potential users to evaluate on their own all available options for their specific task.)

10 Likes

FWIW, I will be looking at memory allocator performance soon, which accounts for a large part of the performance issues that we’re seeing with larger Flux models, and will hopefully make it possible for Knet to use CuArrays as well.

20 Likes

Just because current frameworks conflate these things doesn’t mean they aren’t orthogonal. My point is that in Julia you can happily write a tracer / graph-builder that works on generic Julia code and does all the optimisations you want, then combine that with an AD (any of the ones we’ve discussed), and get the same performance as if you’d written a combined AD+optimiser like Yota.

6 Likes

My personal experience with Julia is that it seems to perform really well when you do standard deep learning, but I unfortunately found it completely unusable when I tried making new experimental networks that included a graph Laplacian. Flux in this case used about 10-100 times the amount of memory that PyTorch did.

Other than that I really like Flux, and hope that it will be possible for me to run my Laplacian networks once Zygote is implemented.

5 Likes

Have you tried Knet.jl for that?
It seems all the spotlight is on Flux.jl, but Knet.jl appears to be decent, and then some, without getting enough credit.
I wish it were simply a PyTorch written in Julia (or at least heavily inspired by it).

3 Likes

I already have both a Flux and a PyTorch implementation of my code, and I’m not really looking to make another one. Whenever I encounter something that Flux cannot do, I implement it in PyTorch instead (which is a shame because I was hoping to completely move away from it, but hopefully that will come one day).

So as long as Flux and Julia seem to be actively developed, I will likely keep my code in Flux.

This seems like a trend to note and keep in mind for future Julia ML feature development / roadmap etc. >>

38 results for ERROR: Out of gpu memory here >> Search results for 'ERROR: Out of gpu memory ' - JuliaLang

Also relevant: @denizyuret’s work on large neural machine translation models based on RNNs and Transformers, described here >> CuArray allocation issue: How often is it a problem?, and memory allocation errors like ERROR: Out of gpu memory for function vgg_m(x0,weights), described here >> Gpu out of memory.

Also note that one human genome comes to approximately 100GB of raw data and 250GB of analyzed files. I think we are starting to see useful large machine learning data sets that are either completely intractable or at best cost-prohibitive using anything other than high-performance CPUs and relatively inexpensive NVMe SSD technologies with filesystem compression.

For large-memory models I think it is very helpful that Knet has its own mechanism for file IO, so please continue to develop the Julia KnetArray memory allocator so that we can implement machine learning models using CPUs with NVMe SSDs and filesystem compression for data sets larger than the 11GB GPU memory limitation.

3 Likes