ANN: Knet 1.4.0: accelerating CuArrays

I just released Knet 1.4.0, a major refactoring of the code into many submodules that prepares for future improvements and (hopefully) leaves the current API unaffected.

My first goal for this release was to fully support CuArrays without any performance penalty. With generous help from @maleadt this was mostly achieved:

  • This table gives Array/KnetArray/CuArray benchmarks for two dozen operators commonly used in deep learning (defined in the Knet.Ops20 module).
  • This list shows notebooks and examples tested and marks the decreasing number that still lag in performance with CuArrays.

To get this far, Tim tweaked some array code in CUDA.jl and I handled the rest by binding CuArray functions to Knet kernels in the Knet.CuArrays module. Hopefully, with further tweaking of CUDA.jl and performance improvements using CUDNN by @gartangh, the Knet kernels will eventually become unnecessary.

To ease installation, every dependency on GPU libraries outside of CUDA.jl was eliminated, and libknet8.so, which contains the Knet kernels, is now downloaded automatically as an artifact. If you have a GPU driver and a functioning CUDA.jl, Knet should work out of the box (with no need for CUDA compiler/toolkit installation).

My second goal for this release was to lay the groundwork for supporting multiple operator/layer/model sets with backward compatibility. I want the user to easily switch between different versions of Knet vs NNlib vs Torch operators, ONNX vs Keras layers, and have a standard interface for loading, saving, training and running state of the art models like Yolo, BERT, ResNet, GPT etc. I have a vague idea about how to do this with submodules, but we’ll see how it goes. For now I split everything in Knet into these semi-independent submodules:

  • Knet.LibKnet8: library of hand-written CUDA kernels.
  • Knet.KnetArrays: the KnetArray type and its Base function implementations.
  • Knet.CuArrays: performant versions of some CuArray Base functions.
  • Knet.AutoGrad_gpu: AutoGrad support for KnetArrays and CuArrays.
  • Knet.FileIO_gpu: FileIO functions for KnetArrays (CuArray support still needed).
  • Knet.Ops20: A sample operator set, about 25 operators out of which all Knet models are currently written: conv4, pool, batchnorm, dropout, relu, nll… This module provides documentation, generic implementations and gradients.
  • Knet.Ops20_gpu: KnetArray and CuArray implementations for Knet.Ops20.
  • Knet.Train20: Model training and data manipulation functions: minibatch, adam, etc.

The idea is for users to import the specific operator/layer/model submodules their application needs, and for me to avoid breaking these applications by adding new submodules instead of changing existing ones. For now v1.4.0 exports exactly the same functions as v1.3.9, so hopefully I won’t break any existing code. Starting with v2.0.0 I will require users to import the specific submodules they are using.
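
To make the "add a new set instead of changing an old one" idea concrete, here is a minimal sketch. It is written in Python rather than Julia and is not Knet code: the set names ops20 and ops21, the functions relu_v20 and relu_v21, and the helper use are all hypothetical, chosen only to illustrate how an application can pin the operator set it was written against.

```python
from types import SimpleNamespace


def relu_v20(x):
    """ReLU as frozen in the hypothetical 'ops20' set."""
    return [max(0.0, v) for v in x]


def relu_v21(x, leak=0.0):
    """A later, hypothetical 'ops21' set is free to change the signature."""
    return [v if v > 0 else leak * v for v in x]


# Once published, a set is never edited; new behaviour goes into a new set.
OPERATOR_SETS = {
    "ops20": SimpleNamespace(relu=relu_v20),
    "ops21": SimpleNamespace(relu=relu_v21),
}


def use(name):
    """Return the operator set an application explicitly asked for."""
    return OPERATOR_SETS[name]


ops = use("ops20")             # old application code keeps working unchanged
print(ops.relu([-1.0, 2.0]))   # [0.0, 2.0]
```

Under this assumption, code written against "ops20" is unaffected when "ops21" appears, which is the backward-compatibility property described above.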

54 Likes

Knet is an amazing project.
It deserves more attention and recognition as it is ready for prime time.

Thank you for your great effort making it so good.

9 Likes

It is amazing to see the progress here, and to see more of the Julia deep learning infrastructure being unified. I think getting CUDA.jl good enough for Knet to drop the C++ kernels will be a great step towards easy installation and adoption.

-viral

9 Likes

Can I read between the lines that Knet is competitive wherever “slower” is not mentioned? I realize you want to keep track of that, but is it (or can it be) sometimes (much) faster? I only saw “faster” in two cases, and both were struck out:

resnet: mode=2 is not supported for CPU pool. Knet 50% faster:
2017-Vaswani: s2s transformer Knet is 50% faster.

so I assume that was a timing error.

Where I do otherwise see “faster”, I suspect the numbers could be outdated (on the Knet side, or the other):

Knet Benchmarks (Sep 30, 2016)

DyNet Benchmarks (Dec 15, 2017)

KnetArrays are usually faster. I find the reasons and fix CuArrays (in Knet.jl) so that they catch up. That’s what the crossed-out lines indicate.

6 Likes

Hi,

I would be interested to know how difficult it would be to make Flux and Knet have more similar APIs. For example, Knet could support rules defined in ChainRules so that I could easily swap the AD engine underneath, or I could easily use KnetArrays and their shiny kernels inside Flux.
That would be great for me, and I guess it would be good for wider adoption as well. On the other hand, the amount of work would probably be enormous.
Tomas

My current plan is to apply all KnetArray speed-ups to CuArrays, so I can stop supporting a separate array type and we can all use CuArrays. Currently some of the speed-ups are due to the hand-written CUDA kernels Knet ships with. As an intermediate step I am redefining a few CuArray methods in the Knet.jl package to improve their speed (see here for a list). At some point these tricks may be integrated into CUDA.jl with @maleadt. Until then, just adding "using Knet" to your project will speed things up with CuArrays even if you are using Flux to define your model (i.e. you do not have to use KnetArrays to get the speed-ups).

I am open to making Flux and Knet have more similar APIs. As mentioned above:

So my vague plan is to make Knet support multiple APIs, so if somebody likes the operator/layer set from Flux, PyTorch, Tensorflow or ONNX, one can just include the relevant submodule and use that style (they all pretty much call the same low level functions). So it shouldn’t be too difficult to have Knet recognize all Flux ops/layers and I welcome any contributions/suggestions on this.

Swapping the AD engine is a bit harder, as AD is coupled more tightly to the code than, say, the array type. I have suggested on multiple occasions that we have a standard AD interface, just like we have a standard Array interface, so people can swap AD packages more easily. However, it has been tougher to get the developers of the various AD packages ([1], [2], [3], etc) on the same page (presumably because it is difficult to design a single interface that can handle the various AD mechanisms used).

On a high-level note: I fully support the convergence of ML/AD/GPU tools within Julia. However, my first priority is to keep Knet functioning because I, and a bunch of others, use it for our daily work. During the 5 years of Knet’s existence the Julia tools for ML/AD/GPU have been in a constant state of flux (no pun intended ;)) and have come at various levels of performance, completeness and stability. I occasionally test them and adopt the ones that I find performant, complete, and stable (e.g. CuArrays recently), and maintain custom solutions for the rest. I encourage other package maintainers to adopt Knet’s solutions for performance and completeness if they find them useful. I think once we achieve parity in performance and functionality we will converge on a standard package in each of these low-level domains (GPU arrays, DL operators, AD). I also believe higher-level interfaces (layers and models) are more important for wider adoption and I intend to devote more time to them.

16 Likes

Thanks a lot Deniz,

I am in similar shoes to yours. I am a co-founder of https://github.com/pevnak/Mill.jl, which together with https://github.com/pevnak/JsonGrinder.jl makes it easy to learn over JSONs. Since it is our workhorse, we are afraid of big changes. Nevertheless, we are eager for performance, and although Flux / Zygote is super nice and easy, the speed of Knet is attractive. We are using the framework on really big samples, so speed is becoming more and more important. I considered Knet a few years ago, but I was not able to add a few special rules I needed, so I opted for Flux.

If you find the project interesting, I would be interested in making it Knet-compatible.

Best wishes,
Tomas

3 Likes

Pragmatism is very important. I’d argue it is the key to having something that works in practice.
Knet is great. I’d be even happier if you kept its style more PyTorch-like and less Flux-like.

1 Like

First of all, thanks Deniz for your amazing work developing Knet. :slight_smile:

I concur with Tomas that API compatibility (if not convergence) between Knet and Flux is desirable and I am delighted to see the recent progress that has been made in this direction.

More generally, I think it would be very useful to develop guidelines for developers who wish to have their ML libraries compatible with both Flux and Knet. For example, my AlphaZero.jl is compatible with both frameworks, but this comes at a high cost for both myself and users:

  • Users are currently forced to install both Knet and Flux as dependencies even if they only plan to use one framework.
  • I had to re-implement the Flux layers API in Knet (and now I have to maintain it).
  • I had to write a common interface for things such as optimisers.
  • I had to rewrite a lot of standard utilities such as DataLoaders so that they could work with both frameworks.

I think that agreeing on standard solutions to these problems would be a great step forward for the Julia ML ecosystem.

PS: Knet is about 30% faster than Flux on my Connect Four benchmark. (Edit: this figure must be 1-2 months old and things may have evolved since. I will run the benchmark again and keep you updated.)

8 Likes

Thank you Jonathan for the first successful implementation of AlphaZero in Julia (which was attempted unsuccessfully more than once before :slight_smile: )

  1. I’d like to integrate some of your work (Flux layers etc) into Knet, so the next person that wants to support both frameworks has an easier time.
  2. I’d be interested in profiling your benchmark to see if there is any low-hanging performance fruit we can pick.

Happy to have a Zoom call to discuss.

3 Likes

Thanks Deniz.
I am going to rerun this benchmark and make it easier to replicate.
I would also be happy to have a Zoom call with you. (As suggested by Viral, it may also be useful to include @dhairyagandhi96 and @maleadt when we discuss performance.)

2 Likes

You’re recreating the original Alpha Zero (software), or at least its clone https://github.com/leela-zero/leela-zero where I see:

Recomputing the AlphaGo Zero weights will take about 1700 years on commodity hardware.

It’s interesting to see that your code has fewer lines than that C++ code (and probably the original). Is it also faster (maybe because of Knet, and in general because of Julia)? Any idea how fast and how good it is compared to the original (given the same weights, as far as you can compute), for example for Go or Chess?

To be clear, AlphaZero.jl is not faster than Leela Zero.

As I explain in the documentation, the aim of AlphaZero.jl is not to compete with hyper-specialized and hyper-optimized implementations such as LC0 or ELF OpenGo. These implementations are written in C++ with custom CUDA kernels and they are optimized for highly distributed computing environments. They are also very complex and therefore pretty inaccessible to students and researchers.

The philosophy of AlphaZero.jl is to provide an implementation of AlphaZero that is simple enough to be widely accessible for students and researchers, while also being sufficiently powerful and fast to enable meaningful experiments on limited computing resources. It has the simplicity of the many existing Python implementations, while being consistently between one and two orders of magnitude faster.

If you find AlphaZero.jl interesting, you may be interested in the corresponding Discourse thread.

So far, I haven’t tested AlphaZero.jl on complex games such as Chess or Go. Doing well there would indeed probably require a community effort to gather enough computing power. I’ve already seen a lot of interest in doing so and therefore I may publish a call for contributions soon. :wink:

7 Likes

If you’re interested, Leela Chess Zero (and I think Leela Zero) make all of their training data available. If you added the ability to use their training data, you would allow an individual to train a very strong net in a few days.

5 Likes

This would be interesting to try indeed!
Thanks for the tip. :slight_smile:

1 Like

Update: I ran my Connect Four benchmark again on the latest versions of Flux and Knet. Knet appears to be 1.5x faster than Flux.

Details:
I used Julia 1.5.0, Flux v0.11.1, CUDA v1.3.3 and Knet v1.4.5.
I measured how much time it takes for a randomly initialized AlphaZero agent to play 500 games of Connect Four against itself:

  • Flux: 433s (including 56s spent in the GC)
  • Knet: 279s (including 92s spent in the GC)
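
For readers who want to check the arithmetic behind the 1.5x figure, here is a small Python sketch using only the four timings listed above. Subtracting the GC time is a rough simplification added here: with multiple threads, GC time may not separate cleanly from the rest, so the second ratio should be read as indicative only.

```python
# Timings reported above, in seconds; "gc" is the part spent in garbage collection.
timings = {
    "Flux": {"total": 433, "gc": 56},
    "Knet": {"total": 279, "gc": 92},
}


def speedup(slow, fast):
    """How many times faster the `fast` run is than the `slow` run."""
    return slow / fast


total_ratio = speedup(timings["Flux"]["total"], timings["Knet"]["total"])

# Time outside the GC: total minus GC time for each framework.
non_gc = {name: t["total"] - t["gc"] for name, t in timings.items()}
non_gc_ratio = speedup(non_gc["Flux"], non_gc["Knet"])

print(f"overall speedup: {total_ratio:.2f}x")            # 1.55x
print(f"non-GC seconds: {non_gc}")                       # Flux 377, Knet 187
print(f"speedup excluding GC: {non_gc_ratio:.2f}x")      # 2.02x
```

Knet spends a larger share of its run in the GC (92 of 279 seconds, against 56 of 433 for Flux), so under this simplification the gap outside the GC is wider than the overall 1.5x.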

To replicate:

git clone git@github.com:jonathan-laurent/AlphaZero.jl.git --branch flux-knet-bench
export ALPHAZERO_DEFAULT_DL_FRAMEWORK="KNET" # "FLUX"
julia --project -t 6 --color=yes scripts/profile/multithreaded.jl

7 Likes

Have you tried a profiler to see where Flux is losing? In our application, learning over JSONs, we had marked missing subtrees with the value missing. We had to replace these with “empty” subtrees to keep the type information; otherwise Flux kept compiling, which was slow at the beginning, until a sufficient number of versions had been compiled.

1 Like

Would MLJ.jl be the right vehicle? It’s meant to provide a unifying interface. But I guess a lighter MLJ-like interface involving only NN libraries would be welcome too!

3 Likes

I haven’t managed to learn much using the profiler but @dhairyagandhi96 is looking into it. I think he already managed to reduce the gap to ~1.3x using a recent CUDA.jl PR.

1 Like