It seems like the fastest NN library is Torch.jl, which wraps PyTorch's C++ backend (libtorch). Is it possible for Julia to have something similarly fast in pure Julia? What's preventing it?
Given Julia is meant to be able to solve the two language problem, it seems possible to write something as fast as Torch.jl but in Julia.
I’m not involved in the development, but it seems like a similar reason to why OpenBLAS or MKL is used in Julia: a bunch of well-optimized kernels are already available, so why not use them? Once equivalents are written in native Julia, it should be fairly easy to switch over. It’s good not to have NIH syndrome (Not invented here - Wikipedia).
The two language problem exists in language “user” space. For example, Julia doesn’t eliminate the need for Fortran and OpenBLAS, or for a Lisp-like language for parsing. But the point is that the “user” never has to touch that code in order to be productive.
The same assumption applies to the contents of Torch.jl: you don’t need to hack on those routines to make progress in your own work.
I see. So the things in Torch.jl don’t ever interact with Julia code, so they don’t need to be composable, and hence don’t need to be in Julia to start with.
No, that’s a cop-out. And it’s not true either.
That’s the same excuse given for NumPy not being written in Python.
User space is developer space.
Julia solving the two language problem means that it is able to do all those things.
You can build a BLAS or LAPACK in Julia (stuff that is normally done in Fortran).
It’s just why would you, when OpenBLAS exists. Solving the two language problem doesn’t mean reinventing the world.
In fact, we are great at using other existing code; that’s what the whole BinaryBuilder game, and PyCall, RCall, JavaCall, etc. are all about.
Solving the two language problem means you always can reinvent the world, not that you have to.
Torch.jl is exactly the same category as OpenBLAS as far as I know.
and NNPACK and cuDNN, which we also use.
The femtolisp-based parser is a bit different. There are bootstrapping problems if you implement the parser in Julia: who parses the parser?
I don’t think that’s true, for the same reasons that it isn’t true for OpenBLAS.
If we had a Julia OpenBLAS, we would be able to get much faster matmul for arrays of scalar types implemented in Julia (see the sketch below).
And just like with OpenBLAS, there is ongoing work to get NN kernels implemented in Julia that are just as fast, or faster.
But why not use Torch.jl or OpenBLAS now for the cases that they do support efficiently?
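To make that scalar-type point concrete: OpenBLAS only ships kernels for the standard BLAS element types, so matmul over a scalar type defined in Julia (Dual numbers, Float16, fixed-point, etc.) falls back to a slow generic path today. A deliberately naive sketch of the kind of generic kernel a pure-Julia BLAS would start from (the name is illustrative, not from any existing package):

```julia
# Works for any Julia scalar type, which OpenBLAS cannot be called for.
# A real Julia BLAS would layer blocking, SIMD and threading on top of
# exactly this kind of generic code.
function matmul!(C, A, B)
    fill!(C, zero(eltype(C)))
    for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
        C[i, j] += A[i, k] * B[k, j]
    end
    return C
end
```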
This
Working out cache-friendly block sizes and other similar micro-optimizations is a lot of work, so it’s not going to happen spontaneously.
IMO it’s a bit of a toss-up. Yes, Torch.jl obviates the need to write and maintain high-performance kernels, but the PyTorch/Julia impedance mismatch is not trivial to work around:
Row- vs column-major is a problem for any numerical array interop, as many ops work on fixed dim positions (e.g. batch-first in PyTorch vs batch-last in Flux; there’s a short sketch of this and the broadcasting point after this list). IIRC there’s still some funkiness around how size and show work for Torch.jl Tensors because output dims are often in the “wrong” order. An honourable mention for strides here as well.
Implicit vs explicit broadcasting is a somewhat leaky abstraction in Torch.jl right now, because libtorch will perform numpy-style implicit broadcasting based on operand dimensions. This means that a naive approach that passes through all ops will unexpectedly not fail where normal Julia broadcasting would, but rather return incorrect results (see the point above about dim ordering). Torch.jl does override broadcast for + et al., but that scales at least linearly with the number of ops and messes with stuff like broadcast fusion. Perhaps BroadcastStyle would help here?
CUDA interop: there is support for passing GPU buffers back and forth, but I’m not sure that things will function as expected once cross-thread or cross-process synchronization enters the picture.
General maintenance: my understanding is that a version of libtorch needs to be compiled for each major CUDA release. The primary C-C++ wrapper (from https://github.com/LaurentMazare/ocaml-torch) also doesn’t have full Tensor API coverage and thus requires writing new C shims for exposing additional functionality.
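To make the first two points concrete, here is a minimal sketch (array names and shapes are made up) of the dim-ordering and implicit-broadcasting mismatches:

```julia
# Dim ordering: Flux convolutions expect WHCN (batch last), libtorch expects
# NCHW (batch first), so interop generally needs a permutedims each way.
x_flux = rand(Float32, 28, 28, 1, 32)       # WHCN, batch of 32
x_nchw = permutedims(x_flux, (4, 3, 2, 1))  # layout a libtorch kernel expects

# Broadcasting: plain Julia `+` refuses mismatched shapes, while numpy/libtorch
# broadcast implicitly, so a naive passthrough can "succeed" with wrong results.
a = ones(Float32, 3, 1)
b = ones(Float32, 1, 4)
# a + b   # DimensionMismatch in plain Julia
a .+ b    # 3×4, but only because the broadcast is explicit
```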
All of this is not to say that Torch.jl isn’t valuable. My primary concern is that it’s been sold as a no-strings-attached solution for Flux performance problems, when in reality it (currently) has a much narrower scope and a distinct set of trade-offs.
So in theory Julia can solve the two language problem, but in practice there are already well-optimized tools like OpenBLAS and libtorch, so we simply use them because they are faster.
Flux.jl seemed so promising a couple of years ago but has definitely lost its lustre (my friend who had been using Julia moved to PyTorch). I tried Knet.jl, but I found it impossible to comprehend even for simple cases.
I thought Julia would be great for NNs, but it turns out that is not true as of today. I tried some NNs in Julia and PyTorch, and the PyTorch version was faster on both CPU and GPU. So I am moving to PyTorch too.
@xiaodai Same for me. I have followed Flux.jl for a long time and tried to start using it, expecting it to be as promising as PyTorch once Zygote.jl was in use. For now, I have also moved to PyTorch.
Keno has been doing some promising AD work that is already performing much better than Zygote in terms of compile- and run-times in a few examples (in particular, it’s capable of nested AD which even in simple cases was too much for Zygote).
I think the better AD will be a major quality of life improvement.
I plan to start working more on the CPU side of things once I’m done with the next major release of LoopVectorization (although it’ll probably be a while), e.g. to make sure it’s using SIMD special functions. I also plan to start experimenting with things like different memory layouts and supporting different batching/threading behavior. In many cases, evaluating different batches on different threads but making each batch itself serial can give a dramatic performance boost on CPU.
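A minimal sketch of that batch-level threading idea (assuming `model` is callable and `batches` is a plain vector of inputs; none of this is existing Flux API):

```julia
using Base.Threads

# Each batch is evaluated serially, but different batches run on different
# threads; start Julia with e.g. `julia -t 4` for this to use multiple cores.
function threaded_forward(model, batches::Vector)
    outputs = Vector{Any}(undef, length(batches))
    @threads for i in eachindex(batches)
        outputs[i] = model(batches[i])
    end
    return outputs
end
```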
The problem with Flux.jl was that it was too simple. “Oh look how simple the code is” is cute in parts, but libraries exist to handle the hard parts. Library code should be easy to understand, but how the internals look is not a selling point of the library itself. I think this is what had previously been done incorrectly: it was too much about aesthetics and not enough about actually solving the problem.
It’s moving in the right direction now, thanks in large part to @dhairyagandhi96 and @Elrod, making the standard tools of Flux as optimal as possible. The new versions of the layers are adding @avx and all sorts of things to ensure that everything uses SIMD. It’s now beating FastDense, the DiffEqFlux version that used to be 20x faster than using Dense in a Neural ODE, which was a bit of a hack to get around the fact that we couldn’t add the performance enhancements to the Flux library directly. It would be nice to have a SimpleDense just for show, but Dense itself shouldn’t be constrained to stay simple: it should even call inline assembly if that’s what it takes to have the fastest kernel. Flux is now moving in this direction, and I think actual users will enjoy the extra performance.
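For a flavour of what those @avx-annotated kernels look like, here is a hypothetical hand-rolled dense layer (matvec, bias, relu) in the style of LoopVectorization’s examples; it is not the actual Flux or FastDense code:

```julia
using LoopVectorization  # provides @avx (renamed @turbo in later releases)

function dense_relu!(out, W, x, b)
    @avx for i in axes(W, 1)
        acc = zero(eltype(out))
        for j in axes(W, 2)
            acc += W[i, j] * x[j]
        end
        # fused bias add + relu on the accumulated dot product
        out[i] = max(acc + b[i], zero(eltype(out)))
    end
    return out
end
```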
There are some things here that can be optimized even beyond PyTorch thanks to the generation of fused kernels (though not necessarily beyond TensorFlow, because it also fuses kernels), and that is exactly what the Torch.jl approach isn’t able to capture.
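As a toy illustration of the kind of fusion meant here: Julia compiles the whole dotted expression below into a single loop over the output, whereas chaining individual libtorch ops materializes an intermediate array per op.

```julia
# Only the matvec `W * x` allocates a temporary; the bias add, the sigmoid and
# the in-place write are all fused into one pass over `y`.
W, x, b = rand(Float32, 64, 128), rand(Float32, 128), rand(Float32, 64)
y = similar(b)
σ(z) = one(z) / (one(z) + exp(-z))
y .= σ.(W * x .+ b)
```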
This thread is arguably the best “state of Flux”/roadmap I’ve read. Using Discourse, Slack, GitHub and Zulip questions as a sample, eliminating opaque/unexpected errors with the new AD would go a long way towards improving UX. That said, there are still many architectural sharp corners in Flux that could use filing down (e.g. anything RNN-related). It would be nice to get the model zoo out of purgatory as well, so that folks actually have a working reference for anything more complex than a Chain… I created a “pre-triage” issue for it on the tracker, but perhaps that should be expedited.
Edit: the ~30s load time could use a dramatic improvement as well. TTFP (time to first plot) was a great success, so given how many dependents Flux has, reducing the time to first gradient would be a significant QoL improvement too (long CI waits come to mind).
I personally see Zygote as a nicely flexible AD tool. The kernels themselves are probably not as tuned as in TF or PyTorch, but I really like the flexibility and generality.
Completely anecdotally, I just ported a training loop at work from Python/PyTorch to Julia. Well, ported in the weakest sense: it’s still PyCalling into PyTorch and the data loading code, i.e. all the heavy operations. Still, it magically became 12% faster.
No, it doesn’t make a whole lot of sense. I suspect that somewhere in the interaction between Julia and Python it has to make a copy of some array, which turns out to improve matters down the line.
(The actual plan is to port the data loading, which is custom and fairly complex, to Julia. There I’m expecting real gains.)
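For context, “PyCalling into PyTorch” means roughly this kind of setup (a minimal sketch; it assumes PyCall plus a Python environment with torch installed, and the model/shapes are made up):

```julia
using PyCall

torch = pyimport("torch")

# Heavy ops still run in PyTorch; only the surrounding loop logic is Julia.
model = torch.nn.Linear(10, 2)
x = torch.randn(32, 10)
y = model(x)   # returns a PyObject wrapping a torch.Tensor
```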