Flux Transformer Out of Memory

No and no. To clear up the confusion, we should first establish where permutedims is defined. Unlike PyTorch, where the framework defines and provides implementations for each operator, many functions you call in Flux models are defined elsewhere. In the case of permutedims, that elsewhere is the Julia standard library (Base). A good analogy would be if you could use numpy.transpose instead of torch.Tensor.transpose in PyTorch and have it just work.
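
As a quick illustration of that split, here's a minimal sketch in plain Julia (no Flux loaded; the array and permutation are just placeholders):

```julia
# Plain Julia session, no Flux required: permutedims is defined in Base.
parentmodule(permutedims)        # Base

x = rand(Float32, 2, 3, 4)
y = permutedims(x, (2, 1, 3))    # eagerly materializes a rearranged copy
size(y)                          # (3, 2, 4)
```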

However, that just pushes the question upstream: why does the stdlib permutedims copy? I don't know the definitive answer, but I've asked around for a historical record of this decision and will update this thread if/when I receive one.

The more practically relevant answer is that you can perform a non-copying dimension permutation with PermutedDimsArray, which is also part of Base and is essentially what PyTorch does under the hood. The biggest caveat of PermutedDimsArray is that not every user-defined function understands the wrapper and takes the most efficient codepath. Relevant to this thread's example, note how the last post in the second linked thread mentions NNlib's batched_mul routines, and what shows up in their docs? PermutedDimsArray. This is precisely why I linked the definition of dot_product_attention in NNlib: it shows how to use these tools to write an efficient attention operation that works much like the PyTorch one.
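
To make the distinction concrete, here's a rough sketch assuming NNlib is installed; the shapes are placeholders rather than anything taken from the thread's model:

```julia
using NNlib  # provides batched_mul (and dot_product_attention)

x = rand(Float32, 64, 10, 32)          # (features, seq_len, batch) -- illustrative only

# Eager: allocates a brand-new array with the data rearranged.
xc = permutedims(x, (2, 1, 3))

# Lazy: a zero-copy wrapper that reindexes the parent on access.
xv = PermutedDimsArray(x, (2, 1, 3))
xv[1, 2, 3] == x[2, 1, 3]              # true -- reads go through to x

# batched_mul accepts strided wrappers like this, so a
# "transpose, then batched matmul" pattern need not copy first.
y = batched_mul(xv, x)                 # size (10, 10, 32) for these shapes
```

Whether the wrapped path hits the fastest kernel still depends on the permutation (keeping the batch dimension last helps), which is exactly the kind of detail dot_product_attention deals with for you.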

All that said, I think this more or less addresses your follow-up question:

  1. Most of the “Flux” performance here is actually Base Julia performance and should be discussed accordingly.
  2. Direct translations are often not apples-to-apples, and knowing the idiomatic patterns in each language (e.g. PermutedDimsArray to avoid copies) can make a big difference.
  3. Because DL frameworks have such large API surface areas and NN models vary greatly, “competitiveness” is always context-sensitive. PyTorch definitely gets the most engineering effort towards optimizing its operations, but that doesn't always translate into a performance win.