Flux Transformer Out of Memory

No and no. To clear up the confusion, we should first establish where permutedims is defined. Unlike PyTorch, where the framework defines and provides implementations for each operator, many functions you call in Flux models are defined elsewhere. In the case of permutedims, that elsewhere is the Julia standard library (Base). A good analogy would be if you could use numpy.transpose instead of torch.Tensor.transpose in PyTorch and have it just work.
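
As a quick illustration of that split, here's a minimal sketch in plain Julia (no Flux loaded; the array and permutation are just placeholders):

```julia
# Plain Julia session, no Flux required: permutedims is defined in Base.
parentmodule(permutedims)        # Base

x = rand(Float32, 2, 3, 4)
y = permutedims(x, (2, 1, 3))    # eagerly materializes a rearranged copy
size(y)                          # (3, 2, 4)
```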

However, that just pushes the question upstream: why does the stdlib permutedims copy? I don't know the definitive answer, but I've asked around for a historical record of this decision and will update this thread if/when I receive one.

The more practically relevant answer is that you can perform a non-copying dimension permutation with PermutedDimsArray, which is also part of Base and is essentially what PyTorch does under the hood. The biggest caveat of PermutedDimsArray is that not every user-defined function understands the wrapper and takes the most efficient codepath. Relevant to this thread's example, note how the last post in the second linked thread mentions NNlib's batched_mul routines, and what shows up in their docs? PermutedDimsArray. This is precisely why I linked the definition of dot_product_attention in NNlib: it shows how to use these tools to write an efficient attention operation that works much like the PyTorch one.
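
To make the distinction concrete, here's a rough sketch assuming NNlib is installed; the shapes are placeholders rather than anything taken from the thread's model:

```julia
using NNlib  # provides batched_mul (and dot_product_attention)

x = rand(Float32, 64, 10, 32)          # (features, seq_len, batch) -- illustrative only

# Eager: allocates a brand-new array with the data rearranged.
xc = permutedims(x, (2, 1, 3))

# Lazy: a zero-copy wrapper that reindexes the parent on access.
xv = PermutedDimsArray(x, (2, 1, 3))
xv[1, 2, 3] == x[2, 1, 3]              # true -- reads go through to x

# batched_mul accepts strided wrappers like this, so a
# "transpose, then batched matmul" pattern need not copy first.
y = batched_mul(xv, x)                 # size (10, 10, 32) for these shapes
```

Whether the wrapped path hits the fastest kernel still depends on the permutation (keeping the batch dimension last helps), which is exactly the kind of detail dot_product_attention deals with for you.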

All that said, I think this more or less addresses your follow-up question:

  1. Most of the “Flux” performance here is actually Base Julia performance and should be discussed accordingly.
  2. Direct translations are often not apples-to-apples, and knowing the idiomatic patterns in each language (e.g. PermutedDimsArray to avoid copies) can make a big difference.
  3. Because DL frameworks have such large API surface areas and NN models vary greatly, “competitiveness” is always context-sensitive. PyTorch definitely gets the most engineering effort towards optimizing its operations, but that doesn't always translate into a performance win.