I’m working through the nanoGPT tutorial by Andrej Karpathy and am stuck on a little piece of code, something like this:
import torch
B, T, C = 32, 8, 65
x = torch.randn(B, T, C)
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x
Everything above the last line is easy:
using Flux
using LinearAlgebra  # for tril, which is not exported by Flux
B, T, C = 32, 8, 65
x = randn(B, T, C)
wei = tril(ones(T, T))
wei = wei ./ sum(wei, dims=2)
For the last line, PyTorch seems to broadcast wei from (T, T) to (B, T, T) and then do a batched matrix multiplication. How do I achieve the same effect in Julia? (repeat and broadcasting don’t seem to work, or I’m doing it wrong; batched_mul?) Or is there anywhere to look this up? (My linear algebra is really rusty at this point.)
For a direct translation you will need to handle the batching yourself, e.g.
xbow2 = Compat.stack(wei * x[b, :, :] for b ∈ axes(x, 1); dims = 1)  # multiply each (T, C) slice by wei, restack along the batch dimension
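(On Julia 1.9 and later, stack lives in Base, so the Compat prefix can be dropped; on older versions you need `using Compat` first.) Equivalently, as a plain loop; a minimal sketch, assuming x and wei as defined above:

xbow2 = similar(x)                     # output has the same (B, T, C) shape
for b in axes(x, 1)
    xbow2[b, :, :] = wei * x[b, :, :]  # (T, T) * (T, C) -> (T, C) per batch element
end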
The real issue is that Julia arrays are column-major whereas Torch’s are row-major. Flux therefore has the tensor dimensions reversed compared to Torch, i.e., the batch dimension last instead of first, and batched matrix multiplication works correctly once you also translate the tensors:
x = randn(C, T, B)  # batch dimension last, per Julia conventions
xbow2 = batched_mul(x, wei') # Note: Transpose wei as it was constructed according to Torch conventions
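To convince yourself the two conventions compute the same thing, you can permute one result into the other’s layout and compare. A minimal check, assuming Julia ≥ 1.9 for Base.stack, with batched_mul coming from NNlib (re-exported by Flux):

using Flux, LinearAlgebra

B, T, C = 32, 8, 65
wei = tril(ones(T, T))
wei = wei ./ sum(wei, dims=2)

x_torch = randn(B, T, C)                   # Torch layout: batch first
x_julia = permutedims(x_torch, (3, 2, 1))  # Julia layout: (C, T, B), batch last

# Torch-style: loop over the batch, left-multiply each slice by wei
ref = stack(wei * x_torch[b, :, :] for b in axes(x_torch, 1); dims = 1)

# Julia-style: one batched multiplication against the transposed weights
out = batched_mul(x_julia, wei')

ref ≈ permutedims(out, (3, 2, 1))          # true: same numbers, different layout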
The real issue is that Julia arrays are column-major whereas Torch is row-major.
Yes, exactly this. I’d noticed the difference in earlier code but hadn’t yet wrapped my head around it (some one-hot encoding stuff; I was able to line things up with Julia’s defaults and avoid reshaping to directly match Torch).
Thanks for pointing it out! The goal here is not to mimic Torch’s behavior but to work through the tutorial and learn something about transformers/Julia.