A discrepancy in self-attention between Python and Julia (Transformers)

Hi,

I am continuing the port of the Microsoft Phi model to Transformers.jl. The most complicated part is the self-attention, where I am lost, but I am gradually removing the abstraction layers of NeuralAttentionlib.jl to get to the gist. Right now, I am stuck on the following problem of computing the weights in self-attention.

In Julia, I have

julia> size(query_rot_states)
(64, 32, 6)

julia> size(key_rot_states)
(64, 32, 6)

julia> attn_weights = scaled_dot_product_score(query_rot_states, key_rot_states);

julia> size(attn_weights)
(32, 32, 6)

whereas in python I have

>>> key_rot_states.size()
torch.Size([1, 32, 6, 64])
>>> query_rot_states.size()
torch.Size([1, 32, 6, 64])
>>> attn_weights = torch.matmul(
...     query_rot_states.to(torch.float32), key_rot_states.to(torch.float32).transpose(2, 3)
... ) / math.sqrt(sa.head_dim)
>>> attn_weights.size()
torch.Size([1, 32, 6, 6])

What surprised me is that attn_weights in Python has a different size (torch.Size([1, 32, 6, 6])) than in Julia ((32, 32, 6)).

I have checked so far that key_rot_states and query_rot_states are the same between Julia and Python.
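For context, my (possibly wrong) understanding is that scaled_dot_product_score contracts the first (head_dim) dimension, roughly like this sketch with NNlib's batched matmul (random placeholder arrays, just to check the shapes):

using NNlib: batched_mul, batched_transpose

# toy stand-ins with the shapes from above
q = randn(Float32, 64, 32, 6)
k = randn(Float32, 64, 32, 6)

# contract the first dimension and scale by 1/sqrt(head_dim),
# which is why (64, 32, 6) inputs give (32, 32, 6) scores
scores = batched_mul(batched_transpose(k), q) ./ sqrt(Float32(size(k, 1)))
size(scores)  # (32, 32, 6)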

Thanks a lot in advance for the help. (@chengchingwen, I would love to know what I am doing wrong.)

Tomas


Just a guess, but maybe 0-indexed Python needs this to be transpose(1, 2)? (1-indexing vs. 0-indexing makes translation a bit trickier.)


But I execute the Python code in Python, so I do not think I should change the indices. It is exactly as here:

Tomas

Yes, it was a quick guess. But it seems the tensor dimensions are in the wrong order. The last two indices in Python are the ones getting matmul'ed, so they need to be the sequence length and the embedding size, respectively.
In Julia the first indices are the ‘fast’ ones, so the sequence length and embedding size should come first.
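A toy illustration of the ordering (made-up sizes, not the actual model arrays): reversing the dimension order of a column-major Julia array gives the row-major, PyTorch-style order:

x_julia = reshape(collect(1:(4 * 3 * 2)), 4, 3, 2)  # (embedding, seq_len, heads) in Julia order
x_py_order = permutedims(x_julia, (3, 2, 1))        # (heads, seq_len, embedding), PyTorch-style order
size(x_py_order)                                    # (2, 3, 4)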

I can permute the dimensions, but the results do not match (even after permutation), so there is some problem which I do not understand.

That’s the correct result. The size in Python is Size([batch_size, seq_length, num_head, head_dim]), while our Julia implementation uses (head_dim, seq_length, num_head). The difference is because 1. Julia is column-major while Python is row-major, and 2. the Python implementation does not permute the num_head and seq_length dimensions, while our Julia implementation permutes those dimensions earlier.

Actually, I checked their code. The size comment in their Python code is incorrect. Line 354 says the size is [batch_size, seq_length, num_heads, head_dim], but you can see at line 327, where query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2), that it permutes q_len and self.num_heads, so the size of query_states should actually be [batch_size, num_heads, seq_length, head_dim].
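To compare the two results elementwise, a rough sketch (placeholder arrays, assuming the Julia states are arranged as (head_dim, seq_len, num_heads) = (64, 6, 32), so the scores come out as (kv_len, q_len, num_heads)):

jl_scores = rand(Float32, 6, 6, 32)                 # placeholder for the Julia (kv_len, q_len, num_heads) scores
jl_in_torch_order = reshape(permutedims(jl_scores, (3, 2, 1)), 1, 32, 6, 6)
size(jl_in_torch_order)                             # (1, 32, 6, 6), same ordering as the PyTorch result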


Hi Peter,

thanks for the reply. I summarized the problem in this comment on Transformers.jl:
Adding phi model · Issue #167 · chengchingwen/Transformers.jl · GitHub.
The problem is that I do not understand NeuralAttentionlib well enough to make the Julia code exactly match the Python version.