A discrepancy in self-attention between Python and Julia (Transformers)

Hi,

I am continuing the port of the Microsoft Phi model to Transformers.jl. The most complicated part is the self-attention, where I am lost, but I am gradually removing the abstraction layers of NeuralAttentionlib.jl to get to the gist. Right now, I am stuck on the following problem of computing the weights in self-attention.

In Julia, I have

julia> size(query_rot_states)
(64, 32, 6)

julia> size(key_rot_states)
(64, 32, 6)

julia> attn_weights = scaled_dot_product_score(query_rot_states, key_rot_states);

julia> size(attn_weights)
(32, 32, 6)

whereas in python I have

>>> key_rot_states.size()
torch.Size([1, 32, 6, 64])
>>> query_rot_states.size()
torch.Size([1, 32, 6, 64])
>>> attn_weights = torch.matmul(
...     query_rot_states.to(torch.float32), key_rot_states.to(torch.float32).transpose(2, 3)
... ) / math.sqrt(sa.head_dim)
>>> attn_weights.size()
torch.Size([1, 32, 6, 6])

What surprised me is that attn_weights in Python has a different size (torch.Size([1, 32, 6, 6])) than in Julia ((32, 32, 6)).

I have checked so far that key_rot_states and query_rot_states are the same between Julia and Python.
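For context, my (possibly wrong) understanding is that scaled_dot_product_score contracts the first (head_dim) dimension, roughly like this sketch with NNlib's batched matmul (random placeholder arrays, just to check the shapes):

using NNlib: batched_mul, batched_transpose

# toy stand-ins with the shapes from above
q = randn(Float32, 64, 32, 6)
k = randn(Float32, 64, 32, 6)

# contract the first dimension and scale by 1/sqrt(head_dim),
# which is why (64, 32, 6) inputs give (32, 32, 6) scores
scores = batched_mul(batched_transpose(k), q) ./ sqrt(Float32(size(k, 1)))
size(scores)  # (32, 32, 6)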

Thanks a lot in advance for the help. (@chengchingwen, I would love to know what I am doing wrong.)

Tomas


Just a guess, but maybe 0-indexed Python needs this to be transpose(1, 2)? (1-indexing vs. 0-indexing makes translation a bit trickier.)


But I execute the Python code in Python, so I do not think I should change the indices. It is exactly as here:

Tomas

Yes, it was a quick guess. But it seems the tensor dimensions are in the wrong order. The last two indices in Python are the ones getting matmul'ed, so they need to be the sequence length and the embedding size, respectively.
In Julia the first indices are the ‘fast’ ones, so the sequence length and embedding size should come first.
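A toy illustration of the ordering (made-up sizes, not the actual model arrays): reversing the dimension order of a column-major Julia array gives the row-major, PyTorch-style order:

x_julia = reshape(collect(1:(4 * 3 * 2)), 4, 3, 2)  # (embedding, seq_len, heads) in Julia order
x_py_order = permutedims(x_julia, (3, 2, 1))        # (heads, seq_len, embedding), PyTorch-style order
size(x_py_order)                                    # (2, 3, 4)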

I can permute the dimensions, but the results do not match (even after permutation), so there is some problem which I do not understand.

That’s the correct result. The size in Python is Size([batch_size, seq_length, num_head, head_dim]), while our Julia implementation uses (head_dim, seq_length, num_head). The difference is because 1. Julia is column-major while Python is row-major, and 2. the Python implementation does not permute the num_head and seq_length dimensions, while our Julia implementation permutes those dimensions earlier.

Actually, I checked their code. The size comment in their Python code is incorrect. Line 354 says the size is [batch_size, seq_length, num_heads, head_dim], but you can see at line 327, where query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2), that it permutes q_len and self.num_heads, so the size of query_states should actually be [batch_size, num_heads, seq_length, head_dim].
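To compare the two results elementwise, a rough sketch (placeholder arrays, assuming the Julia states are arranged as (head_dim, seq_len, num_heads) = (64, 6, 32), so the scores come out as (kv_len, q_len, num_heads)):

jl_scores = rand(Float32, 6, 6, 32)                 # placeholder for the Julia (kv_len, q_len, num_heads) scores
jl_in_torch_order = reshape(permutedims(jl_scores, (3, 2, 1)), 1, 32, 6, 6)
size(jl_in_torch_order)                             # (1, 32, 6, 6), same ordering as the PyTorch result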


Hi Peter,

thanks for the reply. I summarized the problem in this comment on Transformers.jl:
Adding phi model · Issue #167 · chengchingwen/Transformers.jl · GitHub.
The problem is that I do not understand NeuralAttentionlib well enough to make the Julia code exactly match the Python version.