Hi,
I am continuing the port of the Microsoft Phi model to Transformers.jl. The most complicated part is the self-attention, where I am a bit lost, but I am gradually peeling away the abstraction layers of NeuralAttentionlib.jl to get to the gist. Right now, I am stuck on the following problem when computing the attention weights in self-attention.
In Julia, I have
julia> size(query_rot_states)
(64, 32, 6)
julia> size(key_rot_states)
(64, 32, 6)
julia> attn_weights = scaled_dot_product_score(query_rot_states, key_rot_states);
julia> size(attn_weights)
(32, 32, 6)
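If I read the shapes correctly, my Julia arrays are laid out as (head_dim, heads, seq_len) = (64, 32, 6). My guess is that scaled_dot_product_score treats a plain 3D array as (features, length, batch), which would explain the (32, 32, 6) result. A minimal sketch of what I believe is happening (plain NNlib on dummy data, my own reconstruction, not NeuralAttentionlib internals):

using NNlib  # batched_mul, batched_transpose

# dummy arrays with the same layout as my rotated states:
# (head_dim, heads, seq_len) = (64, 32, 6)
q = randn(Float32, 64, 32, 6)
k = randn(Float32, 64, 32, 6)

# if the second dimension is read as the sequence length and the third as
# the batch, the score comes out as (32, 32, 6):
scores = batched_mul(batched_transpose(k), q) ./ sqrt(Float32(size(q, 1)))
size(scores)  # (32, 32, 6)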
In Python, on the other hand, I have
>>> key_rot_states.size()
torch.Size([1, 32, 6, 64])
>>> query_rot_states.size()
torch.Size([1, 32, 6, 64])
>>> attn_weights = torch.matmul(
...     query_rot_states.to(torch.float32), key_rot_states.to(torch.float32).transpose(2, 3)
... ) / math.sqrt(sa.head_dim)
>>> attn_weights.size()
torch.Size([1, 32, 6, 6])
What surprises me is that attn_weights in Python has a different size (torch.Size([1, 32, 6, 6])) than in Julia ((32, 32, 6)). It looks as if the heads dimension (32) is being treated as the sequence length on the Julia side, but I do not see where I deviate from what scaled_dot_product_score expects.
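If I understand the Python call correctly, it is a batched matmul over the last two dimensions, i.e. one (6, 64) x (64, 6) product per head and per batch element, which is where the [1, 32, 6, 6] comes from. A hand-rolled Julia version of that computation (my own sketch, assuming I first permute my arrays so that the heads play the role of the batch dimension) would be:

using NNlib

# (head_dim, heads, seq_len) -> (head_dim, seq_len, heads), i.e. (64, 6, 32)
q = permutedims(query_rot_states, (1, 3, 2))
k = permutedims(key_rot_states, (1, 3, 2))

head_dim = size(q, 1)
attn = batched_mul(batched_transpose(k), q) ./ sqrt(Float32(head_dim))
size(attn)  # (6, 6, 32), the per-head 6x6 scores I would expect

So my question is probably: am I supposed to permute / collapse the heads dimension myself before calling scaled_dot_product_score, or is there a NeuralAttentionlib way of telling it which dimension is which?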
I have checked so far that key_rot_states and query_rot_states are the same between Julia and Python.
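For that comparison I assume the correspondence Julia (head_dim, heads, seq_len) <-> torch (batch, heads, seq_len, head_dim); the check itself is roughly the following, with key_np being the torch tensor brought over to Julia as a (1, 32, 6, 64) array (name is mine):

key_from_python = permutedims(dropdims(key_np; dims=1), (3, 1, 2))  # (1, 32, 6, 64) -> (64, 32, 6)
isapprox(key_rot_states, key_from_python; atol=1f-5)  # holds for me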
Thanks a lot in advance for any help. (@chengchingwen, if you happen to see this, I would love to know what I am doing wrong.)
Tomas