Hi,

I am continuing the port of the Microsoft Phi model to Transformers.jl. The most complicated part is the self-attention, where I am lost, but I am gradually peeling away the abstraction layers of NeuralAttentionlib.jl to get to the gist. Right now, I am stuck on the following problem when computing the weights in self-attention.

In Julia, I have

```
julia> size(query_rot_states)
(64, 32, 6)
julia> size(key_rot_states)
(64, 32, 6)
julia> attn_weights = scaled_dot_product_score(query_rot_states, key_rot_states);
julia> size(attn_weights)
(32, 32, 6)
```

whereas in python I have

```
>>> key_rot_states.size()
torch.Size([1, 32, 6, 64])
>>> query_rot_states.size()
torch.Size([1, 32, 6, 64])
>>> attn_weights = torch.matmul(
query_rot_states.to(torch.float32), key_rot_states.to(torch.float32).transpose(2, 3)
)/ math.sqrt(sa.head_dim)
>>> attn_weights.size()
torch.Size([1, 32, 6, 6])
```

What surprised me is that `attn_weights` has a different size in Python (`torch.Size([1, 32, 6, 6])`) than in Julia (`(32, 32, 6)`).

I have checked so far that `key_rot_states` and `query_rot_states` are the same between Julia and Python.
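My current guess (which may well be wrong, hence this post) is a dimension-order mismatch: if the Julia arrays were laid out as `(head_dim, seq_len, heads)` rather than `(head_dim, heads, seq_len)`, the same feature-axis contraction would give a score matching PyTorch's up to a permutation of axes. A NumPy sketch of that hypothetical layout:

```python
import numpy as np

# Hypothetical Julia layout: (head_dim, seq_len, heads) = (64, 6, 32)
q = np.random.rand(64, 6, 32).astype(np.float32)
k = np.random.rand(64, 6, 32).astype(np.float32)

# Same contraction over the feature axis, batched over the trailing heads axis
scores = np.einsum('dib,djb->ijb', k, q) / np.sqrt(64)
print(scores.shape)  # (6, 6, 32), i.e. PyTorch's (1, 32, 6, 6) with axes permuted
```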

Thanks a lot in advance for any help. (@chengchingwen, I would love to know what I am doing wrong.)

Tomas