I am trying to implement the following computation on a GPU.

```
@tensor S[i,j] = A[a1,j]*F[a1,i,a2]*B[a2,j]
```

Under the strict Einstein summation convention this is not a valid expression: the index j appears twice on the right-hand side while also remaining a free index on the left, so TensorOperations.jl rejects it.

At the moment, I create a larger intermediate array,

```
@tensor temporary_location[i,j,jprime] := A[a1,j]*F[a1,i,a2]*B[a2,jprime]
```

and then copy the diagonal part (i.e., temporary_location[i,j,j]) into S. I would like to know whether there is a way to avoid the unnecessary computations in this workaround. The ranges of the indices a1, i, and j are of the same order (up to a few hundred), while the range of a2 is much smaller than that of a1, i, or j.
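In code, the workaround looks roughly like this (the sizes are small placeholders for illustration: n stands in for the shared range of a1, i, and j, and m for the much smaller range of a2):

```
using TensorOperations

n, m = 50, 5             # placeholder sizes: n for a1, i, j; m for the small a2
A = rand(n, n)           # A[a1, j]
F = rand(n, n, m)        # F[a1, i, a2]
B = rand(m, n)           # B[a2, jprime]

# Full (i, j, jprime) intermediate -- O(n^3) memory and n times more work
# than the diagonal actually requires.
@tensor temporary_location[i, j, jprime] := A[a1, j] * F[a1, i, a2] * B[a2, jprime]

# Only the diagonal slice jprime == j is kept.
S = [temporary_location[i, j, j] for i in 1:n, j in 1:n]
```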

I tried using Tullio.jl, but it seems about 20-25x slower than TensorOperations.jl.
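For reference, the Tullio version is essentially the direct expression, since Tullio accepts a repeated index on the right as long as it also appears on the left (again with small placeholder sizes):

```
using Tullio

n, m = 50, 5                       # placeholder sizes
A = rand(n, n)                     # A[a1, j]
F = rand(n, n, m)                  # F[a1, i, a2]
B = rand(m, n)                     # B[a2, j]

# Indices absent from the left (a1, a2) are summed over; j is kept, not summed.
@tullio S[i, j] := A[a1, j] * F[a1, i, a2] * B[a2, j]
```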

Any help would be appreciated.