I defined a very simple function to compute the dot product, which is supposed to be equivalent to dot(x,A,y) in the package LinearAlgebra. For performance, I use Threads.@threads to make it multi-thread:

```
using BenchmarkTools, LinearAlgebra
const n=10000
x=rand(n)
A=randn(n,n)
y=rand(n)
function loop_dot(x, A, y)
s = 0.0
Threads.@threads for i = eachindex(y)
for k = eachindex(x)
s += x[k]*A[k,i]*y[i]
end
end
return s
end
```

But the benchmark

@btime loop_dot(x,A,y)

result is

```
2.720 s (200000042 allocations: 2.98 GiB)
-1397.5972007491994
```

while

@btime dot(x,A,y)

gives

```
41.147 ms (1 allocation: 16 bytes)
4172.734572614711
```

So, whatβs wrong here? Why so huge memory allocation for loop_dot(x,A,y)? Why is its result incorrect (not equal to dot(x,A,y))?