It’s possible there is some type instability happening. Here is your original code in a function:

```
function foo!(B, y, By)
N = length(y)
for i in 1:N
By[i] = 0.0
for j in 1:N
By[i] += B[i, j] * y[i]
end
end
end
```

And I get:

```
julia> @time foo!(B, y, By)
0.956673 seconds
```

Now let’s make it better. First, I simply change `B[i, j]`

to `B[j, i]`

to match Julia’s column-oriented matrices. Then I have:

```
julia> @time foo!(B, y, By)
0.031369 seconds
```

Now let me rewrite it a little:

- Instead of accessing
`y[i]`

and `By[i]`

in the inner loop, I pull them out to the outer loop so only `B`

needs to be accessed in the inner loop
- You don’t need to multiply by
`y[i]`

at each iteration, just multiply the sum by it, it’s the same
- I use
`@inbounds @simd`

to enable SIMD and disable bounds checks. This requires me to be a little more strict with the types it accepts, and to have some checks at the top of the code.

The code is now

```
function foo2!(B::Matrix, y::Vector, By::Vector)
N = length(y)
size(B) == (N, N) || error()
@inbounds for i in eachindex(y, By)
Byi = 0.0
@inbounds @simd for j in 1:N
Byi += B[j, i]
end
By[i] = Byi * y[i]
end
end
```

And the timing:

```
julia> @time foo2!(B, y, By)
0.014844 seconds
```

Of course, you could also just write it simpler, like this:

```
function foo3!(B, y, By)
columns = eachcol(B)
for i in eachindex(columns, y, By)
By[i] = sum(columns[i]; init=zero(eltype(B))) * y[i]
end
end
```

However, this will be somewhat slower due to the views and the pairwise summation strategy:

```
julia> @time foo3!(B, y, By)
0.029927 seconds
```