An example: reshape + view + `'` for transpose + `mul!` will have some overhead.

```
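using Random
using LinearAlgebra
using BenchmarkTools

buf = zeros(1000)                        # backing buffer
v1 = reshape(view(buf, 1:100), (20, 5))  # 20×5 reshaped view into buf
v2 = view(v1, 1:16, :)                   # 16×5 view of the reshaped view
m3 = zeros(5, 5)
m4 = zeros(5, 16);                       # destination for m3 * v2'
rand!(v2); rand!(m3);
m2 = zeros(16, 5); m2 .= v2              # plain Matrix copy of v2, for comparison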
```

Then test:

```
julia> @btime mul!(m4, m3, v2')
249.301 ns (1 allocation: 112 bytes)
```

```
julia> @btime mul!(m4, m3, m2')
246.672 ns (1 allocation: 16 bytes)
```

```
julia> @btime BLAS.gemm!('N', 'T', 1.0, m3, m2, 0.0, m4)
223.786 ns (0 allocations: 0 bytes)
```

```
julia> @btime BLAS.gemm!('N', 'T', 1.0, m3, v2, 0.0, m4)
220.938 ns (0 allocations: 0 bytes)
```

Using `BLAS.gemm!` is nice, but `mul!` will allocate if `'` is used for the transpose, and when this happens, reshape + view makes the allocation much larger (112 bytes instead of 16 bytes in the timings above).
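One thing that might be worth ruling out: the variables in the timings above are non-constant globals, and the BenchmarkTools documentation recommends interpolating such variables with `$` so that global-variable access is not measured as part of the call. The sketch below repeats the `mul!` measurement with interpolation. It is only a check to run, not a claim about what it will report, and `adjoint(x)` is used simply as the explicit spelling of `x'`.

```julia
using LinearAlgebra
using Random
using BenchmarkTools

buf = zeros(1000)
v2 = view(reshape(view(buf, 1:100), (20, 5)), 1:16, :)  # same 16×5 view as above
m3 = rand(5, 5)
m4 = zeros(5, 16)
rand!(v2)

# Interpolate the globals so only mul! plus the adjoint wrapper is timed.
@btime mul!($m4, $m3, adjoint($v2))

# Bytes allocated per call, measured the same way.
bytes = @ballocated mul!($m4, $m3, adjoint($v2))
println("bytes allocated: ", bytes)
```

If the allocation is still there with interpolation, it comes from `mul!` itself; if not, it was most likely caused by how the benchmark was written.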

PS: When performing heavy calculations, Julia's GC seems to be inadequate (memory peaks related to GC sometimes occur). If we want to implement a memory pool on our own, would reshape + view arguably be the easiest way?
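To make the memory pool idea concrete, here is a minimal sketch of what reshape + view could look like as a bump-style pool: one buffer allocated up front, with arrays handed out as reshaped views into it. The names `BumpPool`, `alloc!` and `reset!` are hypothetical and made up for illustration; there is no per-array freeing, no thread safety, and no alignment handling.

```julia
# Hypothetical bump-style pool: one preallocated buffer, handed out as reshaped views.
mutable struct BumpPool{T}
    buf::Vector{T}   # backing storage, allocated once
    offset::Int      # number of elements already handed out
end

# Convenience constructor: a zero-filled pool holding n elements.
BumpPool{T}(n::Integer) where {T} = BumpPool{T}(zeros(T, n), 0)

# Hand out the next prod(dims) elements as an array of size dims.
function alloc!(pool::BumpPool, dims::Vararg{Int,N}) where {N}
    len = prod(dims)
    start = pool.offset + 1
    stop = pool.offset + len
    stop <= length(pool.buf) || error("pool exhausted")
    pool.offset = stop
    return reshape(view(pool.buf, start:stop), dims)
end

# Release everything at once; arrays handed out earlier must no longer be used.
reset!(pool::BumpPool) = (pool.offset = 0; pool)

pool = BumpPool{Float64}(1000)
a = alloc!(pool, 20, 5)   # occupies elements 1:100 of the buffer
b = alloc!(pool, 5, 5)    # occupies elements 101:125
a .= 1.0                  # writes go straight into pool.buf
```

The arrays returned this way have the same wrapper type as `v1` above, so the `mul!` versus `BLAS.gemm!` overhead discussed here would apply to them as well.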