I’m getting some odd behavior when combining `@threads` and Octavian. If I time `matmul!` and a simple `tmap!` implementation separately, I see a reasonable speedup over their single-threaded counterparts.

```
using BenchmarkTools  # provides @btime
using Octavian
using LinearAlgebra

A = randn(8,8)
x = randn(8,10000)
b = similar(x)
f(x) = exp(x+1) + sin(x)

# Threaded elementwise map: out[i] = f(x[i])
function tmap!(f,out,x)
    Threads.@threads for i = 1:length(x)
        out[i] = f(x[i])
    end
end

@btime mul!($b,$A,$x) # 14.778 μs (0 allocations: 0 bytes)
@btime matmul!($b,$A,$x) # 4.702 μs (0 allocations: 0 bytes)
@btime tmap!($f,$b,$x) # 195.632 μs (41 allocations: 3.16 KiB)
@btime map!($f,$b,$x) # 1.126 ms (0 allocations: 0 bytes)
```

However, if I call the two back to back inside a single function, I get drastically different timings:

```
# Octavian matmul! followed by threaded map
function time1(b,A,x)
    matmul!(b,A,x)
    tmap!(f,b,x)
end

# LinearAlgebra mul! followed by threaded map
function time2(b,A,x)
    mul!(b,A,x)
    tmap!(f,b,x)
end

# LinearAlgebra mul! followed by serial map
function time3(b,A,x)
    mul!(b,A,x)
    map!(f,b,x)
end

@btime time1($b,$A,$x) # 42.934 ms (42 allocations: 3.19 KiB)
@btime time2($b,$A,$x) # 323.995 μs (41 allocations: 3.16 KiB)
@btime time3($b,$A,$x) # 1.404 ms (0 allocations: 0 bytes)
```

That is, running `matmul!` and `tmap!` together in the function `time1` is about 200x slower than the sum of the individual timings of `matmul!` and `tmap!`.
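For reference, here is where the "about 200x" figure comes from. This is just arithmetic on the timings quoted in the comments above; the variable names are only for illustration.

```
# Timings copied from the @btime comments above, converted to seconds
t_matmul = 4.702e-6    # matmul! alone
t_tmap   = 195.632e-6  # tmap! alone
t_time1  = 42.934e-3   # matmul! followed by tmap! inside time1

# Slowdown of time1 relative to the sum of the two separate timings
ratio = t_time1 / (t_matmul + t_tmap)
println("time1 slowdown: ", round(ratio, digits=1), "x")
```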

Can anyone explain what’s happening here?