That’s interesting, normally your computer smokes mine. Perhaps it’s because lsexp_mat3 is launching too many threads? It has an extremely crude calculation, which here decides that up to 1000^2 / 3636 ≈ 275 threads would be worthwhile. (And the same in the avx case, though I guess that’s compensated for.) The keyword threads=200_000 will stop it at about 5 threads (meaning 4 or 8).
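To spell out the arithmetic, here is a rough sketch. It is not the package's actual code: `estimated_threads` is a made-up name, and the default threshold of 3636 is just read off the numbers quoted above.

```julia
# Hypothetical stand-in for the crude estimate: number of elements in an
# n×n matrix divided by a per-thread work threshold (integer division).
# The keyword name `threads` mirrors the keyword mentioned above; the
# default value 3636 is an assumption taken from the quoted figures.
estimated_threads(n; threads=3636) = n^2 ÷ threads

estimated_threads(1000)                   # 275
estimated_threads(1000; threads=200_000)  # 5
```

So raising the threshold to 200_000 cuts the estimate from a few hundred threads to a handful, which then gets rounded to whatever the threading code actually uses.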
Thanks for the explanation about broadcasting. So the problem is specific to trivial (size-1) dimensions that have stride 1, regardless of the other strides involved?
@avx ones(10,10)' .* rand(10,1) # no problem
@avx ones(10,10)' .* rand(10,1)' # no problem
@avx ones(10,10)' .* rand(1,10) # problem
@avx ones(10,10)' .* rand(1,10)' # problem
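For what it's worth, here is how I'm reading the strides in those four cases, using only Base (no `@avx`). `trivial_strides` is a helper made up for this post, and this only inspects the memory layout; it doesn't show what `@avx` does with it.

```julia
# Strides of the size-1 ("trivial") dimensions of a plain Matrix.
# Hypothetical helper, for illustration only.
trivial_strides(A::Matrix) = [strides(A)[d] for d in 1:2 if size(A, d) == 1]

col = rand(10, 1)   # size (10, 1), strides (1, 10)
row = rand(1, 10)   # size (1, 10), strides (1, 1)

trivial_strides(col)  # [10]  (the "no problem" operand)
trivial_strides(row)  # [1]   (the "problem" operand)
```

Taking the adjoint only swaps the two dimensions, so the trivial dimension keeps its stride: 10 for `rand(10,1)'` and 1 for `rand(1,10)'`, which matches the no problem / problem split above.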