Speeding up my logsumexp function

That’s interesting, normally your computer smokes mine. Perhaps it’s because lsexp_mat3 is launching too many threads? It uses an extremely crude calculation, which here decides that up to 1000^2 / 3636 ≈ 275 threads would be worthwhile. (The same happens in the avx case, but I guess it’s compensated for there.) The keyword threads=200_000 will stop it at about 5 threads (meaning 4 or 8).
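To make the arithmetic concrete, here is a rough sketch of that estimate. The helper name `estimated_threads` and its `work_per_thread` argument are made up for illustration; this is not the actual code inside lsexp_mat3, only the division described above.

```julia
# Hypothetical helper: allow one thread per `work_per_thread` elements.
estimated_threads(n_elements, work_per_thread) = n_elements / work_per_thread

default_estimate = estimated_threads(1000^2, 3636)     # roughly 275
limited_estimate = estimated_threads(1000^2, 200_000)  # 5.0

println("default: ", round(Int, default_estimate), " threads")
println("with threads=200_000: ", round(Int, limited_estimate), " threads")
```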

Thanks for the explanation about broadcasting. So the problem is particular to trivial dimensions which have stride 1, regardless of the other strides involved?

@avx ones(10,10)' .* rand(10,1)  # no problem
@avx ones(10,10)' .* rand(10,1)' # no problem
@avx ones(10,10)' .* rand(1,10)  # problem
@avx ones(10,10)' .* rand(1,10)' # problem
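One way to check that guess is to print the size and strides of each right-hand operand using only Base Julia. This is just a sketch: the helper `adjoint_strides` is my own name, and it assumes the adjoint of a real dense matrix simply swaps the parent's strides.

```julia
# Hypothetical helper: strides of A' taken as the parent's strides reversed
# (assumes a real, dense Matrix).
adjoint_strides(A::Matrix{<:Real}) = reverse(strides(A))

a = rand(10, 1)   # the "no problem" operand
b = rand(1, 10)   # the "problem" operand

cases = [
    ("rand(10,1)",  size(a),  strides(a)),
    ("rand(10,1)'", size(a'), adjoint_strides(a)),
    ("rand(1,10)",  size(b),  strides(b)),
    ("rand(1,10)'", size(b'), adjoint_strides(b)),
]

for (name, sz, st) in cases
    # A trivial dimension has size 1; look up the stride along it.
    trivial = findfirst(==(1), sz)
    println(rpad(name, 12), " size = ", sz, "  strides = ", st,
            "  stride of size-1 dim = ", st[trivial])
end
```

In the two problem cases the size-1 dimension has stride 1, while in the other two it has stride 10, which matches the pattern I am asking about.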