Disclaimer: this is getting very much off topic and should maybe be split off into a separate thread.
I am actually surprised that Strided manages to speed up such a simple operation by involving more threads. I would have thought that this operation would be mostly memory bound.
Anyway, there are two settings in which Strided can speed up array operations:
- many computations per iteration, by using multithreading
- memory-unfriendly access patterns (permutedims and friends), even without threads
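
To make the second point a bit more concrete, here is a minimal sketch in plain Julia of why the access pattern alone can matter for something like a transpose. This is not Strided's actual code: `transpose_naive!`, `transpose_blocked!` and the `blocksize` keyword are hypothetical names made up for this illustration, and the block size of 32 is an arbitrary assumption rather than a tuned value.

```julia
# Naive out-of-place transpose. With Julia's column-major layout, B is written
# contiguously (j is the inner loop), but A is read with a stride of size(A, 1).
function transpose_naive!(B::Matrix{T}, A::Matrix{T}) where {T}
    size(B) == reverse(size(A)) || throw(DimensionMismatch("B must have the transposed size of A"))
    m, n = size(A)
    for i in 1:m, j in 1:n
        B[j, i] = A[i, j]
    end
    return B
end

# Blocked variant (hypothetical sketch): work on blocksize x blocksize tiles so
# that the tile being read from A and the tile being written to B both stay in
# cache while they are in use.
function transpose_blocked!(B::Matrix{T}, A::Matrix{T}; blocksize::Int = 32) where {T}
    size(B) == reverse(size(A)) || throw(DimensionMismatch("B must have the transposed size of A"))
    m, n = size(A)
    for jb in 1:blocksize:n, ib in 1:blocksize:m
        for j in jb:min(jb + blocksize - 1, n), i in ib:min(ib + blocksize - 1, m)
            B[j, i] = A[i, j]
        end
    end
    return B
end

A = rand(300, 200)
B1 = transpose_naive!(Matrix{Float64}(undef, 200, 300), A)
B2 = transpose_blocked!(Matrix{Float64}(undef, 200, 300), A)
```

Both functions compute the same result; only the order in which memory is touched differs, and no threads are involved. Whether the blocked version is actually faster depends on the matrix size and the cache hierarchy, so it should be benchmarked rather than assumed.
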
However, the microkernel in Strided is not very well optimized, and would certainly benefit from the work of @elrod on exploiting vectorization to yield even further speedups. Unfortunately, I don't think I can just plug a simple `@avx` decoration into the current kernel, because at that point the operation has already been dissected into manually incrementing (linear) indices with the appropriate strides. I am also not very familiar with the low-level vector instructions and how to call them from within Julia.
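
To show what I mean by manually incrementing linear indices, here is a simplified sketch of that style of kernel for a two-dimensional index space. It is an illustration under my own assumptions, not the kernel as it exists in Strided: the function name `strided_map_kernel!` and its argument layout (parent vector, offset, strides) are hypothetical.

```julia
# Hypothetical kernel: applies f elementwise over a 2D index space of size dims.
# Each array is described by its parent vector, an offset, and two strides.
# Note that the loop bodies never index with the loop variables themselves;
# they only bump linear indices by the appropriate stride.
function strided_map_kernel!(f, dst::Vector, dstoffset::Int, dststrides::NTuple{2,Int},
                             src::Vector, srcoffset::Int, srcstrides::NTuple{2,Int},
                             dims::NTuple{2,Int})
    Idst_outer = dstoffset + 1   # linear index at the start of the current outer iteration
    Isrc_outer = srcoffset + 1
    for _ in 1:dims[2]
        Idst = Idst_outer
        Isrc = Isrc_outer
        for _ in 1:dims[1]
            dst[Idst] = f(src[Isrc])
            Idst += dststrides[1]   # step along the first dimension
            Isrc += srcstrides[1]
        end
        Idst_outer += dststrides[2] # step along the second dimension
        Isrc_outer += srcstrides[2]
    end
    return dst
end

# Usage: B[j, i] = 2 * A[i, j], i.e. a scaled transpose of a 3x4 matrix.
# B is 4x3 with strides (1, 4); A is traversed in permuted order, so its
# strides (1, 3) are passed as (3, 1).
A = collect(reshape(1.0:12.0, 3, 4))
B = zeros(4, 3)
strided_map_kernel!(x -> 2x, vec(B), 0, (1, 4), vec(A), 0, (3, 1), (4, 3))
```

The point of the sketch is that by the time the innermost loop runs, the array structure is gone and only stride arithmetic on linear indices is left, which is why simply putting a macro in front of that loop does not look like an option to me.
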
Maybe some collaboration could be fruitful, @elrod?