I have recently registered a package called Strided.jl that can speed up certain tasks with dense arrays. Its primary use case is combining arrays whose memory layout is strided, but whose strides are not monotonically increasing (e.g. the transpose or adjoint of an array), in a map or broadcast operation.
The main functionality of the package is obtained via the macro @strided, which acts as a binding contract that all arrays in the ensuing expression have a strided memory layout. It then yields lazy views and lazy permutedims, and overloads most of broadcasting, map(reduce), etc. to obtain a speedup by using an efficient blocking strategy to loop through the different arrays.
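For a first impression, here is a minimal usage sketch (the array names and sizes are just for illustration):

using Strided

A = randn(1000, 1000)
B = randn(1000, 1000)
C = similar(A)

# B' is kept as a lazy strided view rather than materialized, and the
# broadcast is evaluated with the package's blocking strategy:
@strided C .= A .* B'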
Since these implementations are also multithreaded (using Threads.@threads), there might even be a speedup if all your arrays are just plain Arrays with monotonically increasing strides, provided you have Threads.nthreads() > 1.
See the examples and the README for further information. Beware: there may well still be bugs!
That is not up to me; it’s not how this was intended. But if the core developers consider this useful, then I certainly want to contribute.
There is some overhead, which might be significant for small arrays. That said, it’s not hard to benchmark a particular use case with and without @strided in front, as in the sketch below.
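A minimal sketch of such a comparison (the array size is arbitrary; small arrays are where the overhead is most likely to show):

using Strided
using BenchmarkTools

A = randn(32, 32);
B = similar(A);
@btime $B .= ($A .* $A);            # plain broadcast
@btime @strided $B .= ($A .* $A);   # same operation through Strided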
Warning: there do seem to be a few bugs, e.g. the result of a simple reduction not being trustworthy. I’ll have to look into this a bit further. I have mostly been using it for map! and related operations (axpy!, permutedims!, …), but not for reductions.
Strided v0.1.2 has just been tagged (and merged into METADATA), which fixes the bug with certain (map)reductions. Note, though, that Strided.jl currently does not offer any benefits (in particular, no multithreading) for complete reductions, i.e. (map)reduce without a dims keyword, or with the default value dims=:. I will consider adding this soon, but it requires a somewhat different implementation.
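To illustrate the distinction (a hedged sketch; it assumes reductions wrapped in @strided dispatch as described above):

using Strided

A = randn(1000, 1000)

# Partial reduction (explicit dims): uses the blocked, multithreaded path
@strided sum(A; dims = 1)

# Complete reduction (no dims, i.e. dims = :): currently no benefit
@strided sum(A)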
This is really great, thanks! I notice I can get speedups of several times even for things like simple pointwise multiplication (which I might have thought was memory-limited), e.g. with 8 threads on a NERSC Cori Haswell node:
using Strided
using BenchmarkTools
A = randn(256,256);
B = similar(A);
@btime $B .= ($A .* $A);
# 40.658 μs (0 allocations: 0 bytes)
@btime @strided $B .= ($A .* $A);
# 6.216 μs (26 allocations: 3.50 KiB)
I’m wondering if you could describe briefly what is happening under the hood that makes this possible? I was under the impression Base broadcasting also tried to do smart things with threads (although I’m ignorant of what exactly that is tbh). What is Strided doing beyond that?
Yes, Strided.jl uses/exploits the built-in multithreading capabilities of Julia. Basically, given an array problem, say a broadcast or map involving one or more arrays, it first divides the problem into a number of disconnected subproblems/blocks equal to the number of threads (based on some heuristic as to which dimensions to slice up), and then uses a simple @threads for loop over these subproblems/blocks. Normally, reduction dimensions are not sliced, as this would require temporary storage to collect the result. The only exception is a full reduction: in that case the required temporary is small (a vector of length equal to the number of threads), and there would otherwise be no way to exploit more than one thread.
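To make the slicing idea concrete, here is a strongly simplified sketch (not Strided’s actual implementation, which chooses dimensions heuristically and blocks for cache efficiency): split the last dimension of a matrix into nthreads() contiguous column blocks and hand each block to a thread.

using Base.Threads

# Simplified illustration of the blocking + @threads pattern:
function threaded_map!(f, dest::AbstractMatrix, src::AbstractMatrix)
    n = size(dest, 2)
    # Block boundaries along the sliced (last) dimension
    bounds = round.(Int, range(0, stop = n, length = nthreads() + 1))
    @threads for b in 1:nthreads()
        @inbounds for j in (bounds[b] + 1):bounds[b + 1], i in axes(dest, 1)
            dest[i, j] = f(src[i, j])
        end
    end
    return dest
end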
To see any effect, one needs to start Julia with more than one thread. As far as I know, the only way to do this is to set the environment variable JULIA_NUM_THREADS=... before Julia starts. I think Juno also sets this variable to a number larger than one (presumably equal to the number of cores).
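For example (the value 4 is arbitrary):

# Start Julia as e.g. JULIA_NUM_THREADS=4 julia, then verify:
using Base.Threads
nthreads()  # should be larger than 1 for Strided to parallelize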
Furthermore, very simple operations such as a plain copy! or a simple addition of two arrays will likely not see any speedup, as they are indeed bandwidth-limited. But if more complex operations are involved, there can be some speedup (though rarely by the full number of threads).
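As a rough way to see that distinction for yourself (no particular timings implied; results depend on hardware and thread count):

using Strided
using BenchmarkTools

A = randn(4000, 4000);
B = similar(A);

# Bandwidth-limited: a plain copy gains little from extra threads
@btime @strided $B .= $A;

# More compute per element: more room for a multithreaded speedup
@btime @strided $B .= 3 .* $A .^ 2 .+ sin.($A);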