ANN: Strided.jl, a high-level dense array acceleration package

Dear Julia users,

I have recently registered a package called Strided.jl that can speed up certain tasks with dense arrays. Its primary use case is combining arrays whose memory layout is strided but whose strides are not monotonically increasing, e.g. the transpose/adjoint of an array, in e.g. a map or broadcast operation.

The main functionality of the package is accessed through the macro @strided, which acts as a contract that all arrays in the ensuing expression have a strided memory layout. It then yields lazy views and lazy permutedims, and overloads most of broadcasting, map(reduce), etc., to obtain a speedup by using an efficient blocking strategy to loop through the different arrays.
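For example, a minimal sketch of the intended use (the exact speedup depends on the machine and thread count; the variable names here are illustrative, not taken from the README):

```julia
using Strided

A = randn(1000, 1000)
B = similar(A)

# Broadcasting over a lazy adjoint: without @strided, traversing A'
# in column-major order is cache-unfriendly; @strided blocks the loops
# so that both A and A' are visited in a cache-friendly way.
@strided B .= A .* A'
```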

Since these implementations are also multithreaded (using Threads.@threads), there might even be a speedup if all your arrays are plain Arrays with monotonically increasing strides, provided you have Threads.nthreads() > 1.

See the examples and README for further information. Beware that bugs are certainly possible!

20 Likes

Looks great!

1. Is this meant to prototype ideas for future inclusion in Base Julia (without macros) at some point, or is inclusion fundamentally impossible?

2. Is this interesting only for large arrays, or is there also a benefit for small to mid-sized arrays?

  1. That is not up to me; it is not how this was intended, but if the core developers consider it useful, then I certainly want to contribute.

  2. There is some overhead, which might be significant for small arrays. That said, it is not hard to benchmark a particular use case with and without @strided in front.

Warning: there certainly seem to be a few bugs, e.g. the result of a simple reduction not being trustworthy. I’ll have to look into this a bit further. I have mostly been using the package for map! and related operations (axpy!, permutedims!, …), but not for reductions.

Strided v0.1.2 has just been tagged (and merged into METADATA), which fixes the bug with certain (map)reductions. Note, though, that Strided.jl currently does not offer any benefits (in particular, no multithreading) for complete reductions, i.e. (map)reduce without a dims keyword, or with the default value dims=:. I will consider adding this soon, but it requires a somewhat different implementation.
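To illustrate the distinction (an illustrative sketch; the variable names are not from the package docs): a partial reduction specifies dims and can use the blocked, threaded kernels, while a full reduction currently takes the plain fallback path:

```julia
using Strided

A = randn(1000, 1000)

# Partial reduction: dims is given, so Strided.jl can slice the
# non-reduced dimension into blocks and use multiple threads.
s_partial = @strided sum(A; dims = 1)

# Full reduction (no dims, i.e. dims = :): currently no benefit
# from Strided.jl, so there is no point in using the macro here.
s_full = sum(A)
```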

2 Likes

This is really great, thanks! I notice I can get speedups of several times even for things like simple pointwise multiplication (which I might have thought was memory-limited), e.g. with 8 threads on a NERSC Cori Haswell node:

using Strided
using BenchmarkTools
A = randn(256,256);
B = similar(A);
@btime $B .= ($A .* $A);
#  40.658 μs (0 allocations: 0 bytes)
@btime @strided $B .= ($A .* $A);
#  6.216 μs (26 allocations: 3.50 KiB)

I’m wondering if you could describe briefly what is happening under the hood that makes this possible? I was under the impression Base broadcasting also tried to do smart things with threads (although I’m ignorant of what exactly that is tbh). What is Strided doing beyond that?

5 Likes

As of yet, broadcasting does not use threads.

The relevant issue for discussion is here.

1 Like

Beautiful contribution! Looking forward to more speedups in the language! :slight_smile:

Does the performance depend a lot on the architecture? I don’t get any speed up:

julia> using Strided                                                      
julia> using BenchmarkTools                                                
julia> A = randn(256,256);                                                 
julia> B = similar(A);                                                     
julia> @btime $B .= ($A .* $A);                                                      
  21.333 μs (0 allocations: 0 bytes)                              
julia> @btime @strided $B .= ($A .* $A);                                             
  21.797 μs (16 allocations: 720 bytes)              

for

julia> versioninfo()                                                                 
Julia Version 1.1.0                                                                  
Commit 80516ca202 (2019-01-21 21:24 UTC)                                             
Platform Info:                                                                       
  OS: Windows (x86_64-w64-mingw32)                                                   
  CPU: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz                                      
  WORD_SIZE: 64                                                                      
  LIBM: libopenlibm                                                                  
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)   

@PetrKryslUCSD, I think you need to be using more than one thread to see a speedup here.

I don’t know; it says at the top that Strided is not multithreaded…?

I get no speedup using @strided with 1 thread.
I get 3x speedup using @strided with 6 threads.

Doesn’t the OP say that it is threaded?

Yes, Strided.jl uses/exploits the built-in multithreading capabilities of Julia. Basically, given an array problem, say broadcasting or map with one or more arrays involved, it will first divide the problem into a number of disconnected subproblems/blocks (based on a heuristic for which dimensions to slice up), equal to the number of threads, and then use a simple @threads for loop over these subproblems/blocks. Normally, reduction dimensions are not sliced, as this would require temporary storage to collect the result. The only exception is a full reduction, since in that case the required temporary is small (a vector of length equal to the number of threads) and there would otherwise be no way to exploit more than one thread.
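The blocking + @threads strategy described above can be sketched roughly as follows (a simplified illustration, not Strided.jl’s actual implementation, which chooses block sizes and slicing dimensions from cache-aware heuristics):

```julia
using Base.Threads

# Simplified sketch: slice the last dimension into nthreads() contiguous
# blocks and broadcast into each block on its own thread.
function threaded_square!(B::Matrix{Float64}, A::Matrix{Float64})
    n = size(A, 2)
    nt = nthreads()
    @threads for k in 1:nt
        # block of columns handled by thread k; blocks tile 1:n exactly
        cols = (1 + (k - 1) * n ÷ nt):(k * n ÷ nt)
        @views B[:, cols] .= A[:, cols] .* A[:, cols]
    end
    return B
end
```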

To see any effect, one needs to start Julia with more than one thread. The only way I know of to do this is to set the environment variable JULIA_NUM_THREADS=... before Julia starts. I think Juno also sets this variable to a number larger than one (presumably equal to the number of cores).
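Concretely, on a Unix-like shell (the variable must be in the environment before Julia launches; changing it from within a running session has no effect):

```shell
# Start Julia with 4 threads; verify inside the REPL with Threads.nthreads().
export JULIA_NUM_THREADS=4
echo "JULIA_NUM_THREADS=$JULIA_NUM_THREADS"
# then launch julia as usual; Threads.nthreads() should report 4
```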

Furthermore, very simple operations such as a plain copy! or a simple addition of two arrays will likely not see any speedup, as they are indeed bandwidth-limited. But if more complex operations are involved, there could be some speedup (though rarely equal to the number of threads).

3 Likes

Where does it say that? If you read this somewhere, this must have been a typo?

My apologies, I misread.

The computation with @strided hangs with more than one thread (Windows 10, Julia 1.3.0-DEV263). Has anyone else seen this?

Disappointing performance with WSL on Windows 10:

julia> using Strided
julia> using BenchmarkTools
julia> A = randn(256,256);
julia> B = similar(A);
julia> @btime $B .= ($A .* $A);
  22.000 μs (0 allocations: 0 bytes)
julia> @btime @strided $B .= ($A .* $A);
  32.000 μs (34 allocations: 2.66 KiB)
julia> versioninfo()
Julia Version 1.3.0-DEV.348
Commit b0086f21b1 (2019-06-03 20:19 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 2

Excellent, works very well with Windows 10 itself:

julia> using Strided
julia> using BenchmarkTools
julia> A = randn(256,256);
julia> B = similar(A);
julia> @btime $B .= ($A .* $A);
  21.797 μs (0 allocations: 0 bytes)
julia> @btime @strided $B .= ($A .* $A);
  12.985 μs (20 allocations: 1.17 KiB)
julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Users\PetrKrysl\AppData\Local\atom\app-1.37.0\atom.exe"  -a
  JULIA_NUM_THREADS = 2

On a “real” Linux machine ;), also very nice:

julia> using Strided
julia> using BenchmarkTools
julia> A = randn(4*256,4*256);
julia> B = similar(A);
julia> @btime $B .= ($A .* $A);
  2.735 ms (0 allocations: 0 bytes)
julia> @btime @strided $B .= ($A .* $A);
  73.374 μs (30 allocations: 11.28 KiB)
julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Opteron(tm) Processor 6380
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, bdver1)
Environment:
  JULIA_NUM_THREADS = 64