I have recently registered a package called Strided.jl that can speed up certain tasks with dense arrays. Its primary use case is combining arrays whose memory layout is strided, but whose strides are not monotonically increasing (e.g. the transpose or adjoint of an array), in a map or broadcast operation.
The main functionality of the package is obtained via the macro @strided, which acts as a binding contract that all arrays in the ensuing expression have a strided memory layout. It then yields lazy views and lazy permutedims, and overloads most of broadcasting, map(reduce), etc. to obtain a speedup by using an efficient blocking strategy to loop through the different arrays.
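For a first impression, here is a minimal usage sketch (the array names and sizes are just for illustration):

using Strided

A = randn(1000, 1000)
B = randn(1000, 1000)
C = similar(A)

# B' is kept as a lazy strided view rather than materialized, and the
# broadcast is evaluated with the package's blocking strategy:
@strided C .= A .* B'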
Since these implementations are also multithreaded (using Threads.@threads), there might even be a speedup if all your arrays are just plain Arrays with monotonically increasing strides, provided you have Threads.nthreads() > 1.
See the examples and the README for further information. Beware: there may well still be bugs!
That is not up to me; it’s not how this was intended. But if the core developers consider this useful, then I certainly want to contribute.
There is some overhead, which might be significant for small arrays. That said, it’s not hard to benchmark a particular use case with and without @strided in front, as in the sketch below.
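A minimal sketch of such a comparison (the array size is arbitrary; small arrays are where the overhead is most likely to show):

using Strided
using BenchmarkTools

A = randn(32, 32);
B = similar(A);
@btime $B .= ($A .* $A);            # plain broadcast
@btime @strided $B .= ($A .* $A);   # same operation through Strided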
Warning: there do seem to be a few bugs, e.g. the result of a simple reduction not being trustworthy. I’ll have to look into this a bit further. I have mostly been using it for map! and related operations (axpy!, permutedims!, …), but not for reductions.
Strided v0.1.2 has just been tagged (and merged into METADATA), which fixes the bug with certain (map)reductions. Note, though, that Strided.jl currently does not offer any benefits (in particular, no multithreading) for complete reductions, i.e. (map)reduce without a dims keyword, or with the default value dims=:. I will consider adding this soon, but it requires a somewhat different implementation.
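To illustrate the distinction (a hedged sketch; it assumes reductions wrapped in @strided dispatch as described above):

using Strided

A = randn(1000, 1000)

# Partial reduction (explicit dims): uses the blocked, multithreaded path
@strided sum(A; dims = 1)

# Complete reduction (no dims, i.e. dims = :): currently no benefit
@strided sum(A)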
This is really great, thanks! I notice I can get speedups of several times even for things like simple pointwise multiplication (which I might have thought was memory-limited), e.g. with 8 threads on a NERSC Cori Haswell node:
using Strided
using BenchmarkTools
A = randn(256,256);
B = similar(A);
@btime $B .= ($A .* $A);
# 40.658 μs (0 allocations: 0 bytes)
@btime @strided $B .= ($A .* $A);
# 6.216 μs (26 allocations: 3.50 KiB)
I’m wondering if you could describe briefly what is happening under the hood that makes this possible? I was under the impression Base broadcasting also tried to do smart things with threads (although I’m ignorant of what exactly that is tbh). What is Strided doing beyond that?
Yes, Strided.jl uses/exploits the built-in multithreading capabilities of Julia. Basically, given an array problem, say a broadcast or map involving one or more arrays, it first divides the problem into a number of disconnected subproblems/blocks equal to the number of threads (based on some heuristic as to which dimensions to slice up), and then uses a simple @threads for loop over these subproblems/blocks. Normally, reduction dimensions are not sliced, as this would require temporary storage to collect the result. The only exception is a full reduction: in that case the required temporary is small (a vector of length equal to the number of threads), and there would otherwise be no way to exploit more than one thread.
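To make the slicing idea concrete, here is a strongly simplified sketch (not Strided’s actual implementation, which chooses dimensions heuristically and blocks for cache efficiency): split the last dimension of a matrix into nthreads() contiguous column blocks and hand each block to a thread.

using Base.Threads

# Simplified illustration of the blocking + @threads pattern:
function threaded_map!(f, dest::AbstractMatrix, src::AbstractMatrix)
    n = size(dest, 2)
    # Block boundaries along the sliced (last) dimension
    bounds = round.(Int, range(0, stop = n, length = nthreads() + 1))
    @threads for b in 1:nthreads()
        @inbounds for j in (bounds[b] + 1):bounds[b + 1], i in axes(dest, 1)
            dest[i, j] = f(src[i, j])
        end
    end
    return dest
end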
To see any effect, one needs to start Julia with more than one thread. As far as I know, the only way to do this is to set the environment variable JULIA_NUM_THREADS=... before Julia starts. I think Juno also sets this variable to a number larger than one (presumably equal to the number of cores).
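For example (the value 4 is arbitrary):

# Start Julia as e.g. JULIA_NUM_THREADS=4 julia, then verify:
using Base.Threads
nthreads()  # should be larger than 1 for Strided to parallelize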
Furthermore, very simple operations such as a plain copy! or a simple addition of two arrays will likely not see any speedup, as they are indeed bandwidth-limited. But if more complex operations are involved, there can be some speedup (though rarely by the full number of threads).
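As a rough way to see that distinction for yourself (no particular timings implied; results depend on hardware and thread count):

using Strided
using BenchmarkTools

A = randn(4000, 4000);
B = similar(A);

# Bandwidth-limited: a plain copy gains little from extra threads
@btime @strided $B .= $A;

# More compute per element: more room for a multithreaded speedup
@btime @strided $B .= 3 .* $A .^ 2 .+ sin.($A);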