[ANN] ThreadsX.jl: Parallelized Base functions

tkf · March 31, 2020, 8:08am

I guess you are assuming that the outermost axis is much longer than Threads.nthreads() and the workload per element is more or less constant? I suppose that’s reasonable when the user is asking to vectorize the loop. But if you want to parallelize all but the innermost axis, it may be useful to use the halve function from SplittablesBase.jl (which can handle IndexCartesian-style arrays and zips of them). This is how I support something like ThreadsX.foreach(f, A, A'). I think this approach is flexible enough to handle “non-rectangular” iterations like the upper-triangular part of a matrix.
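To make the halving idea concrete, here is a minimal sketch that uses only Base. `halve_sketch` and `foreach_dac` are hypothetical names, and `halve_sketch` is just a stand-in for SplittablesBase.halve that handles plain vectors; it is not how ThreadsX.jl is actually implemented. The upper-triangular indices are collected into a vector up front, which a real splittable iterator would presumably avoid.

```julia
using Base.Threads: @spawn

# Hypothetical stand-in for SplittablesBase.halve: split a vector into two views.
function halve_sketch(xs::AbstractVector)
    mid = firstindex(xs) + length(xs) ÷ 2 - 1
    return @view(xs[firstindex(xs):mid]), @view(xs[mid+1:lastindex(xs)])
end

# Divide-and-conquer foreach: keep halving until a piece is small enough,
# run one half in a new task and the other half in the current task.
function foreach_dac(f, xs::AbstractVector; basesize::Integer = 4)
    basesize >= 1 || throw(ArgumentError("basesize must be at least 1"))
    if length(xs) <= basesize
        foreach(f, xs)
        return nothing
    end
    left, right = halve_sketch(xs)
    task = @spawn foreach_dac(f, right; basesize = basesize)
    foreach_dac(f, left; basesize = basesize)
    wait(task)
    return nothing
end

# "Non-rectangular" iteration: the upper-triangular part of a matrix.
n = 5
A = zeros(Int, n, n)
upper = [(i, j) for j in 1:n for i in 1:j]
foreach_dac(upper) do (i, j)
    A[i, j] = i + j   # each task writes to distinct elements, so no data race
end
```

Because the splitting is recursive, the shape of the iteration space does not matter as long as something like halve can cut it into two roughly equal pieces.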

It’s interesting, as I thought reduction would be the easiest part since there is no mutation (with mutation, it sounds hard to chunk the iteration space appropriately to avoid simultaneous writes to a shared region). Though maybe there is something subtle when mixing with vectorization? Naively, I’d imagine it’d be implemented as a transformation to mapreduce:

1. Separate the loop body into the mapping ((m, n) -> x[m] * A[m,n] * y[n]) and reduction (+) parts and generate functions for them.
2. Determine the unroll factor and SIMD vector width.
3. Feed those functions and parameters to a parallelized and vectorized mapreduce.
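A rough hand-written version of that transformation might look like the sketch below. The names `body`, `block_sum` and `threaded_dot3` are made up for illustration. The unroll factor and vector width are not chosen explicitly here; the sketch simply leaves that to the compiler via `@simd`, so it only approximates the second step. It assumes plain 1-based arrays with matching sizes.

```julia
using Base.Threads: @spawn, nthreads

# Mapping part of the loop body.
body(x, A, y, m, n) = x[m] * A[m, n] * y[n]

# Serial reduction (+) over a block of columns. @simd allows the compiler
# to reorder the additions and vectorize the inner loop.
function block_sum(x, A, y, cols, ::Type{T}) where {T}
    acc = zero(T)
    for n in cols
        @simd for m in axes(A, 1)
            acc += body(x, A, y, m, n)
        end
    end
    return acc
end

# Parallel mapreduce over the outer axis: one task per chunk of columns,
# then combine the partial sums.
function threaded_dot3(x, A, y; ntasks::Integer = nthreads())
    size(A) == (length(x), length(y)) || throw(DimensionMismatch("size(A) must be (length(x), length(y))"))
    T = promote_type(eltype(x), eltype(A), eltype(y))
    cols = axes(A, 2)
    chunksize = max(1, cld(length(cols), ntasks))
    tasks = map(Iterators.partition(cols, chunksize)) do c
        @spawn block_sum(x, A, y, c, T)
    end
    return sum(fetch, tasks; init = zero(T))
end

x = rand(100); A = rand(100, 200); y = rand(200)
threaded_dot3(x, A, y)   # same value as sum(x .* A .* y') up to floating-point rounding
```

Since every task only reads x, A and y and writes to its own accumulator, no chunking tricks are needed to avoid races, which is why the reduction case looks like the easy one.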
