tkf
February 21, 2020, 10:39pm
For the moment, you may be able to do something like:
```
function foo(n)
    out = Array{Float64}(undef, n)
    # One set of scratch buffers per thread, allocated up front:
    abufs = [Array{Float64}(undef, n, n) for _ in 1:Threads.nthreads()]
    bbufs = [Array{Float64}(undef, n) for _ in 1:Threads.nthreads()]
    Threads.@threads for i = 1:n
        a = abufs[Threads.threadid()]
        b = bbufs[Threads.threadid()]
        # ... use `a` and `b` as scratch space; write the result into `out[i]` ...
    end
    return out
end
```
although this depends on internal details of how tasks are scheduled (so it may not be safe in the future, and there are use cases where it is already unsafe). See also: Task affinity to threads
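If you want something that does not rely on a task staying on the same thread, you can instead spawn one task per chunk of the iteration space and let each task own its buffers. A rough sketch (the chunking scheme and the name `foo_chunked` are just for illustration; the loop body is elided as above):
```
using Base.Threads: @spawn

function foo_chunked(n)
    out = Array{Float64}(undef, n)
    # One task (and one set of buffers) per chunk of the iteration space.
    chunks = Iterators.partition(1:n, cld(n, Threads.nthreads()))
    tasks = map(chunks) do chunk
        @spawn begin
            a = Array{Float64}(undef, n, n)  # buffers owned by this task only
            b = Array{Float64}(undef, n)
            for i in chunk
                # ... use `a` and `b` as scratch; write the result into `out[i]` ...
            end
        end
    end
    foreach(wait, tasks)
    return out
end
```
Since no two tasks share a buffer, this stays correct even if the scheduler migrates tasks between threads.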
FYI, note also that BLAS and Julia threading do not play very well together ATM:
GitHub issue (opened 4 Aug 2019, closed 30 Jan 2022; labels: linear algebra, multithreading):
Here are some notes from digging into the openblas codebase (with @stevengj) to enable partr threading support.
1. [`exec_blas`](https://github.com/xianyi/OpenBLAS/blob/96a794e9fd9fdc2b03a01b3dabd0a10006d0aa98/driver/others/blas_server_omp.c#L308) is called by all the routines. The code pattern followed is setting up the work queue and calling `exec_blas` to do all the work through an [openmp pragma](https://github.com/xianyi/OpenBLAS/blob/96a794e9fd9fdc2b03a01b3dabd0a10006d0aa98/driver/others/blas_server_omp.c#L338).
2. The exception is lapack routines, which also use the `exec_blas_async` functions.
3. The openmp backend doesn’t seem to implement the async variants, and thus I believe it will not multi-thread the lapack calls.
4. [Windows](https://github.com/xianyi/OpenBLAS/blob/develop/driver/others/blas_server_win32.c) has its own threading backend
The easiest way may be to modify the openmp threading backend, which seems amenable to something like the [fftw partr backend](https://github.com/JuliaMath/FFTW.jl/pull/105). To start with, we should ignore lapack threading. We could probably just implement an `exec_blas_async` fallback that calls `exec_blas` (and make `exec_blas_async_wait` a no-op).
All of this should work on windows too, although going through the openmp build route may need some work on the makefiles.
The [patch to FFTW](https://github.com/JuliaMath/FFTWBuilder/pull/1) should be indicative of something similar to be done for the openblas build.
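Until something like that lands, a common stopgap (my suggestion, not from the issue) is to keep BLAS single-threaded while your own threaded region runs, so the two schedulers don't oversubscribe the cores. A minimal sketch:
```
using LinearAlgebra

BLAS.set_num_threads(1)  # let Julia's threads own the cores
# ... run the Threads.@threads / @spawn workload here ...
BLAS.set_num_threads(Sys.CPU_THREADS)  # restore for BLAS-heavy serial code
```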
A related scaling report (GitHub issue opened 14 Dec 2019, closed 29 Sep 2021; labels: parallelism, needs more info):
The work is proportional to the number of elements. For a mesh of 128,000 elements, both a serial and a 1-thread simulation carry out the computational work in 2.5 seconds. For a mesh of 1,024,000 elements, both carry out the computational work in around 20.0 seconds. So: eight times more work, eight times longer.
Now comes the weird part. When I use 2 threads, so that each thread works on 512,000 elements, the work per thread should take 10 seconds. However, the work procedure reports that it consumes around 16.5 seconds. When I use 4 threads, each thread works on 256,000 elements, and consequently the work procedure should execute in 5 seconds; it actually reports roughly 15.6 seconds. With 8 threads, each thread works on 128,000 elements, and the work procedure should take only 2.5 seconds; it reports roughly 14 seconds.
The threaded execution therefore looks like this:

| Number of elements per thread | Number of threads | Execution time (s) |
|---|---|---|
| 1,024,000 | 1 | 20 |
| 512,000 | 2 | 16.5 |
| 256,000 | 4 | 15.6 |
| 128,000 | 8 | 14 |
The weird thing is that I time the interior of the work procedure, so that should exclude any overhead associated with threading. However, as you can see, the number of threads actually affects how much time the work procedure spends doing the work. The total amount of time spent farming the work out to the threads is very small, and the total amount of time spent collecting the data with `wait` is pretty much equal to the time reported by the work procedure, as if the overhead related to threading were very small.
The whole thing can be exercised by
```
git clone https://github.com/PetrKryslUCSD/FinEtoolsDeforNonlinear.jl
```
followed by
```
cd FinEtoolsDeforNonlinear.jl
export JULIA_NUM_THREADS=8
julia
```
and
```
include("threaded_test.jl")
```
I'm sorry I don't have a more minimal working example!
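For what it's worth, the measurement pattern described there (timing only the interior of the work procedure, per thread) can be mimicked in a standalone snippet along these lines; `work!` here is a hypothetical stand-in for the real kernel:
```
using Base.Threads: @threads, threadid, nthreads

# Hypothetical stand-in for the real per-element kernel.
work!(out, i) = (out[i] = sum(sin, 1:1000))

function timed_run(n)
    out = Vector{Float64}(undef, n)
    elapsed = zeros(nthreads())
    @threads for i in 1:n
        t0 = time_ns()
        work!(out, i)
        # Time only the interior of the work procedure, as in the issue;
        # each thread accumulates into its own slot.
        elapsed[threadid()] += (time_ns() - t0) / 1e9
    end
    return out, elapsed
end
```
If the per-thread totals in `elapsed` grow with the thread count even though each thread does proportionally less work, the slowdown is inside the kernel itself (e.g. memory-bandwidth contention or allocation/GC pressure), not in the scheduling.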