How to create private arrays for each thread before a loop?

pedrohnv · February 21, 2020, 6:25pm

I have a loop that does some linear algebra. The iterations are independent of one another, and each one needs some helper arrays. How can I make each thread allocate its helper arrays once, before the loop?

In the example below, I would like each thread to allocate its own copy of a and b a single time. How can that be done?

function foo(n)
    out = Array{Float64}(undef, n)
    Threads.@threads for i = 1:n
        a = Array{Float64}(undef, n, n)
        b = Array{Float64}(undef, n)
        for j = 1:n
            for k = 1:n
                a[k,j] = j^k
            end
            b[j] = i * j
        end
        out[i] = (a\b)[1]
    end
    return out
end

PetrKryslUCSD · February 21, 2020, 6:41pm

I allocate buffers outside of the loop. See cantilever_dyn_examples.jl in PetrKryslUCSD/FinEtoolsDeforNonlinear.jl on GitHub (at commit 9f9704e78420f4325f6ad6bd431894dc4df574cf).

tkf · February 21, 2020, 10:39pm

For the moment you may be able to do something like

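function foo(n)
    out = Array{Float64}(undef, n)
    # One pair of buffers per thread, allocated once before the loop
    abufs = [Array{Float64}(undef, n, n) for _ in 1:Threads.nthreads()]
    bbufs = [Array{Float64}(undef, n) for _ in 1:Threads.nthreads()]
    Threads.@threads for i = 1:n
        a = abufs[Threads.threadid()]
        b = bbufs[Threads.threadid()]
        # ... fill a and b, then set out[i], as in the original loop body ...
    end
    return out
end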

although this depends on internal details of how tasks are scheduled (so it may not be safe in the future, and there are use cases where it is already unsafe). See also the topic "Task affinity to threads".
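One way to avoid relying on `Threads.threadid()` is to split the iterations into chunks and let each task allocate its own buffers. The following is only a sketch of that idea, not something proposed in this thread: the name `foo_chunked` and the `ntasks` keyword are made up for illustration, and it assumes Julia 1.3 or later for `Threads.@spawn`.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical variant of `foo` from the question: each task allocates its
# helper arrays once and reuses them for every iteration in its chunk.
function foo_chunked(n; ntasks = nthreads())
    out = Array{Float64}(undef, n)
    # Split 1:n into contiguous chunks, roughly one per task.
    chunks = Iterators.partition(1:n, max(1, cld(n, ntasks)))
    tasks = map(chunks) do chunk
        @spawn begin
            # Allocated once per task, not once per iteration.
            a = Array{Float64}(undef, n, n)
            b = Array{Float64}(undef, n)
            for i in chunk
                for j = 1:n
                    for k = 1:n
                        a[k, j] = j^k
                    end
                    b[j] = i * j
                end
                out[i] = (a \ b)[1]
            end
        end
    end
    foreach(wait, tasks)  # rethrows if any task failed
    return out
end
```

Because the buffers are local variables of each task, it does not matter which thread the scheduler runs a task on. Note that `a \ b` still allocates internally, so this only removes the allocation of the helper arrays themselves.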

FYI, note also that BLAS and Julia threading do not play well together at the moment:

github.com/JuliaLang/julia

partr thread support for openblas

opened 07:29PM - 04 Aug 19 UTC

closed 01:26AM - 30 Jan 22 UTC

ViralBShah

linear algebra multithreading

Here are some notes from digging into the openblas codebase (with @stevengj) to …enable partr threading support.

1. [`exec_blas`](https://github.com/xianyi/OpenBLAS/blob/96a794e9fd9fdc2b03a01b3dabd0a10006d0aa98/driver/others/blas_server_omp.c#L308) is called by all the routines. The code pattern followed is setting up the work queue and calling `exec_blas` to do all the work through an [openmp pragma](https://github.com/xianyi/OpenBLAS/blob/96a794e9fd9fdc2b03a01b3dabd0a10006d0aa98/driver/others/blas_server_omp.c#L338).
2. The exception is lapack routines, which also use the `exec_blas_async` functions.
3. The openmp backend doesn’t seem to implement the async and thus I believe that it will not multi-thread the lapack calls.
4. [Windows](https://github.com/xianyi/OpenBLAS/blob/develop/driver/others/blas_server_win32.c) has its own threading backend

The easiest way may be to modify the openmp threading backend, which seems amenable to something like the [fftw partr backend](https://github.com/JuliaMath/FFTW.jl/pull/105).

To start with, we should ignore lapack threading. We could probably just implement an `exec_blas_async` fallback that calls `exec_blas` (and make `exec_blas_async_wait` a no-op).

All of this should work on windows too, although the going through the openmp build route may need some work on the makefiles. The [patch to FFTW](https://github.com/JuliaMath/FFTWBuilder/pull/1) should be indicative of something similar to be done for the openblas build.

github.com/JuliaLang/julia

Does the number of threads affect the executed code?

opened 02:45PM - 14 Dec 19 UTC

closed 08:25PM - 29 Sep 21 UTC

PetrKryslUCSD

parallelism needs more info

The work is proportional to the number of elements. For a mesh of 128000 elemen…ts both a serial and 1-thread simulation carry out the computational work in 2.5 seconds. For a mesh of 1024000 elements both a serial and 1-thread simulation carry out the computational work in around 20.0 seconds. So, eight times more work, eight times longer.

Now comes the weird part. When I use 2 threads, so that each thread works on 512000 elements, the amount of work per thread is 10 seconds. However the work procedure shows that it consumes around 16.5 seconds. When I use 4 threads, each thread works on 256,000 elements, and consequently the work procedure should execute in 5 seconds. However, the work procedure actually shows that it consumes roughly 15.6 seconds. With 8 threads, each thread works on 128,000 elements, and the work procedure should only take 2.5 seconds. However, it reports to take roughly 14 seconds.

The threaded execution therefore looks like this:

| Number of elements | Number of threads | Execution time per thread (s) |
|---|---|---|
| 1024000 | 1 | 20 |
| 512000 | 2 | 16.5 |
| 256000 | 4 | 15.6 |
| 128000 | 8 | 14 |

The weird thing is I time the interior of the work procedure. So that should exclude any overhead associated with threading. However, as you can see the number of threads actually affects how much time the work procedure spends doing the work. The total amount of time farming out the work to the threads is very small. The total amount of time collecting the data with `wait` pretty much is equal to the amount of time reported by the work procedure. As if the overhead related to threading was very small.

The whole thing can be exercised by

```
git clone https://github.com/PetrKryslUCSD/FinEtoolsDeforNonlinear.jl
```

followed by

```
cd FinEtoolsDeforNonlinear.jl
export JULIA_NUM_THREADS=8
julia
```

and

```
include("threaded_test.jl")
```

I'm sorry I don't have a more minimal working example!
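Coming back to the BLAS point: a workaround that is often used, though nobody suggested it in this thread, is to restrict BLAS to a single thread while Julia threads provide the parallelism. Below is a minimal sketch of that idea. The helper name `with_single_blas_thread` is hypothetical, and it assumes Julia 1.6 or later, where `BLAS.get_num_threads` is available.

```julia
using LinearAlgebra

# Hypothetical helper: run `f()` with BLAS limited to one thread,
# then restore whatever setting was active before.
function with_single_blas_thread(f)
    old = BLAS.get_num_threads()  # assumes Julia 1.6 or later
    BLAS.set_num_threads(1)
    try
        return f()
    finally
        BLAS.set_num_threads(old)
    end
end
```

It could then wrap a threaded function, for example `with_single_blas_thread(() -> foo(100))`. Whether this helps depends on the problem size, so it is worth timing both settings.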
