Parallelization of simple loop: reductions, thread-private variables?



I’m looking for a way to parallelize a simple loop I have. I don’t want to get into the complexity of distributed computing, but I’d like something similar to OpenMP. My code structure is basically like this:

arr = randn(n)
temp = pre_allocate_stuff()
s = 0.0
for i = 1:n
    for j = 1:m
        temp[j] = computation(arr[i], j)
    end
    s += reduction(temp)
end

which I would do in OpenMP using a reduction on s and declaring temp as firstprivate. Reading the docs, it seems my best bet is Threads.@threads (still experimental, although it’s been in the language for a while now), but it doesn’t seem to support fancy parallel constructs like reductions or thread-private variables. Do I need to do these by hand? Is this something that is in the pipeline and I just need to wait for it?


I don’t know much about parallel processing, but you could try the @parallel macro. It allows you to specify a reduction.

arr = randn(n)
s = @parallel (+) for i = 1:n
    temp = [computation(arr[i], j) for j = 1:m]
    reduction(temp)
end

I’m confused about that bit: @parallel uses processes, not threads, correct? I was under the impression that threads are better suited for the kind of fine-grained parallelism I want.


Yes and yes. This probably gives all of what you’re looking for:

If not, I’m sure a separate issue for reductions in multithreading would be very appreciated. For now the easiest thing to do is to just not multithread the reduction part if it’s not too expensive. As for thread-private variables, I think that happens naturally in a Threads.@threads loop?
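For example, here is a minimal sketch of the loop from the original post along those lines: one temp buffer and one partial sum per chunk (the thread-private part), with the reduction itself done serially at the end. `sin(arr[i] * j)` and `sum` stand in for the poster’s `computation` and `reduction`:

```julia
using Base.Threads

n, m = 1000, 8
arr = randn(n)
nt = nthreads()

# one "firstprivate"-style buffer and one partial sum per chunk
temps = [zeros(m) for _ in 1:nt]
psums = zeros(nt)

# chunk the outer loop by hand so each chunk owns its own buffer
@threads for t = 1:nt
    temp = temps[t]
    for i = t:nt:n                       # strided chunk of 1:n
        for j = 1:m
            temp[j] = sin(arr[i] * j)    # stand-in for computation(arr[i], j)
        end
        psums[t] += sum(temp)            # stand-in for reduction(temp)
    end
end

s = sum(psums)   # the reduction itself stays serial
```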


For thread-private variables, the issue in the snippet above is that I want to pre-allocate temp and fill it inside the loop. I could do without, or handle threads manually; the point was to see if there was a simple modification to my code that would enable multithreading. I’ll just wait for the dust to settle; good to see that this is under active development.


If you do want to do it by hand, here is an example:

function tsum(a)
    nt = nthreads()
    n = length(a)
    nd, nr = divrem(n, nt)
    psums = zeros(eltype(a), nt)
    @threads for t = 1:nt
        id = threadid()
        i0 = (id - 1) * nd   # start offset of this thread's chunk
        s = zero(eltype(a))
        @inbounds for ii = 1:nd
            s += sin(a[i0 + ii])
        end
        psums[id] = s
    end
    s = sum(psums)
    # handle the leftover elements that didn't divide evenly among the threads
    if nr > 0
        s += sum(sin.(view(a, nd*nt+1:n)))
    end
    return s
end


Has anything moved here? It seems a shame that something as basic as a sum cannot be automatically multi-threaded.

After looking at this tsum example, I also noticed a limitation of generators. Suppose that in the tsum above a is not an array but a generator; then this doesn’t seem to work at all, because one cannot index into it or jump ahead. On the other hand, most generators that I write could actually be indexed into.

To give just a very naive example:

G = ( (x[n] - x[n-1]) * (y[n] + y[n-1]) for n = 2:length(x))
trapz = 0.5 * sum(G)

I can pass G to sum, but I cannot pass it to something like tsum.

This seems a bit artificial, but suppose the object constructed at each step of the generator is expensive to compute and to store.


@parallel and tsum (above) statically partition the input iterator into “chunks”. Each process/thread works on one chunk, so the ability to index into the iterator is required.

But maybe you can use a mapreduce approach instead.

Here’s a very basic skeleton to demonstrate:

using Base.Threads

function threaded_mapreduce(f, op, v0, itr)
    results = fill(v0, nthreads())
    @threads for x in itr
        tid = threadid()
        @inbounds results[tid] = op(results[tid], f(x))
    end
    # reduce the results of each thread
    result = v0
    for x in results
        result = op(result, x)
    end
    return result
end

Test it out.

println("Num threads = ", nthreads())

n = 10000
const x = rand(n);
const y = rand(n);

G = ((x[i] - x[i-1]) * (y[i] + y[i-1]) for i = 2:n)
f(i) = (x[i] - x[i-1]) * (y[i] + y[i-1])

mapreduce(f, +, 2:n)
threaded_mapreduce(f, +, 0.0, 2:n)

Note that threading has some overhead, so you might not see a speed-up unless function f is relatively expensive.
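One way to reduce that overhead is to chunk the iterator by hand and accumulate into a local variable, touching shared memory only once per chunk instead of looking up threadid() for every element. A rough sketch (assuming itr is indexable; chunked_mapreduce is just an illustrative name):

```julia
using Base.Threads

function chunked_mapreduce(f, op, v0, itr)
    nt = nthreads()
    chunks = [itr[k:nt:end] for k = 1:nt]   # strided chunks of the iterator
    partials = fill(v0, nt)
    @threads for k = 1:nt
        acc = v0                  # local accumulator: no shared writes in the hot loop
        for x in chunks[k]
            acc = op(acc, f(x))
        end
        partials[k] = acc         # a single shared write per chunk
    end
    # serial reduction over the per-chunk results
    result = v0
    for p in partials
        result = op(result, p)
    end
    return result
end
```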


@greg_plowman thank you for the suggestion. I guess the question here is how the cost of threadid() compares to the cost of evaluating f(x). I’ll play with this, thank you.

CC @gideonsimpson


Building on @greg_plowman’s response, the version below is closer to a dynamically scheduled threaded reduction operator:

function tmapreduce(f::Function, op, v0, itr)
    @assert length(itr) > 0
    mutex = Mutex()
    output = deepcopy(v0)                 # make a deepcopy of the starting value
    poppable_itr = Vector(deepcopy(itr))  # convert to a Vector to be able to pop!
    @threads for i in eachindex(itr)
        # pop the next input argument under the lock
        lock(mutex)
        x = pop!(poppable_itr)
        unlock(mutex)
        loop_output = f(x)
        # fold the result into the shared accumulator, again under the lock
        lock(mutex)
        output = op(output, loop_output)
        unlock(mutex)
    end
    return output
end

It pops an input argument from the poppable iterator, and passes that to the function. A Mutex is used to maintain thread safety.

For 4 threads, and the following test functions:

function f(i::Int)
    i <= nthreads()^2 && Libc.systemsleep(1.0)
    return 1
end

function g(i::Int)
    i % nthreads() == 0 && Libc.systemsleep(1.0)
    return 1
end

just check my logic:

julia> sum([i <= nthreads()^2 for i in 1:nthreads()^3])
16

julia> sum([i % nthreads() == 0 for i in 1:nthreads()^3])
16

which means that iterating over 1:nthreads()^3 should take 16 seconds for both f and g.

A cursory glance at the timings shows:

julia> @time tmapreduce(f, +, 0, 1:nthreads()^3)
  4.012927 seconds (86 allocations: 4.016 KiB)
julia> @time tmapreduce(g, +, 0, 1:nthreads()^3)
  5.023034 seconds (86 allocations: 4.016 KiB)
julia> @time threaded_mapreduce(f, +, 0, 1:nthreads()^3)
 16.037183 seconds (7 allocations: 336 bytes)
julia> @time threaded_mapreduce(g, +, 0, 1:nthreads()^3)
  4.007422 seconds (7 allocations: 336 bytes)

But sometimes tmapreduce doesn’t work as intended:

julia> @time tmapreduce(f, +, 0, 1:nthreads()^3)
  16.030326 seconds (86 allocations: 4.016 KiB)

tmapreduce isn’t perfect by any means, but sometimes avoids the worst case scenario.