Distributing loops across threads manually (something like OpenMP)

Hi, I’m quite familiar with the workings of OpenMP and understand parallel programming concepts through the fork-join model. I’m just starting out with Julia and was hoping to understand its multithreading better, but I could not find any good resources to help me out. I couldn’t even find a way to manually distribute work across threads (creating parallel regions, declaring private variables, etc.). I tried the @threads macro but I’m not getting any speedups. I was testing this on a very simple problem: adding a scalar to each element of a matrix.

using Base.Threads

function colAccess!(A::AbstractMatrix)
    for i in axes(A, 2)        # outer loop over columns
        for j in axes(A, 1)    # inner loop over rows: column-major friendly
            A[j, i] += 1
        end
    end
    return A
end

function parallelAccess!(A::AbstractMatrix)
    @threads for i in axes(A, 2)   # columns are split across threads
        for j in axes(A, 1)
            A[j, i] += 1
        end
    end
    nothing
end

I’m getting some memory allocations when running the parallel code, which I think I might be able to avoid if I could distribute the work manually.

I hope someone can help me out here.

Thanks

Something like that may be memory-bound. Do you get any speedup on the same problem using OpenMP?

(Remember to start Julia with -t N, of course.)

You may be able to avoid allocations and get better performance for these fast operations using another threading scheme, such as @batch from Polyester.jl or @tturbo from LoopVectorization.jl.
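For instance, a minimal sketch of the same loop with @batch, assuming Polyester.jl is installed (batchAccess! is just an illustrative name; @tturbo from LoopVectorization.jl is used as a loop prefix in the same way):

using Polyester

function batchAccess!(A::AbstractMatrix)
    # @batch schedules the loop on a pool of lightweight, reusable tasks,
    # which typically avoids the per-call allocations seen with @threads
    @batch for i in axes(A, 2)
        for j in axes(A, 1)
            A[j, i] += 1
        end
    end
    return A
end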

I haven’t tried running this with OpenMP, but what I’m really interested in is how to implement fine-grained parallelism in Julia.

I’m just trying to understand the concepts; that’s why I’m implementing all this from scratch. For real-world applications I would certainly use libraries, but right now I want to understand the workings of multithreading in Julia.

Hi, maybe you find this helpful:

Parallel Computing: https://docs.julialang.org/en/v1/manual/parallel-computing/

Basic Threading Examples in JuliaLang v1.3; Jameson Nash, Jeff Bezanson, and Kiran Pamnany: https://proceedings.juliacon.org/papers/10.21105/jcon.00054

Announcing composable multi-threaded parallelism in Julia; Jeff Bezanson (Julia Computing), Jameson Nash (Julia Computing), and Kiran Pamnany (Intel); 23 July 2019: https://julialang.org/blog/2019/07/multithreading/

Notes on Multithreading with Julia; Eric Aubanel; June 2020 (revised August 2020): http://www.cs.unb.ca/~aubanel/JuliaMultithreadingNotes.html

A quick introduction to data parallelism in Julia, and Data-parallel programming in Julia; Takafumi Arakaki: https://juliafolds.github.io/data-parallelism/tutorials/quick-introduction/ and https://juliafolds.github.io/data-parallelism/


Thanks, I read a few of the articles, but it looks like multithreading in Julia isn’t as powerful (or as evolved) as something like OpenMP. I hope it gets better someday, but until then I’ll have to do most of my work in C.

I was really hoping to have control over things like thread-private variables, but from what I can see there’s no way to do that.

I think this will trigger (on Monday :wink:) some interesting discussions. I’m certainly not the most qualified here, but, for example, you can use a pattern like

@threads for it in 1:nthreads()
    s = 0
    # do something with s
end

The variable s is local to the scope of the loop iteration, and thus visible only to the thread on which that iteration runs. I’m not sure if this is the kind of pattern you get with private variables in OpenMP.
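To make that concrete, here is a minimal sketch of the pattern (threadlocal_sum and the strided work split are my own illustrative choices): each chunk gets its own accumulator, which plays the role of an OpenMP private variable, and a shared array indexed by the chunk id collects the partial results:

using Base.Threads

function threadlocal_sum(x)
    partial = zeros(eltype(x), nthreads())   # one slot per chunk, no sharing
    @threads for it in 1:nthreads()
        s = zero(eltype(x))                  # "private" accumulator for this chunk
        for k in it:nthreads():length(x)     # manual strided work distribution
            s += x[k]
        end
        partial[it] = s                      # publish this chunk's result
    end
    return sum(partial)                      # sequential reduction of the partials
end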

Try to do something more substantial than merely accessing the elements of the array; otherwise the overhead of multithreading overshadows any speedup:

% julia -t4 -q
julia> using Base.Threads, BenchmarkTools

julia> function colAccess!(A::AbstractMatrix)
           for i in axes(A, 2)
               for j in axes(A, 1)
                   A[j,i] = sin(A[j,i])
               end
           end
           return A
       end
colAccess! (generic function with 1 method)

julia> function parallelAccess!(A::AbstractMatrix)
           @threads for i in axes(A, 2)
               for j in axes(A, 1)
                   A[j,i] = sin(A[j,i])
               end
           end
           return A
       end
parallelAccess! (generic function with 1 method)

julia> @btime colAccess!(A) setup=(A=rand(100, 100));
  66.189 μs (0 allocations: 0 bytes)

julia> @btime parallelAccess!(A) setup=(A=rand(100, 100));
  24.701 μs (21 allocations: 1.86 KiB)

Maybe it’s just me, but I think having structures like OpenMP’s would be intuitive and easy to work with. In Julia I can’t even find a way to implement something like

#pragma omp critical
{
    // statement 1
    // statement 2
}

I found functions like Threads.atomic_add!, but that doesn’t address the general need for a critical section inside a parallel region.

In fact, not having a way to create a parallel region using some macro is quite disappointing (if something like that exists, please let me know, because I can’t find anything).

There are locks: Multi-Threading · The Julia Language (https://docs.julialang.org/en/v1/manual/multi-threading/)
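For instance, a ReentrantLock gives you the moral equivalent of #pragma omp critical. A minimal sketch (locked_sum is just an illustrative name; in real code a per-thread partial sum would be much faster than locking every iteration):

using Base.Threads

function locked_sum(x)
    total = Ref(0.0)
    lk = ReentrantLock()
    @threads for i in eachindex(x)
        lock(lk) do
            # critical section: only one thread executes this at a time
            total[] += x[i]
        end
    end
    return total[]
end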

A good idea is to provide a more realistic example from your application. Even if you find out that what you want is hard to do in Julia, the threads here are usually very instructive.

For worker-local variables, you can use @init from FLoops.jl. See, e.g., Efficient and safe approaches to mutation in data parallelism
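A minimal sketch of @init, assuming FLoops.jl is installed (colnorms! and the scratch buffer are purely illustrative):

using FLoops

function colnorms!(out, A)
    @floop for j in axes(A, 2)
        # worker-local scratch, initialized once per worker and reused
        # across its iterations, much like an OpenMP private variable
        @init buf = zeros(eltype(A), size(A, 1))
        buf .= @view A[:, j]
        out[j] = sqrt(sum(abs2, buf))
    end
    return out
end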

For a critical section, why not just use a lock? For low-level synchronization specific to parallel loops, you can also use a barrier: https://github.com/JuliaConcurrent/SyncBarriers.jl

If you don’t want to use macros for some reason, everything that can be done in a parallel for loop can also be constructed with a parallel reduce implementation like Folds.reduce. Transducers.jl makes it easy to write such parallel programs using just functions.
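For example, a parallel sum of a transformed range needs no threading macro at all, assuming Folds.jl is installed:

using Folds

# The range is split across tasks and the partial results are combined
# with +; semantically the same as sum(sin, 1:1_000_000), but threaded.
total = Folds.reduce(+, (sin(k) for k in 1:1_000_000))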


Would you have any suggestions with regard to AlphaZero? I know the notes are not perfect; however, I have tried my best to provide as much and as precise information as I could at the time. Here is the link to the topic titled “Questions on parallelization”: Questions on parallelization · Issue #71 · jonathan-laurent/AlphaZero.jl · GitHub

Hi, I am very sorry for the delay in replying. I am (usually) on European time. I am afraid I am not really in a position to add anything to what is written in the papers I listed above.

Off topic: BTW, since you are into OpenMP, I am just wondering, are you perhaps familiar with the Global Address SPace toolbox (https://github.com/kpamnany/gasp)? Do you know whether AlphaZero computations are irregular?

Hi, sorry, I’m not familiar with the Global Address SPace toolbox.

Thanks for the info.

I found the paper “Dtree: Dynamic Task Scheduling at Petascale”; Kiran Pamnany, Sanchit Misra, Vasimuddin Md., Xing Liu, Edmond Chow, and Srinivas Aluru: http://www.cc.gatech.edu/~echow/pubs/dtree.pdf

I am particularly interested in your opinion on running AI software, particularly AlphaZero, with it. Are there any advantages to be expected over Julia’s standard parallel capabilities? If you have any comments, please let me know.