@parallel constructs

proposal
performance
parallel
distributed

#1

Hope this is the appropriate place to put this, if not please feel free to move it.
I am wondering if Julia would accept a PR for additional @parallel macros optimized for Lower (and Upper) Triangular Matrices.

The benefit is: loops finish in less wall time (in the limit, twice as soon).

The cost is another function in Base and a couple of new macros for what someone who knows what they are doing could replicate with user code and @spawn.

My working macro name is @parallelLT, but I am not attached to it and would welcome a better (shorter) name, especially if there are existing names for fairly allocating triangular domains that I am unaware of.

You can test drive the @parallelLT macro by building my fork at:
https://github.com/TomConlin/julia

A crude example of the wrong way to time code (in global scope):

julia> @time (@sync @parallel for i in 1:99999 for j in i:100000  z=i^j; end end)
138.581066 seconds (91.34 k allocations: 4.648 MiB)
4-element Array{Future,1}:
 Future(2, 1, 129, Nullable{Any}())
 Future(3, 1, 130, Nullable{Any}())
 Future(4, 1, 131, Nullable{Any}())
 Future(5, 1, 132, Nullable{Any}())

julia> @time (@sync @parallelLT for i in 1:99999 for j in i:100000  z=i^j; end end)
 89.096365 seconds (171.68 k allocations: 8.759 MiB, 0.01% gc time)
4-element Array{Future,1}:
 Future(2, 1, 137, Nullable{Any}())
 Future(3, 1, 138, Nullable{Any}())
 Future(4, 1, 139, Nullable{Any}())
 Future(5, 1, 140, Nullable{Any}())


#2

IANAD, but would you mind giving some examples of what @parallelLT is supposed to do better? Is the point that, when splitting up only the outer "for i" loop, you get a pretty uneven size distribution of chunks?


#3

I am wondering if that is something that would be better handled in DistributedArrays.jl, where you could use dispatch on the type to make this decision.

I am not a huge fan of the @parallel macro, and there are some discussions about what it should mean (https://github.com/JuliaLang/julia/issues/19578), so I would be hesitant to add a special-cased version of it right now.


#4

The use case that precipitated this is all-vs-all comparison of genomic features.

So tens of thousands of feature vectors are compared. Because the relation is symmetric, having checked a~b you need not check b~a, and you need not check a~a either, so the number of comparisons is (N^2 - N)/2. That still ends up being ~1.5 billion comparisons between pairs of vectors several hundred features in length. Shaving a third or more off the time to complete by appending 'LT' to @parallel is a non-trivial win.
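A quick sanity check on that scale (N = 55,000 is my own illustrative pick for "tens of thousands", not a figure stated in the thread):

```python
import math

# Symmetric, irreflexive pairwise comparisons of N items:
# choose(N, 2) == (N^2 - N) / 2 unordered pairs.
N = 55_000
pairs = math.comb(N, 2)
print(f"{pairs:,}")  # 1,512,472,500 -- about 1.5 billion
```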

In general, the use case is whenever you have a non-uniform workload with respect to your independent variable: if the work varies high to low, then LT should do better; if it varies low to high, then UT should do better. In both cases they will do best with a workload that varies linearly with the independent variable.

The independent variable is the index produced by the for loop, i.e. the i in
for i = 1:N ... end
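The fair-allocation idea can be sketched as follows (again in Python for illustration; this is my own minimal sketch of the partitioning scheme, not the actual @parallelLT implementation): instead of giving each worker an equal number of rows, give each a contiguous block of rows whose total inner-loop work is roughly equal.

```python
def triangular_chunks(N, workers):
    """Split rows 1..N, where row i costs (N - i + 1) inner iterations,
    into `workers` contiguous blocks of approximately equal total work."""
    total = N * (N + 1) // 2          # total inner iterations
    target = total / workers          # ideal work per worker
    chunks, lo, acc = [], 1, 0
    for i in range(1, N + 1):
        acc += N - i + 1
        # close a chunk each time cumulative work crosses the next boundary
        if acc >= target * (len(chunks) + 1) and len(chunks) < workers - 1:
            chunks.append((lo, i))
            lo = i + 1
    chunks.append((lo, N))
    return chunks

print(triangular_chunks(100, 4))
# [(1, 14), (15, 30), (31, 51), (52, 100)]
```

Note how the last worker gets nearly half the rows but the same amount of work, since its rows are the cheapest.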


#5

I can see there is some consternation. My not-well-considered 2 cents, from helping with graduate classes in parallel programming, is that a parallel for loop, in whatever flavor, is the GOTO construct for people learning to think beyond serial processing (haha).

I hope Julia does not lose that first easy step from "my serial program works!" to "my parallel program works!".

As for dispatching on type, I am not sure that would work.
True, the results I choose to keep may end up in a Shared or Distributed array,
but that has nothing to do with the loop's partitioned index, which may be ranging over a much larger static structure.