Avoiding false sharing in a parallel loop without collecting

I have a function with a parallel loop that could probably benefit from eliminating false sharing.

function compute_per_point!(t, x, y, z)   # placeholder name; real setup omitted
    per_point_result_array = ...          # preinitialized
    @tasks for c in eachindex(gridcells)
        surr = ...                        # finds the points of the surrounding cells
        value = calculation(t[surr], x[surr], y[surr], z[surr])
        per_point_result_array[surr] .= value
    end
end
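
To be clear about the concern: it is the usual false-sharing pattern, several tasks writing to neighbouring elements of the same array so that they keep pulling the same cache lines back and forth between cores. A toy illustration of just that pattern (nothing to do with the code above, Base threads only):

using Base.Threads: @threads, nthreads

# No data race: each iteration only ever writes "its own" slot. But the slots
# are adjacent in memory and share cache lines, so the updates still contend.
counters = zeros(Int, nthreads())
@threads for i in 1:nthreads()
    for _ in 1:10^7
        counters[i] += 1
    end
end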

I am trying to implement task-local values as recommended in the OhMyThreads.jl documentation, but my case does not lend itself to collecting: I just need to fill in the full series of results per point. Is there a typical pattern for this as well?
I am not sure I am on the right track:

function compute_per_point!(t, x, y, z)
    per_point_result_array = ...                            # preinitialized
    per_task_result_array = [Float32[] for _ in 1:nthreads()]
    per_task_index_array  = [Int64[] for _ in 1:nthreads()]
    @tasks for c in eachindex(gridcells)                    # keep this, or do the chunk splitting beforehand?
        @set chunks = nthreads()
        # use @local for values and indices here?
        surr = ...                                          # finds the points of the surrounding cells
        value = calculation(t[surr], x[surr], y[surr], z[surr])
        values  = vcat(values, value)
        indices = vcat(indices, surr)
        # I can't seem to avoid having to write to the per-task arrays on every iteration,
        # instead of once per task when its batch is done:
        per_task_result_array[thread] = values              # and how do I get the task id here?
    end
    for i in 1:nthreads()
        per_point_result_array[per_task_index_array[i]] .= per_task_result_array[i]
    end
end
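
Spelled out a bit more concretely, I think what I am after is something like the sketch below: each task fills its own (indices, values) buffers for a whole chunk of cells, and the shared array is only written in a serial pass at the end. Here the result array is passed in, the chunks are built with Base's Iterators.partition and mapped over with tmap, and findsurr / calculation / gridcells stand in for the real pieces above. I just don't know whether this is the idiomatic OhMyThreads pattern or whether I am overcomplicating it:

using OhMyThreads: tmap
using Base.Threads: nthreads

function compute_per_point_chunked!(per_point_result_array, t, x, y, z)
    cells_per_chunk = cld(length(gridcells), nthreads())
    cell_chunks = collect(Iterators.partition(eachindex(gridcells), cells_per_chunk))

    # each task builds its own buffers for one chunk of cells
    parts = tmap(cell_chunks) do cells
        indices = Int[]
        values  = Float32[]
        for c in cells
            surr  = findsurr(c, gridt, gridc)        # the real "surrounding points" lookup
            value = calculation(t[surr], x[surr], y[surr], z[surr])
            append!(indices, surr)
            append!(values, (value for _ in surr))   # the same value for every point of the cell
        end
        (indices, values)
    end

    # serial write-back, once per chunk, so the tasks never write to the shared array
    for (indices, values) in parts
        per_point_result_array[indices] .= values
    end
    return per_point_result_array
end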

Or should I think along the lines of:

surrs = tcollect(findsurr(c, gridt, gridc) for c in gridcells)
bundle = [t[surrs], x[surrs], y[surrs], z[surrs]]
tmapreduce(calculation_b, vcat, bundle)

?
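
Spelled out, I imagine that direction would mean mapping each cell to its index => value pairs and letting the reduction concatenate them before a serial write-back, something like the following (same placeholders as above, and the vcat reduction presumably allocates quite a bit):

using OhMyThreads: tcollect, tmapreduce

surrs = tcollect(findsurr(c, gridt, gridc) for c in gridcells)   # neighbour lists, collected in parallel
point_value_pairs = tmapreduce(vcat, eachindex(gridcells)) do c
    surr  = surrs[c]
    value = calculation(t[surr], x[surr], y[surr], z[surr])
    [i => value for i in surr]                                   # one pair per written point
end

for (i, value) in point_value_pairs                              # serial write-back
    per_point_result_array[i] = value
end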

Eventually this tcollect with tmapreduce also worked, as did @set collect = true returning arrays of (value, surr) tuples and then doing the indexed writes into per_point_result_array outside the loop; either of these would get around the false sharing.
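
For reference, the collect = true variant looked roughly like this (again with the real lookup and calculation hidden behind the same placeholder names):

using OhMyThreads: @tasks, @set

results = @tasks for c in eachindex(gridcells)
    @set collect = true
    surr  = findsurr(c, gridt, gridc)
    value = calculation(t[surr], x[surr], y[surr], z[surr])
    (value, surr)                                 # collected: one tuple per cell
end

for (value, surr) in results                      # indexed write-back outside the loop
    per_point_result_array[surr] .= value
end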

But since it needed to collect data twice, and then run it through a new loop twice, it ended up being slower and allocating almost twice as much memory. I am now back to a slightly cleaned-up version of the single-loop code, with the same performance as before…

Your example doesn’t contain enough context to say anything. Can you give a small complete example?

I.e. a piece of code that can be run, ideally without extra dependencies, that resembles whatever you actually want to do, and that ideally models the relevant behavior, including the correct ballpark for sizes?

By “model the behavior” I mean: You have some calculate function that is called. So give us a calculate function / model / mock-up with the same API as your real function (same input types, same output types, ideally the same indices read and written).

Don’t forget to copy-paste the example code into a fresh REPL session to verify that it works.

Nobody here can guess what types anything in your code has. Just give us a runnable example/model, and we will figure out the types that appear when running the code.

(Julia is a dynamic language. This means that expressions in code do not have types; only values existing at runtime have types. And thanks to unbounded multiple dispatch, calls without types have no behavior. Asking about the behavior of non-runnable Julia code fragments sucks, don’t do this! You must include an entry point and a runnable mock-up.)
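
For instance, a skeleton along these lines, with every name, type and size below replaced by realistic ones, would already be enough to work with:

# All names, types and sizes here are made up; fill in realistic ones.
using OhMyThreads: @tasks

calculation(t, x, y, z) = Float32(sum(t) + sum(x) + sum(y) + sum(z))   # mock with the real API

function compute_per_point!(result, gridcells, t, x, y, z)
    @tasks for c in eachindex(gridcells)
        surr = gridcells[c]                        # mock "surrounding points" lookup
        result[surr] .= calculation(t[surr], x[surr], y[surr], z[surr])
    end
    return result
end

# entry point, with ballpark sizes
npoints   = 1_000_000
gridcells = [rand(1:npoints, 27) for _ in 1:100_000]
t, x, y, z = (rand(Float32, npoints) for _ in 1:4)
compute_per_point!(zeros(Float32, npoints), gridcells, t, x, y, z)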
