Avoiding false sharing in parallel loop without collecting

I have this function with a parallel loop that could probably benefit from eliminating false sharing.

function (t, x, y, z)
    per_point_result_array = ...  # preinitialized
    @tasks for c in eachindex(gridcells)
        surr = ...  # finds points of surrounding cells
        value = calculation(t[surr], x[surr], y[surr], z[surr])
        per_point_result_array[surr] .= value
    end
end

I am trying to implement task-local values as recommended by OhMyThreads.jl, but my case does not lend itself to collecting: I need to fill in the entire series of results per point. Is there a typical pattern for this, too?
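
For reference, the task-local-value pattern from the OhMyThreads docs looks roughly like this toy sketch (the buffer size and the "calculation" are made up; only the @local idea matters):

using OhMyThreads: @tasks

function demo_local(n)
    out = zeros(Float32, n)
    @tasks for i in 1:n
        @local buf = Vector{Float32}(undef, 16)  # allocated once per task, reused across iterations
        buf .= Float32(i)                        # stand-in for the real calculation
        out[i] = sum(buf)                        # each i is written by exactly one task
    end
    return out
end

demo_local(1_000)
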
I am not sure I am on the right track:

function (t, x, y, z)
    per_point_result_array = ...  # preinitialized
    per_task_result_array = [Float32[] for _ in 1:nthreads()]
    per_task_index_array = [Int64[] for _ in 1:nthreads()]
    @tasks for c in eachindex(gridcells)  # keep this, or do the chunk splitting beforehand?
        @set ntasks = nthreads()
        @local ?  values  indices  # how do I declare these as task-local buffers?
        surr = ...  # finds points of surrounding cells
        value = calculation(t[surr], x[surr], y[surr], z[surr])
        append!(values, value)
        append!(indices, surr)
        # I can't seem to avoid writing to the per-task arrays on every iteration,
        # instead of once per task when its batch is done
        per_task_result_array[thread] = values
    end
    for i in 1:nthreads()
        per_point_result_array[per_task_index_array[i]] .= per_task_result_array[i]
    end
end

Or should I think along the lines of:

surrs = tcollect(findsurr(c, gridt, gridc) for c in gridcells)
bundle = [(t[s], x[s], y[s], z[s]) for s in surrs]
tmapreduce(calculation_b, vcat, bundle)

?

Eventually this tcollect with tmapreduce approach also worked, as did @set collect = true returning arrays of (value, surr) pairs that are then written into per_point_result_array by index outside the loop, which gets around the false sharing.
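
Roughly, the @set collect = true variant looked like the sketch below (mock data; a sum over an index range stands in for the real calculation and neighbour lookup):

using OhMyThreads: @tasks

function demo_collect(vals, cellranges)
    out = similar(vals)
    pairs = @tasks for r in cellranges
        @set collect = true          # the loop then returns the per-iteration values
        (r, sum(vals[r]))            # (indices, value) for one cell
    end
    for (r, v) in pairs              # a single task does all the writes
        out[r] .= v
    end
    return out
end

vals = rand(Float32, 1_000)
cellranges = [i:i+9 for i in 1:10:991]
demo_collect(vals, cellranges)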

But since it needed to collect the data twice, and then run it through a new loop twice, it ended up being slower and allocating almost twice as much memory. I am now back to slightly cleaned-up code with the single loop and the same performance as before…

Your example doesn’t contain enough context to say anything. Can you give a small complete example?

I.e. a piece of code that can be run, ideally without extra dependencies, that resembles whatever you actually want to do, and that ideally models the relevant behavior, including the correct ballpark for sizes?

By “model the behavior” I mean: you have some calculation function that is called. So give us a calculation function / model / mock-up with the same API as your real function (same input types, same output types, ideally the same indices read and written).

Don’t forget to copy-paste the example code into a fresh REPL session to verify that it works.

Nobody here can guess what types anything in your code has. Just give us a runnable example/model, and we will figure out the types that appear when running the code.

(Julia is a dynamic language. This means that expressions in code do not have types; only values existing at runtime have types. And thanks to unbounded multiple dispatch, calls without types have no behavior. Asking about the behavior of non-runnable Julia code fragments sucks, don't do it! You must include an entry point and a runnable mock-up.)
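
For illustration, something along these lines would already be enough (every name, size, and the mock calculation are made-up stand-ins; I keep OhMyThreads since that is what your loop uses):

using OhMyThreads: @tasks

npoints = 500_000
t = rand(Float32, npoints); x = rand(Float32, npoints)
y = rand(Float32, npoints); z = rand(Float32, npoints)

# pretend each "cell" owns a contiguous block of 50 points
cellindices = [i:min(i + 49, npoints) for i in 1:50:npoints]

calculation(t, x, y, z) = sum(t) + sum(x) + sum(y) + sum(z)  # mock kernel

per_point_result_array = zeros(Float32, npoints)
@tasks for c in eachindex(cellindices)
    surr = cellindices[c]  # mock neighbourhood lookup
    value = calculation(t[surr], x[surr], y[surr], z[surr])
    per_point_result_array[surr] .= value
end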


I was asking for general advice on the pattern to be used for this kind of return, where one wants to update the values in vectors defined before the loop.

I found an answer once I realized that reducing with vcat is possible. To do this, I had to make the function take a single argument. But I lost 2x the speed and gained 2x the allocations. I think that means the allocations are my main speed problem (I brought them “down” to 4 GB for vector sizes of about 500,000, mostly Float32).
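
Roughly what that looked like (a sketch with stand-in data; calc_one is a hypothetical single-argument version of the calculation that returns (index, value) pairs for one cell):

using OhMyThreads: tmapreduce

calc_one((surr, vals)) = [(i, 2f0 * vals[i]) for i in surr]  # stand-in for the real per-cell work

vals = rand(Float32, 1_000)
cells = [(i:i+9, vals) for i in 1:10:991]
pairs = tmapreduce(calc_one, vcat, cells)  # one long vector of (index, value) pairs

out = similar(vals)
for (i, v) in pairs
    out[i] = v                             # indexed writing outside the parallel part
end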

I know that further optimization is only possible by sharing the code and data. That is a little difficult to do on the forum. But if you are willing to look, I could upload my project and some data.

I don't think you need to share your code or data; you just need to explain your problem better, and I only wanted to help you do that. At least take this feedback from your question: some people did not understand what you're asking about, because you did not provide enough context or did not find the right words. You repeatedly refer to “this pattern”, and I have no clue what pattern you're talking about, nor do I have any clue what you're attempting to do.

My suggestion for explaining your problem better was: Build a model problem / code that you can share, and that captures the “essence” of the thing you care about. Then we all talk on this forum about the model problem, maybe benchmark, optimize, etc; and you continuously transfer the collective insights from the model problem to your actual problem (that you don’t need to share). For this transfer step, it is important that the model problem really captures the “essence” of your actual problem.

Building minimized model problems is something that is also expected when submitting bug reports. It is a useful skill to learn, and imo a prerequisite for asking strangers for help. But it’s a bit of effort, and not entirely trivial.

If I understand correctly, you have an array of points. For each point, you do some calculation that also requires neighbouring points. The ‘false sharing’ is that there is overlap in the points, i.e. your surr is not disjoint across tasks?

I think the solution is to figure out beforehand how your output array per_point_result_array is partitioned, and then give each task one disjoint chunk of it. That way you don't have to worry about multiple tasks writing to the same indices of per_point_result_array.
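
A Base-only sketch of that idea (hypothetical data; the real code would partition per_point_result_array along cell boundaries instead of into equal blocks):

using Base.Threads: @spawn, nthreads

function demo_disjoint!(out, vals)
    blocksize = cld(length(out), nthreads())
    @sync for idxs in Iterators.partition(eachindex(out), blocksize)
        @spawn for i in idxs
            out[i] = 2 * vals[i]  # stand-in for the real per-point work; each task owns one contiguous block
        end
    end
    return out
end

vals = rand(Float32, 1_000)
demo_disjoint!(similar(vals), vals)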

Correct: surr holds the indices of the points in the cell and the surrounding grid cells. Actually, I discovered a mistake in the translation to my example “code”: there is no actual overlap, because I write only to the indices belonging to the points of the central space-time grid cell.

    @tasks for c in eachindex(gridcells)
        surr = findsurr(uniquecells[c],gridt,gridc,twindow)
        value = calculation(t[surr],x[surr],y[surr],z[surr])
        per_point_result_array[cellindices[c]] .= value
    end

I worried about false sharing because my function follows the “bad” example of false sharing explained in the OhMyThreads.jl docs.

Even though these indices are unique per task, threads handling different cells may want write access to the same 64-byte cache line simultaneously: the original data are typically 32-bit floats sorted by time, so 16 adjacent points share one cache line, if I understand correctly.
https://juliafolds2.github.io/OhMyThreads.jl/stable/literate/falsesharing/falsesharing/
But perhaps false sharing is rare in my case, because I feed the loop “cells” that represent unique combinations of time and spatial cells: Point2.(gridt,gridc[:,1]).
Is this perhaps the “11 lock conflicts” in the @time output?

Is this the false sharing thing you’re worried about? This only matters if cellindices[c] is relatively large, and spread out.

You can avoid it by collecting the (c, value) pairs computed in your tasks, and then having a single task that does

for (c, value) in res
    per_point_result_array[cellindices[c]] .= value
end

Whether that is better or worse depends a lot on details – try it out!
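
For concreteness, one way res might be produced (a hedged sketch with mock data; tmap is from OhMyThreads):

using OhMyThreads: tmap

vals = rand(Float32, 1_000)
cellindices = [i:i+9 for i in 1:10:991]
res = tmap(c -> (c, sum(vals[cellindices[c]])), eachindex(cellindices))

per_point_result_array = similar(vals)
for (c, value) in res
    per_point_result_array[cellindices[c]] .= value
end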

No. False sharing doesn’t directly cause lock conflicts. The lock conflicts are probably benign and inside the scheduler.
