I have a function with a parallel loop that can probably benefit from eliminating false sharing.
function(t,x,y,z)
    per_point_result_array = ... # preinitialized
    @tasks for c in eachindex(gridcells)
        surr = ... # finds points of surrounding cells
        value = calculation(t[surr],x[surr],y[surr],z[surr])
        per_point_result_array[surr] .= value
    end
end
I'm trying to implement task-local values as recommended by OhMyThreads.jl, but my case does not lend itself to collecting; I just need to return the entire series of results per point. Is there a typical pattern for this too?
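(For reference, the task-local pattern in the docs is reduction-shaped, roughly like this sketch of mine, and that is exactly the shape that does not fit my case:)

using OhMyThreads

# reduction-shaped task-local pattern: each task allocates one scratch
# buffer and reuses it across all of its iterations
function sum_with_scratch(xs)
    @tasks for i in eachindex(xs)
        @set reducer = +
        @local buf = Vector{Float32}(undef, 16) # once per task, not per iteration
        buf .= xs[i]  # stand-in for work that needs scratch space
        sum(buf)      # per-iteration value, combined with +
    end
end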
Not sure I am on the right track:
function(t,x,y,z)
    per_point_result_array = ... # preinitialized
    per_task_result_array = [Float32[] for _ in 1:nthreads()]
    per_task_index_array = [Int64[] for _ in 1:nthreads()]
    @tasks for c in eachindex(gridcells) # maintain, or do chunk splitting before?
        @set ntasks = nthreads()
        @local values = Float32[] # is this how @local is meant to be used here?
        @local indices = Int64[]
        surr = ... # finds points of surrounding cells
        value = calculation(t[surr],x[surr],y[surr],z[surr])
        append!(values,value) # was vcat, which discards its result
        append!(indices,surr)
        # I can't seem to avoid having to write to the per-task arrays every iteration,
        # instead of once the batch per task is done
        per_task_result_array[thread] = values # and how do I get a stable task id here?
    end
    for i in 1:nthreads()
        per_point_result_array[per_task_index_array[i]] .= per_task_result_array[i]
    end
end
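Or maybe what I actually want is to chunk the loop myself, so the "once per batch" write becomes explicit? A sketch (findsurr and calculation stand in for my real code, and I assume calculation returns one scalar per cell):

using OhMyThreads

chunksize = cld(length(gridcells), Threads.nthreads())
batches = collect(Iterators.partition(eachindex(gridcells), chunksize))
@tasks for batch in batches
    buf = Tuple{Vector{Int},Float32}[] # task-local: one buffer per batch
    for c in batch
        surr = findsurr(gridcells[c]) # hypothetical signature
        push!(buf, (surr, calculation(t[surr],x[surr],y[surr],z[surr])))
    end
    for (surr, value) in buf # a single write pass per task/batch
        per_point_result_array[surr] .= value
    end
end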
Or should I think along the lines of:
surrs = tcollect(findsurr(c,gridt,gridc) for c in gridcells)
bundle = [t[surrs],x[surrs],y[surrs],z[surrs]]
tmapreduce(calculation_b, vcat, bundle)
?
Eventually this tcollect with tmapreduce also worked, as did a @set collect = true returning arrays of (value,surr) before indexed writing into per_point_result_array outside the loop, which gets around the false sharing.
But since it needed to collect the data twice, and then run it through a new loop twice, it ended up being slower and allocating almost twice as much memory. I am now back at a slightly cleaned-up version of the single-loop code, with the same performance as before…
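(For the record, the @set collect = true variant looked roughly like this; findsurr and calculation stand in for my real code, and calculation returns one scalar per cell:)

using OhMyThreads

# phase 1: compute in parallel; @set collect = true gathers the last
# expression of each iteration into an array, like tmap
res = @tasks for c in eachindex(gridcells)
    @set collect = true
    surr = findsurr(gridcells[c]) # hypothetical signature
    (surr, calculation(t[surr],x[surr],y[surr],z[surr]))
end

# phase 2: a single serial pass does all the writes
for (surr, value) in res
    per_point_result_array[surr] .= value
end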
Your example doesn't contain enough context to say anything. Can you give a small complete example?
I.e. a piece of code that can be run, ideally without extra dependencies, that resembles whatever you actually want to do, and that ideally models the relevant behavior, including the correct ballpark for sizes?
By "model the behavior" I mean: you have some calculation function that is called. So give us a calculation function / model / mock-up with the same API as your real function (same input types, same output types, ideally same indices read and written).
Don't forget to copy-paste the example code into a fresh REPL session to verify that it works.
Nobody here can guess what types anything in your code has. Just give us a runnable example/model, and we will figure out the types that appear when running the code.
(Julia is a dynamic language. This means that expressions in code do not have types; only values existing at runtime have types. And thanks to unbounded multiple dispatch, calls without types have no behavior. Asking about the behavior of non-runnable Julia code fragments sucks, don't do this! You must include an entry point and a runnable mock-up.)
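For instance, a mock-up of roughly this shape would already be plenty (every name, size, and formula below is invented; only the API shape matters):

using OhMyThreads

calculation(t, x, y, z) = sum(t) + sum(x) + sum(y) + sum(z) # made-up model

function mock!(out, t, x, y, z, cellindices)
    @tasks for c in eachindex(cellindices)
        surr = cellindices[c]
        out[surr] .= calculation(t[surr], x[surr], y[surr], z[surr])
    end
    return out
end

n = 500_000                                         # ballpark size
t, x, y, z = (rand(Float32, n) for _ in 1:4)
cellindices = collect(Iterators.partition(1:n, 16)) # disjoint mock cells
out = zeros(Float32, n)
mock!(out, t, x, y, z, cellindices)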
I was asking for general advice on the pattern to use for this kind of return, where one wants to update values in vectors defined before the loop.
I found an answer as I realized that reducing with vcat is possible. To do this, I had to make the function take a single argument. But I lost 2x the speed and gained 2x the allocations. I think that means the allocations are my main speed problem (I brought them "down" to 4 GB for vector sizes of about 500,000, mostly Float32).
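Roughly, the one-argument shape was (calc_one is a hypothetical wrapper around my real code):

using OhMyThreads: tmapreduce

# one-argument wrapper so plain vcat can be the reducer; each call returns
# a one-element Vector holding an (indices, value) pair, and the vcat of
# all those singletons is presumably where the extra allocations come from
function calc_one(c)
    surr = findsurr(gridcells[c]) # hypothetical signature
    [(surr, calculation(t[surr],x[surr],y[surr],z[surr]))]
end
pairs = tmapreduce(calc_one, vcat, eachindex(gridcells))
# ...then one serial pass writes the pairs into per_point_result_array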
I know that further optimization is only possible by sharing the code and data. That is a little difficult to do on the forum. But if you are willing to look, I could upload my project and some data.
I don't think you need to share your code or data; you just need to explain your problem better, and I just wanted to help you explain better. At least take this feedback from your question: some people did not understand what you're asking about, because you did not provide enough context, or did not find the right words. You repeatedly refer to "this pattern", and I have no clue what pattern you're talking about; nor do I have any clue what you're attempting to do.
My suggestion for explaining your problem better was: build a model problem / code that you can share, and that captures the "essence" of the thing you care about. Then we all talk on this forum about the model problem, maybe benchmark, optimize, etc.; and you continuously transfer the collective insights from the model problem to your actual problem (which you don't need to share). For this transfer step, it is important that the model problem really captures the "essence" of your actual problem.
Building minimized model problems is something that is also expected when submitting bug reports. It is a useful skill to learn, and imo a prerequisite for asking strangers for help. But it's a bit of effort, and not entirely trivial.
If I understand correctly, you have an array of points. For each point, you do some calculation that also requires neighbouring points. The "false sharing" is that there is overlap in the points, i.e. your surr is not disjoint across tasks?
I think the solution is to figure out how your output array per_point_result_array is partitioned beforehand, and then provide each task one disjoint chunk of it. That way you don't have to worry about multiple tasks writing to the same index of per_point_result_array.
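A minimal sketch of what I mean (assuming consecutive cells write to consecutive regions of the output, and with made-up names):

using OhMyThreads

function process_chunked!(out, cellindices, calc, ncells)
    n = Threads.nthreads()
    bounds = round.(Int, range(0, ncells; length = n + 1))
    @tasks for i in 1:n                    # one task per disjoint block
        for c in bounds[i]+1:bounds[i+1]
            out[cellindices[c]] .= calc(c) # blocks of `out` never overlap,
        end                                # except perhaps at the boundaries
    end
    return out
end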
Correct. surr are the indices of the points in the cell and its surrounding grid cells. Actually, I discovered a mistake in the translation to my example "code": there is no actual overlap, because I write only to the indices belonging to the points of the central space-time grid cell.
@tasks for c in eachindex(gridcells)
    surr = findsurr(uniquecells[c],gridt,gridc,twindow)
    value = calculation(t[surr],x[surr],y[surr],z[surr])
    per_point_result_array[cellindices[c]] .= value
end
I worried about false sharing because my function follows the "bad" example of false sharing explained in the OhMyThreads.jl docs.
Even though these indices are unique per task, threads treating different cells may want write access to the same 64-byte cache line simultaneously: the original data is typically Float32 points sorted by time, so 16 adjacent points share one cache line, if I understand correctly.
https://juliafolds2.github.io/OhMyThreads.jl/stable/literate/falsesharing/falsesharing/
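(A toy version of that "bad" pattern, to illustrate the worry: every task repeatedly updates its own slot of a tiny shared array, and all of those slots sit on the same 64-byte cache line:)

using Base.Threads: @threads, nthreads

xs = rand(Float32, 2^22)
parts = collect(Iterators.partition(eachindex(xs), cld(length(xs), nthreads())))
rets = zeros(Float32, length(parts)) # adjacent Float32 slots: one cache line
@threads for i in eachindex(parts)
    for j in parts[i]
        rets[i] += xs[j]             # each task hammers the shared line
    end
end
total = sum(rets)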
But perhaps false sharing is rare in my case, because I feed the loop "cells" that represent unique combinations of time and spatial cells: Point2.(gridt,gridc[:,1]).
Is this perhaps the "11 lock conflicts" in the @time output?
Is this the false sharing thing you're worried about? This only matters if cellindices[c] is relatively large, and spread out.
You can avoid it by collecting the (c, value) pairs computed in your tasks, and then having a single task that does
for (c, value) in res
    per_point_result_array[cellindices[c]] .= value
end
Whether that is better or worse depends a lot on the details; try it out!
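For instance, with tmap, reusing the names from your snippet (a sketch, untested):

using OhMyThreads: tmap

# parallel phase: no writes to shared state, just collect (c, value) pairs
res = tmap(eachindex(gridcells)) do c
    surr = findsurr(uniquecells[c],gridt,gridc,twindow)
    (c, calculation(t[surr],x[surr],y[surr],z[surr]))
end
# serial phase: a single task owns all the writes
for (c, value) in res
    per_point_result_array[cellindices[c]] .= value
end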
No. False sharing doesn't directly cause lock conflicts. The lock conflicts are probably benign and inside the scheduler.