I’ve been trying to adjust a parallel loop after some changes, and I need some clarification to make sure I’m not at risk of false sharing.
The loop conceptually works like this:
object # holds a state, which is a tuple of a mix of scalar and vector variables
helper = prepare(n_steps)
output = preallocate(n_steps)
vars = keys(output[1])
state_template = init_state(object.current_state, 1) # creates a tuple with the scalars + ith row of the vector variables
@floop ThreadedEx() for (i, helper_i) in enumerate(helper)
    @init begin
        state = deepcopy(state_template) # thread-local init
    end
    state = init_state(object.current_state, i) # each thread uses vector vars starting at a different row
    compute_state(state, helper_i)
    for var in vars
        output[i][var] = state[var] # threads access different rows
    end
end
I get:
HasBoxedVariableError: Closure ##reducing_function#233#255 (defined in custom_package) has 1 boxed variable: output
I have identical results in single and multi-thread runs; different threads don’t write to the same output row.
Am I causing some false sharing, though? Assuming an output row straddles cache lines, I think two threads might write to the same cache line and cause some thrashing.
If so, how should I adjust my code so that I get no Box error? The examples given in the FLoops documentation only showcase scalar variables. Do I need to somehow preallocate or partition my output per thread and then recombine it, or is there a way to pad it?
Or am I barking up the wrong tree, and there are no potential false sharing issues?
The simplest solution, when possible, could be to declare the type of output, if it’s known concretely upfront:
output::ConcreteType = preallocate(n_steps)
The context here is the notorious “performance of captured variable” issue, which can lead to subtle bugs in cases like these; AFAIK that is why the error was introduced.
Can you find all instances of the output variable being assigned to in your actual code? That is, expressions of the form output = expr? I’m guessing there are more such assignments in your actual code than in the snippet you gave.
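To illustrate why a second assignment matters, here is a minimal sketch. The functions `reused_name` and `distinct_names` are hypothetical stand-ins, not the code from the question, and plain closures are used instead of `@floop`. It only shows how a captured variable that is assigned more than once ends up stored in a `Core.Box`, which is the condition the error message reports.

```julia
# Hypothetical stand-in: `output` is captured by a closure and later reassigned.
function reused_name(n)
    output = zeros(n)
    fill_output! = i -> (output[i] = i)   # closure captures `output`
    foreach(fill_output!, 1:n)
    output = output .* 2                  # second assignment to the same name
    return fill_output!, output
end

# Same computation, but every name is assigned exactly once.
function distinct_names(n)
    output = zeros(n)
    fill_output! = i -> (output[i] = i)
    foreach(fill_output!, 1:n)
    doubled = output .* 2                 # new name instead of reassigning
    return fill_output!, doubled
end

closure_reused, result_reused = reused_name(3)
closure_distinct, result_distinct = distinct_names(3)

# Inspect how each closure stores its captured variable.
println(fieldtypes(typeof(closure_reused)))    # the capture is a Core.Box
println(fieldtypes(typeof(closure_distinct)))  # the capture is a Vector{Float64}
```

Both versions return the same numbers; only the way the closure holds `output` differs.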
Aah, I return the output structure immediately after the floop loop, but I did have a single-threaded version in a later section within the same function which made use of the same variable name.
It looks like using two different names fixes the issue. I guess the compiler couldn’t tell that there was no interference between those two sections of code.
My mental model of Julia’s multithreading is still pretty shaky (is false sharing still a risk?), but it looks like my problem is solved.
Thanks a lot!
@Mason I haven’t read that link yet (I did read the FLoops docs); I’ll check it out, thanks as well.
For the record, the issue of closure captures getting boxed is not specifically related to multithreading. Some attempts at improving the situation are by improving the Julia compiler’s lowering pass (which comes after parsing and before inference), while others rely on making the optimizer/inference do more work.
What makes this relevant to multithreading is basically just that closures are used naturally and often as part of any such interface, coupled with the fact that boxing may cause subtle bugs in concurrent usage.
In case you don’t know, a closure is a “local” function that may “capture” variables from its enclosing scope.
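As a small sketch of capturing and boxing (the function names `make_boxed_counter` and `make_ref_counter` are made up for illustration): a closure that reassigns a captured variable forces that variable into a `Core.Box`, whereas binding the name once to a `Ref` and mutating its contents keeps the capture concretely typed.

```julia
# Hypothetical example: `n_calls` is reassigned inside the closure, so it is boxed.
function make_boxed_counter()
    n_calls = 0
    increment = () -> (n_calls += 1)
    return increment
end

# Same behavior, but `n_calls` is bound once to a Ref; only its contents change.
function make_ref_counter()
    n_calls = Ref(0)
    increment = () -> (n_calls[] += 1)
    return increment
end

boxed_counter = make_boxed_counter()
ref_counter = make_ref_counter()
boxed_counter()
ref_counter()

# Inspect how each closure stores the captured variable.
println(fieldtypes(typeof(boxed_counter)))  # the capture is a Core.Box
println(fieldtypes(typeof(ref_counter)))    # the capture is a Base.RefValue
```

Note that the `Ref` version only addresses the boxing; by itself it does nothing to make concurrent updates safe.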
In my opinion, avoiding the reuse of variable names is a good practice in general. So I, more or less, try to write in a single-assignment form. Many will disagree, I guess. One of the benefits is having more information while debugging.