On modifying immutables by reference in multithreaded code

multithreading

#1

One often needs to update immutables in place. For example:

immutable footype_imm
    c1::Int8
    c2::Int8
    c3::Int8
    c4::Int8
    c5_8::Int32
    c9_16::Int64
end

# Replace field c1 of A[idx] by building a new value from the old one.
function setfoo(A, idx, val) @inbounds begin
    av = A[idx]
    A[idx] = footype_imm(val, av.c2, av.c3, av.c4, av.c5_8, av.c9_16)
    return 0
end end

code_native(setfoo, (Vector{footype_imm},Int64,Int8))

	.text
Filename:...
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 11
	shlq	$4, %rsi
	movq	(%rdi), %rax
Source line: 12
	movb	%dl, -16(%rax,%rsi)
Source line: 13
	xorl	%eax, %eax
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)

So we see that Julia/LLVM is clever about this: it writes only the updated byte (the single movb) and reads nothing from the element itself, so it does not stall on an unneeded memory read.

This is very good, but has different multithreaded semantics from the alternative strategy of reading all the fields, and writing them again.
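To make the difference concrete, here is a minimal sketch that replays one possible interleaving by hand, in a single thread. The type `TwoBytes` and the variable names are made up for illustration, and the code uses current Julia syntax, where `immutable` is spelled `struct`.

```julia
# Hypothetical two-field bits type, only for this illustration.
struct TwoBytes
    c1::Int8
    c2::Int8
end

A = [TwoBytes(0, 0)]

# "Read all fields, write them all back" strategy, interleaved by hand:
seen_by_t1 = A[1]                       # "thread 1" loads the whole element
A[1] = TwoBytes(A[1].c1, 2)             # "thread 2" updates c2 in between
A[1] = TwoBytes(1, seen_by_t1.c2)       # "thread 1" writes back a stale c2

lost_update = A[1]                      # c2 == 0: thread 2's update is gone
```

With the single-byte store seen in the native code above, thread 1 would never touch `c2`, so the same interleaving would leave both updates in place.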

Sometimes LLVM merges multiple writes into a single one (e.g. 4 movb become one movl); this is very good, because it is faster. It has (slightly, rarely) different multithreaded semantics, though.

For these reasons, and because 99.9% of code does not care about these details, I think it is actually quite a good idea that the multithreaded semantics of such updates of immutables appear to be undefined (though this should perhaps be documented?).

I wanted to ask whether there is a way of getting well-defined multithreaded semantics in the remaining 0.1% of code where this matters.

The most important thing would be a way to update only certain fields in place: that is, a way of writing “setfoo” such that the observed multithreaded semantics are guaranteed and not just something the compiler may or may not optimize into.

Say I have multiple threads operating on the same array of bits types, where different threads update different fields. It feels somewhat dangerous to rely on this compiler optimization for the correctness of my code, especially since I fear the day when LLVM/Julia becomes so smart that it figures out it can widen 7 movb into 1 movq. That is a very good idea in most single-threaded code, but it would totally break multithreaded code that relies on this assumption.

…OK, sure, there is a way involving unsafe_store! and pointer arithmetic, which generates the same native code and should be safe from overzealous LLVM optimizations (I guess? as long as I create barriers with @noinline to prevent the optimizer from becoming too smart?). This is, however, amazingly inconvenient to use.
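For reference, a minimal sketch of that pointer-arithmetic route, assuming Julia 1.x (so `struct` instead of `immutable`). The names `FooImm` and `unsafe_setfield!` are hypothetical; `pointer`, `fieldoffset`, `fieldtype`, `unsafe_store!` and `GC.@preserve` are the Base functions doing the work. Note that this is still a plain, non-atomic store, so the sketch does not by itself establish any language-level guarantee.

```julia
# Same layout as footype_imm above, in Julia 1.x syntax.
struct FooImm
    c1::Int8
    c2::Int8
    c3::Int8
    c4::Int8
    c5_8::Int32
    c9_16::Int64
end

# Hypothetical helper: store `val` into field number `fieldidx` of A[idx]
# without reading or rewriting the other bytes of the element.
function unsafe_setfield!(A::Vector{T}, idx::Integer, fieldidx::Integer, val) where {T}
    @assert isbitstype(T)
    checkbounds(A, idx)
    FT = fieldtype(T, fieldidx)
    GC.@preserve A begin
        # Address of the element, shifted by the byte offset of the field.
        p = convert(Ptr{FT}, pointer(A, idx)) + fieldoffset(T, fieldidx)
        unsafe_store!(p, convert(FT, val))
    end
    return A
end

A = [FooImm(0, 0, 0, 0, 0, 0) for _ in 1:1000]

# Each iteration owns exactly one of the four Int8 fields.
Threads.@threads for f in 1:4
    for i in eachindex(A)
        unsafe_setfield!(A, i, f, f)
    end
end
```

After the loop every element should have fields `c1` to `c4` set to 1 to 4 and the remaining fields untouched, since each store only addresses the bytes of its own field.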


#2

Yes, that’s all you can do for now. Ref https://github.com/JuliaLang/julia/pull/21912


#3

Thanks, reading that github reference was kinda depressing, pitting the perfect against the good.

At a quick glance, it looks non-trivial to get the functionality proposed in the GitHub ref as a module. Would you happen to have a link to some code (a module, not a Julia patch) lying around that does all the necessary pointer arithmetic?

And, seeing that you were heavily involved in the GitHub discussion, I’d ask for your personal guess: do you think such functionality will come soon? Or rather that it will never come, or only in the far future?

If it comes soon, then I might grudgingly rely on the optimization in personal code, checking the generated code_native for thread-safety by hand. If such functionality is a long way off, then I would need to find/write a module that gives me a non-terrible syntax to get the required behavior by pointer arithmetic.

PS. Is my assessment correct that the current behavior in Julia is undefined? (I would absolutely never report this as a bug, even if true, because after reading your link I fear that someone would argue that read-writeback is “the only correct” behavior and shoot performance to hell. Undefined behavior that is not exploited for optimizations, unlike gcc-style removal of bounds checks in security-relevant code, is harmless.)