Writable global const arrays

I tried to implement the xoshiro RNG for Base Julia (cf. https://github.com/JuliaLang/julia/issues/27614).

So, for this to work properly, I thought about using something like

const global_xorostate = zeros(UInt64, Threads.nthreads(), 5, 8)

In reality, that would wrap a ccall(:posix_memalign, ...): each thread gets 5 cache lines (5 × 8 UInt64s = 320 bytes) to play with.
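For reference, a minimal sketch of what such a wrapper could look like (assuming a POSIX system; `alloc_aligned` is a made-up name, not anything in Base):

```julia
# Hypothetical sketch: allocate `nbytes` of 64-byte-aligned memory via
# posix_memalign. The memory is never freed here, so this is only suitable
# for process-lifetime state like a global RNG buffer.
function alloc_aligned(nbytes::Integer, alignment::Integer = 64)
    out = Ref{Ptr{Cvoid}}(C_NULL)
    err = ccall(:posix_memalign, Cint, (Ptr{Ptr{Cvoid}}, Csize_t, Csize_t),
                out, alignment, nbytes)
    err == 0 || throw(OutOfMemoryError())
    return Ptr{UInt64}(out[])
end
```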

Now I am already lost:

julia> f()=global_xorostate[1]
f (generic function with 1 method)

julia> @code_typed f()
CodeInfo(
1 ─ %1 = invoke Base.getindex(Main.global_xorostate::Array{UInt64,3}, 1::Int64)::UInt64
└──      return %1
) => UInt64

julia> @code_native f()
; ┌ @ REPL[12]:1 within `f'
	pushq	%rax
	movabsq	$julia_getindex_16904, %rax
	movabsq	$139722737349536, %rdi  # imm = 0x7F13BC2073A0
	movl	$1, %esi
	callq	*%rax
	popq	%rcx

This should be a single memory load from a known address: the compiler should know that pointer_from_objref(global_xorostate) cannot change, that pointer(global_xorostate) cannot change, and that the size cannot change either (the array is not one-dimensional, so it cannot be resized).

So, my question: How do I tell julia 1.4/master that it is perfectly fine to chase these pointers at compile time instead of runtime? How do I get rid of the invoke?

Or should I try something else?

My problem with a struct is that I don’t know how to force 64-byte alignment. A secondary problem is that I don’t know how to avoid an additional indirection through Threads.threadid() with structs: Threads.threadid() should only be used as an offset for loads of the payload, not as an offset to load a pointer to the payload. (I mostly know the desired assembly code; my issue is how to coax Julia into emitting it.)
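To illustrate the addressing I have in mind (one flat, aligned buffer, with the thread id contributing only a byte offset into it — the names and constants here are placeholders, not real API):

```julia
# Hypothetical layout: each thread owns 5 cache lines (5 * 64 = 320 bytes)
# of a single flat, 64-byte-aligned buffer. threadid() is only folded into
# the address computation; it never selects a pointer to chase.
const STATE_BYTES_PER_THREAD = 5 * 64

thread_state(base::Ptr{UInt64}) =
    base + (Threads.threadid() - 1) * STATE_BYTES_PER_THREAD
```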


I overuse generated functions, and should really spend more time considering alternatives.
That said

julia> @generated g() = :(unsafe_load($(pointer(global_xorostate))))
g (generic function with 1 method)

julia> @code_typed g()
CodeInfo(
1 ─ %1 = Base.pointerref(Ptr{UInt64} @0x00007f0e5575fd40, 1, 1)::UInt64
└──      return %1
) => UInt64

julia> @code_native g()
	movabsq	$139699540065600, %rax  # imm = 0x7F0E5575FD40
	movq	(%rax), %rax

Did you already consider and reject this approach?

Also, wouldn’t a simple pointer (like your ccall(:posix_memalign, ...)) produce the asm you wanted?

julia> align(x::T, a = 64) where {T} = reinterpret(T, (reinterpret(UInt, x) + a - 1) & (-a % UInt))

julia> const XORO_STATE_PTR = Base.unsafe_convert(Ptr{UInt64}, align(Libc.malloc(sizeof(UInt64) * Threads.nthreads() * 5 * 8 + 63), 64))
Ptr{UInt64} @0x000056418ca4af00

julia> reinterpret(UInt, XORO_STATE_PTR) % 64

julia> h() = unsafe_load(XORO_STATE_PTR)
h (generic function with 1 method)

julia> h()

julia> @code_typed h()
CodeInfo(
1 ─ %1 = Main.XORO_STATE_PTR::Core.Compiler.Const(Ptr{UInt64} @0x000056418ca4af00, false)
β”‚   %2 = Base.pointerref(%1, 1, 1)::UInt64
└──      return %2
) => UInt64

julia> @code_native h()
	movabsq	$94839532465920, %rax   # imm = 0x56418CA4AF00
	movq	(%rax), %rax

You can of course force 64-byte alignment by just allocating extra memory and then incrementing the pointer to the next multiple of 64.

There will be problems with functions like these in a module though, because that pointer is baked in at compile time and will not be constant between module loads.

So the best way would be

julia> mutable struct ptrwrap
           ptr::Ptr{Cvoid}
       end

julia> const global_rng = ptrwrap(C_NULL)

and a global_rng.ptr = ccall(:posix_memalign, ...) wrapper in the __init__ function, plus a definition default_rng() = convert(Ptr{rng_instance}, global_rng.ptr + (Threads.threadid() - 1)*320)?
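Spelled out as a module, that pattern might look like this (a sketch only; I return Ptr{UInt64} since rng_instance is just a placeholder above, and the module name is made up):

```julia
module XoroGlobal

# The aligned buffer is (re)allocated in __init__, which runs every time the
# module is loaded, so the stored pointer stays valid across precompilation.
mutable struct PtrWrap
    ptr::Ptr{Cvoid}
end

const global_rng = PtrWrap(C_NULL)

function __init__()
    out = Ref{Ptr{Cvoid}}(C_NULL)
    err = ccall(:posix_memalign, Cint, (Ptr{Ptr{Cvoid}}, Csize_t, Csize_t),
                out, 64, Threads.nthreads() * 320)
    err == 0 || throw(OutOfMemoryError())
    global_rng.ptr = out[]
end

# Each thread's 320-byte (5 cache line) slice of the shared buffer.
default_rng() = Ptr{UInt64}(global_rng.ptr + (Threads.threadid() - 1) * 320)

end # module
```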

Master hasn’t solved this issue either: there is a pesky invoke in Base.default_rng() as well (probably part of why rand() is so slow under multithreading).