Disable compiler optimization

is there a reliable way to disable optimization for code like this?

function read_reg64(base::Ptr{UInt32}, offset::Int)
    lower = unsafe_load(base, offset + 1)
    upper = unsafe_load(base, offset + 2) % UInt64

    upper << 32 + lower
end

this gets optimized to

movq    (%rdi,%rsi,4), %rax

which is fair, but base points at device memory that doesn’t support 64-bit reads. We must do two 32-bit reads instead and combine the results ourselves.
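What the two-read version has to compute afterwards is just this integer arithmetic (a minimal sketch; combine and the test values are illustrative, not part of the original code):

```julia
# Combine two 32-bit halves into one 64-bit value, as read_reg64
# does after its two separate loads.
combine(lower::UInt32, upper::UInt32) = (UInt64(upper) << 32) + lower

combine(0xDEADBEEF, 0x01234567)  # == 0x01234567DEADBEEF
```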

are you saying LLVM mis-compiled? what platform are you on?

you shouldn’t need to “disable optimization” – compiler shouldn’t give you illegal optimization on that platform

The optimization is not illegal if base points at conventional RAM. In my case it points at a memory-mapped device, and the optimized code doesn’t produce the same result.

in C/C++ you’d typically use #pragma optimize("", off) (that one is MSVC-specific) or something along these lines

if the device is memory-mapped, then what does it mean to say “it doesn’t support 64-bit read” if your host Arch/OS is 64-bit?

if the memory map works correctly (which is the OS’s job, and then your device driver’s job?), shouldn’t it handle whatever “read” call the OS issues correctly?

If you’re DIY-ing/hacking together some memory-management scheme, then you still shouldn’t rely on “let’s turn off optimization and hope it compiles to this specific thing”. In that case you may want to use LLVM.jl directly and handcraft exactly what you want


That doesn’t sound legal; the OS should have mapped your device somewhere in the 64-bit address space, unless you are running a 32-bit-only program with 64-bit Julia, but I guess if that were the case things would have gone wrong before this point.

Hi @green.nsk, I’ve actually had the same situation arise when reading from memory-mapped hardware performance counter registers. My solution at the time was to put the pointer loads behind functions marked as @noinline to prevent the read coalescing. Since Julia 1.8, call sites can also be annotated with @noinline, which would look something like this:

function read_reg64(base::Ptr{UInt32}, offset::Int)
    lower_offset = offset + 1
    lower = @noinline unsafe_load(base, lower_offset)
    upper_offset = offset + 2
    upper = @noinline unsafe_load(base, upper_offset)
    return ((upper % UInt64) << 32) + lower
end

This yields the following assembly.

push    rbp
.cfi_def_cfa_offset 16
push    r14
.cfi_def_cfa_offset 24
push    rbx
.cfi_def_cfa_offset 32
.cfi_offset rbx, -32
.cfi_offset r14, -24
.cfi_offset rbp, -16
mov     rbx, rsi
mov     r14, rdi
inc     rsi
movabs  rax, offset j_unsafe_load_2080
call    rax
mov     ebp, eax
add     rbx, 2
movabs  rax, offset j_unsafe_load_2081
mov     rdi, r14
mov     rsi, rbx
call    rax
                                # kill: def $eax killed $eax def $rax
shl     rax, 32
mov     ecx, ebp
or      rax, rcx
pop     rbx
.cfi_def_cfa_offset 24
pop     r14
.cfi_def_cfa_offset 16
pop     rbp
.cfi_def_cfa_offset 8
ret

It’s not ideal because it spends a whole function call just to load a 32-bit integer (i.e. to execute a single instruction), but hopefully it helps fix the issue.
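As a sanity check, on ordinary RAM the split reads must agree with the coalesced 64-bit load. A minimal self-contained sketch (repeating the definition above; the buffer contents are made-up test values, and the call-site @noinline requires Julia 1.8+):

```julia
# Split the 64-bit read into two @noinline 32-bit loads, as above.
function read_reg64(base::Ptr{UInt32}, offset::Int)
    lower_offset = offset + 1
    lower = @noinline unsafe_load(base, lower_offset)
    upper_offset = offset + 2
    upper = @noinline unsafe_load(base, upper_offset)
    return ((upper % UInt64) << 32) + lower
end

# Exercise it against a plain Vector{UInt32} (little-endian layout):
buf = UInt32[0xDEADBEEF, 0x01234567]
GC.@preserve buf begin
    @assert read_reg64(pointer(buf), 0) == 0x01234567DEADBEEF
end
```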


I think you’re looking for the equivalent of C’s volatile. Julia itself doesn’t expose those semantics, but this should work for your case:

function volatile_load(x::Ptr{UInt32})
    @inline
    return Base.llvmcall(
        """
        %ptr = inttoptr i64 %0 to i32*
        %val = load volatile i32, i32* %ptr, align 1
        ret i32 %val
        """,
        UInt32,
        Tuple{Ptr{UInt32}},
        x
    )
end

assuming you’re on a 64-bit machine (on 32-bit you’d do inttoptr i32 instead). You may have to adjust that in a future version due to opaque pointers, writing load volatile i32, ptr %ptr, align 1 instead. It’s used like

function read_reg64(base::Ptr{UInt32}, offset::Int)
    lower = volatile_load(base + sizeof(UInt32)*(offset + 1))
    upper = volatile_load(base + sizeof(UInt32)*(offset + 2)) % UInt64

    (upper << 32) | lower
end

though I’m not 100% sure about the +1 business there, since raw unsafe_load already assumes i starts at 1 (i.e. it takes the 1-indexed conversion into account already). I wrote the above assuming you want offset to be -1-based.

unsafe_load(p::Ptr{T}, i::Integer=1)

Load a value of type T from the address of the ith element (1-indexed) starting at p. This is equivalent to the C expression p[i-1].
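That 1-indexed convention can be checked directly on an ordinary array (the buffer values here are made up): unsafe_load(base, offset + 1) addresses the same element as base + sizeof(UInt32) * offset.

```julia
# unsafe_load(p, i) reads p[i-1] in C terms, so the 1-indexed form
# and explicit byte arithmetic address the same element:
buf = UInt32[10, 20, 30]
offset = 1
GC.@preserve buf begin
    p = pointer(buf)
    # both read buf[2]:
    @assert unsafe_load(p, offset + 1) == unsafe_load(p + sizeof(UInt32) * offset)
end
```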


Thank you both, @Sukera @hildebrandmw ! Both solutions work great!

I imagine the @noinline unsafe_load() may indeed be marginally slower; however, the underlying PCIe round trip likely dwarfs the overhead.
