Disable compiler optimization

is there a reliable way to disable optimization for code like this?

function read_reg64(base::Ptr{UInt32}, offset::Int)
    lower = unsafe_load(base, offset + 1)
    upper = unsafe_load(base, offset + 2) % UInt64

    upper << 32 + lower
end

this gets optimized to

movq    (%rdi,%rsi,4), %rax

which is fair, but base points at device memory that doesn’t support 64-bit reads. We must do two 32-bit reads instead and combine the results ourselves.
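What the two-read version has to compute afterwards is just this integer arithmetic (a minimal sketch; combine and the test values are illustrative, not part of the original code):

```julia
# Combine two 32-bit halves into one 64-bit value, as read_reg64
# does after its two separate loads.
combine(lower::UInt32, upper::UInt32) = (UInt64(upper) << 32) + lower

combine(0xDEADBEEF, 0x01234567)  # == 0x01234567DEADBEEF
```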

are you saying LLVM mis-compiled? what platform are you on?

you shouldn’t need to “disable optimization” – compiler shouldn’t give you illegal optimization on that platform

The optimization is not illegal if base points at conventional RAM. In my case it points at a memory-mapped device, and the optimized code doesn’t produce the same result.

in C/C++ you’d typically use #pragma optimize("", off) (that one is MSVC-specific) or something along these lines

if the device is memory-mapped, then what does it mean to say “it doesn’t support 64-bit read” if your host Arch/OS is 64-bit?

if the memory map works correctly (which is the OS’s job, and then your device driver’s job?), shouldn’t it handle whatever “read” call the OS issues correctly?

If you’re DIY-ing/hacking together some memory-management scheme, then you still shouldn’t rely on “let’s turn off optimization and hope it compiles to this specific thing”. In that case you may want to use LLVM.jl directly and handcraft exactly what you want


That doesn’t sound legal; the OS should have mapped your device somewhere in the 64-bit address space, unless you are running a 32-bit-only program with 64-bit Julia, but I guess if that were the case things would have gone wrong before this point.

Hi @green.nsk, I’ve actually had the same situation arise when reading from memory-mapped hardware performance counter registers. My solution at the time was to put the pointer loads behind functions marked as @noinline to prevent the read coalescing. Since Julia 1.8, call sites can also be annotated with @noinline, which would look something like this:

function read_reg64(base::Ptr{UInt32}, offset::Int)
    lower_offset = offset + 1
    lower = @noinline unsafe_load(base, lower_offset)
    upper_offset = offset + 2
    upper = @noinline unsafe_load(base, upper_offset)
    return ((upper % UInt64) << 32) + lower
end

This yields the following assembly.

push    rbp
.cfi_def_cfa_offset 16
push    r14
.cfi_def_cfa_offset 24
push    rbx
.cfi_def_cfa_offset 32
.cfi_offset rbx, -32
.cfi_offset r14, -24
.cfi_offset rbp, -16
mov     rbx, rsi
mov     r14, rdi
inc     rsi
movabs  rax, offset j_unsafe_load_2080
call    rax
mov     ebp, eax
add     rbx, 2
movabs  rax, offset j_unsafe_load_2081
mov     rdi, r14
mov     rsi, rbx
call    rax
                                # kill: def $eax killed $eax def $rax
shl     rax, 32
mov     ecx, ebp
or      rax, rcx
pop     rbx
.cfi_def_cfa_offset 24
pop     r14
.cfi_def_cfa_offset 16
pop     rbp
.cfi_def_cfa_offset 8
ret

It’s not ideal because it spends a whole function call just to load a 32-bit integer (i.e. to execute a single instruction), but hopefully it helps fix the issue.
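As a sanity check, on ordinary RAM the split reads must agree with the coalesced 64-bit load. A minimal self-contained sketch (repeating the definition above; the buffer contents are made-up test values, and the call-site @noinline requires Julia 1.8+):

```julia
# Split the 64-bit read into two @noinline 32-bit loads, as above.
function read_reg64(base::Ptr{UInt32}, offset::Int)
    lower_offset = offset + 1
    lower = @noinline unsafe_load(base, lower_offset)
    upper_offset = offset + 2
    upper = @noinline unsafe_load(base, upper_offset)
    return ((upper % UInt64) << 32) + lower
end

# Exercise it against a plain Vector{UInt32} (little-endian layout):
buf = UInt32[0xDEADBEEF, 0x01234567]
GC.@preserve buf begin
    @assert read_reg64(pointer(buf), 0) == 0x01234567DEADBEEF
end
```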


I think you’re looking for the equivalent of C’s volatile. Julia itself doesn’t expose those semantics, but this should work for your case:

function volatile_load(x::Ptr{UInt32})
    @inline
    return Base.llvmcall(
        """
        %ptr = inttoptr i64 %0 to i32*
        %val = load volatile i32, i32* %ptr, align 1
        ret i32 %val
        """,
        UInt32,
        Tuple{Ptr{UInt32}},
        x
    )
end

assuming you’re on a 64-bit machine (on 32-bit you’d do inttoptr i32 instead). You may have to adjust that in a future version due to opaque pointers, writing load volatile i32, ptr %ptr, align 1 instead. It’s used like

function read_reg64(base::Ptr{UInt32}, offset::Int)
    lower = volatile_load(base + sizeof(UInt32)*(offset + 1))
    upper = volatile_load(base + sizeof(UInt32)*(offset + 2)) % UInt64

    (upper << 32) | lower
end

though I’m not 100% sure about the +1 business there, since raw unsafe_load already assumes i starts at 1 (i.e. it takes the 1-indexed conversion into account already). I wrote the above assuming you want offset to be -1-based.

unsafe_load(p::Ptr{T}, i::Integer=1)

Load a value of type T from the address of the ith element (1-indexed) starting at p. This is equivalent to the C expression p[i-1].
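That 1-indexed convention can be checked directly on an ordinary array (the buffer values here are made up): unsafe_load(base, offset + 1) addresses the same element as base + sizeof(UInt32) * offset.

```julia
# unsafe_load(p, i) reads p[i-1] in C terms, so the 1-indexed form
# and explicit byte arithmetic address the same element:
buf = UInt32[10, 20, 30]
offset = 1
GC.@preserve buf begin
    p = pointer(buf)
    # both read buf[2]:
    @assert unsafe_load(p, offset + 1) == unsafe_load(p + sizeof(UInt32) * offset)
end
```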


Thank you both, @Sukera @hildebrandmw ! Both solutions work great!

I imagine the @noinline unsafe_load() may indeed be marginally slower; however, the underlying PCIe round trip likely dwarfs the overhead.
