Machine code discrepancy between Julia and C


#1

I’m seeing an extra instruction in the machine code generated by Julia compared to the same thing from C and clang for a very simple function. Can anyone explain? I’m trying to devise a good example illustrating code introspection, type inference, and JIT for a “Why Julia?” talk.

julia> f(x) = x*x + 3
f (generic function with 1 method)

julia> @code_native(f(1.0))
        .text
Filename: REPL[46]
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 1
        mulsd   %xmm0, %xmm0
        movabsq $140682184883512, %rax  # imm = 0x7FF31FA7FD38
        addsd   (%rax), %xmm0
        popq    %rbp
        retq
        nopl    (%rax,%rax)

Note the movabsq $14068... instruction. That’s moving contents of register %rax to memory, right? It’s not there in the machine code from C.

sophist$ cat f.c
double f(double x) {
  return x*x + 3;
}
sophist$ clang -O2 -c f.c ; objdump -d f.o   # cut out some irrelevant output
0000000000000000 <f>:
   0:   f2 0f 59 c0             mulsd  %xmm0,%xmm0
   4:   f2 0f 58 05 00 00 00    addsd  0x0(%rip),%xmm0        # c <f+0xc>
   b:   00 
   c:   c3                      retq   
gibson@sophist$ 

I’m running clang version 4.0.1 and generic linux binary for julia-0.6 on Intel x86-64.


#2

No, it’s moving a constant into a register.

The only difference is that clang uses RIP-relative addressing.


#3

And now the long answer.

Also, the difference between the generated code is that (I’m not sure this detail fits an intro talk, though) for JIT compilation we use the large code model with static relocation, while the C code is likely compiled for the small code model with PIC relocation.

Large code model means that the (code and data) section offsets can be of any size (up to the machine word size), while small code model means that you can assume all offsets fit within 32 bits. Due to constraints in instruction encoding, this affects which instructions you can use to generate addresses on 64-bit platforms. For JIT, the large code model is typically used since it’s very hard to guarantee that sections allocated at runtime are close to each other. (They usually are; it’s just hard to be sure.) For static compilation, you can use the small code model since the final library being compiled is typically smaller than (and can be loaded at runtime within) 2GB, so you can safely use signed 32-bit offsets everywhere.
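To make the 32-bit constraint concrete, here is a minimal sketch of the range check behind it. The helper name `fits_rel32` is made up for illustration, and the conversion from unsigned to signed assumes the usual two's-complement behaviour of mainstream compilers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if `target` can be reached from
   `next_insn` (the address of the instruction following the load) with a
   signed 32-bit displacement, which is what x86-64 RIP-relative
   addressing encodes. */
bool fits_rel32(uint64_t next_insn, uint64_t target)
{
    /* Wrapping unsigned subtraction, reinterpreted as signed, gives the
       displacement (assumes two's-complement conversion). */
    int64_t disp = (int64_t)(target - next_insn);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

With code at a low address such as 0x400000 and data at the address from the Julia output (0x7FF31FA7FD38), the check fails, so a displacement-based load could not reach the constant. That is the kind of situation the large code model has to allow for.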

Static relocation means that the code is compiled for only one load address, which can be slightly more efficient and is fine for JIT since we never move code around. It would be a problem for shared libraries (and for executables under ASLR) since the load address can’t be determined at compile time. That’s why C compilers produce position-independent code (PIC), which uses the current instruction address (the PC, for program counter, or the %rip register on x86) to find other parts of the binary by offset.
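A rough sketch of why the PC-relative form survives a change of load address while an absolute address does not. The function names and the example offsets are assumptions for illustration, not anything a real loader exposes.

```c
#include <stdint.h>

/* Absolute address of a datum: depends on where the image is loaded,
   so code that embeds it is only valid for that one load address. */
uint64_t absolute_address(uint64_t load_base, uint64_t data_offset)
{
    return load_base + data_offset;
}

/* PC-relative displacement from the end of an instruction to a datum in
   the same image: the load base appears on both sides and cancels out. */
int64_t pc_relative_disp(uint64_t load_base,
                         uint64_t next_insn_offset,
                         uint64_t data_offset)
{
    uint64_t pc = load_base + next_insn_offset;
    uint64_t target = load_base + data_offset;
    return (int64_t)(target - pc);
}
```

Calling `pc_relative_disp` with two different `load_base` values and the same offsets gives the same displacement, whereas `absolute_address` changes with the base.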

So in both cases the compiler needs to emit a load into an xmm register, since AFAICT that’s the fastest way to initialize it (there is no immediate load for xmm registers). The JIT generates the address with an immediate move of a 64-bit constant. The C compiler generates the address directly in the load instruction, as a PC-relative load with a 32-bit offset. The offset you see in the .o is currently 0 since it’s the linker’s job to decide where to position the sections and fill in that offset.
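As a sketch of how the RIP-relative target is computed, the helpers below pull the 32-bit displacement out of the addsd bytes from the objdump listing and add it to the address of the next instruction. Both function names are hypothetical, and `read_disp32` assumes the displacement is the last four bytes of the instruction, which holds for this addsd but not for instructions with a trailing immediate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read the little-endian 32-bit displacement stored
   in the last four bytes of an instruction (assumes no trailing immediate). */
int32_t read_disp32(const uint8_t *insn, size_t len)
{
    uint32_t raw = (uint32_t)insn[len - 4]
                 | (uint32_t)insn[len - 3] << 8
                 | (uint32_t)insn[len - 2] << 16
                 | (uint32_t)insn[len - 1] << 24;
    return (int32_t)raw;
}

/* RIP-relative target: %rip holds the address of the *next* instruction,
   so the target is insn_addr + insn_len + disp. */
uint64_t rip_relative_target(uint64_t insn_addr, size_t insn_len, int32_t disp)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

For the listing above, the addsd starts at offset 4, is 8 bytes long (`f2 0f 58 05 00 00 00 00`), and has a displacement of 0, so the target is 4 + 8 + 0 = 0xc. That matches the `# c <f+0xc>` annotation objdump prints, and it is only a placeholder until the linker fills in the real displacement.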