# Understanding generated assembly for simple loop

I am trying to understand how simple Julia functions are converted into assembly.
In this case, I am interested in summing the integer series up to some argument:

``````
function foo(n)
    res = zero(n)
    for i ∈ 1:n
        res += i
    end
    res
end
``````

Calling:

``````julia> code_native(foo,(Int64,),:intel)
``````

the result is

``````	.text
Filename: REPL
push	rbp
mov	rbp, rsp
xor	eax, eax
Source line: 3
test	rdi, rdi
jle	L32
lea	rax, [rdi - 1]
lea	rcx, [rdi - 2]
mul	rcx
shld	rdx, rax, 63
lea	rax, [rdx + 2*rdi - 1]
Source line: 6
L32:
pop	rbp
ret
nop	word ptr cs:[rax + rax]

``````

The cool thing is that LLVM has replaced the for loop by the closed-form solution `foo(N) = N*(N+1)÷2`.
But I am puzzled by the line:

``````	shld	rdx, rax, 63
``````

This shifts `rdx` 63 bits to the left while filling the 63 vacated bit positions with the 63 most significant bits of `rax`.
If the least significant bit of `rdx` were 0, this would compute `rdx = rax/2`, which is what we want.
However, `rdx` looks uninitialized, and if its least significant bit happened to be 1, the result of the operation would be
`rdx = rax/2 + 2^63`,
which would be wrong.

I am probably missing something obvious.

Thank you.


You explicitly asked for intel flavor so I would assume you know that `rdx` is the output…

Edit: Sorry I thought you were asking about the direction of the operation.

According to MUL — Unsigned Multiply, the `mul` instruction stores the high half of the 128-bit product into `rdx`.

That’s why I hate x86…


Thank you.
This explains it.

I cannot figure out how to mark your post as the solution.
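To double-check that reading, here is a small sketch of what the `lea`/`mul`/`shld`/`lea` sequence computes, with `rdx` taken as the high half of the 128-bit product produced by `mul` (the register names in the comments are my interpretation of the assembly above):

``````julia
# rax = n - 1 and rcx = n - 2, reinterpreted as unsigned 64-bit values;
# `mul rcx` puts the 128-bit product in rdx:rax;
# `shld rdx, rax, 63` leaves (product >> 1) truncated to 64 bits in rdx;
# the final `lea` adds 2n - 1.
function foo_asm(n::Int64)
    rax = (n - 1) % UInt64
    rcx = (n - 2) % UInt64
    p = UInt128(rax) * UInt128(rcx)   # rdx:rax after `mul rcx`
    half = (p >> 1) % Int64           # effect of `shld rdx, rax, 63`
    return half + 2n - 1
end

foo_asm(10)   # 55, i.e. sum(1:10)
``````

So the sequence evaluates `(n-1)(n-2)/2 + 2n - 1`, which simplifies to `n(n+1)/2`.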

For information, the purpose of this disassembly exercise was to figure out what the immutability of Julia integers means for the generated code (whether it has a performance impact, such as extra allocations).
In this particular example, I was interested in the loop counter.
However, LLVM is smart and there is no loop counter in the generated code.

I get the same generated code for:

``````
g(i) = i

function foo(n)
    res = zero(n)
    for i ∈ 1:n
        res += g(i)
    end
    res
end
``````

i.e. the function `g` gets inlined.

Similarly, if I sum the squares, cubes, or 4th powers (g(i)=i^4), LLVM builds the closed-form solution and again there is no loop counter.
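For example, for the squares the loop and Faulhaber's closed form agree (a quick sanity check of the claim, not the generated code itself):

``````julia
# Loop version, written out explicitly; LLVM replaces it with the closed form.
function sumsq(n)
    res = zero(n)
    for i in 1:n
        res += i^2
    end
    res
end

# Faulhaber's formula for the sum of the first n squares.
closedsq(n) = n * (n + 1) * (2n + 1) ÷ 6

sumsq(100) == closedsq(100)   # true
``````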

So instead I tried this function:

``````function foobar(n,a)
    res = zero(n)
    for i ∈ 1:n
        res += a[i]
    end
    res
end
``````
``````julia> code_native(foobar,(Int64,Array{Int64,1}),:intel)
.text
Filename: REPL
xor	eax, eax
Source line: 3
test	rdi, rdi
jle	L49
Source line: 4
mov	r8, qword ptr [rsi]
mov	rdx, qword ptr [rsi + 24]
xor	ecx, ecx
xor	eax, eax
nop	word ptr cs:[rax + rax]
L32:
cmp	rcx, rdx
jae	L50
add	rax, qword ptr [r8 + 8*rcx]
Source line: 3
inc	rcx
cmp	rdi, rcx
jne	L32
Source line: 6
L49:
ret
L50:
push	rbp
mov	rbp, rsp
Source line: 4
mov	rdx, rsp
lea	rax, [rdx - 16]
mov	rsp, rax
inc	rcx
mov	qword ptr [rdx - 16], rcx
movabs	rcx, jl_bounds_error_ints
mov	edx, 1
mov	rdi, rsi
mov	rsi, rax
call	rcx
nop
``````

RAX stores the result `res`.
It is initialized to zero with `xor eax, eax` (though I don't know why this instruction appears twice).
RDI (1st argument) is the input `n`.
The array `a` is passed via RSI (2nd argument).
From the code, we can infer:
RSI must contain the address of the raw Int64 array data, which is loaded into R8;
address RSI+24 must contain the array size, which is loaded into RDX.
RCX is the counter `i-1`:
initialized to zero with `xor ecx, ecx`,
incremented with `inc rcx`.
The summation itself is done by the line

``````add	rax, qword ptr [r8 + 8*rcx]
``````

The counter is compared both to `n` for the loop (label L32) and to RDX for the array out-of-bounds exception (label L50).
Label L49 is the fast exit in case n ≤ 0.
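Putting that register reading together: the function simply sums the first `n` elements, and indexing past the end takes the `L50` path. A quick behavioral check (the `try` block is just there to show the bounds error is still raised):

``````julia
function foobar(n, a)
    res = zero(n)
    for i in 1:n
        res += a[i]
    end
    res
end

foobar(3, [10, 20, 30, 40])   # 60

# Indexing past the end goes through the jl_bounds_error_ints path:
caught = try
    foobar(5, [10, 20, 30, 40])
    false
catch e
    e isa BoundsError
end
``````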

I guess it should not be surprising that the loop counter is a mutating register, since LLVM has full freedom to optimize local variables, but I wanted to understand the simple things first. Next time I will try to see what happens in functions that modify integers stored in structs or arrays passed as arguments (there could be multiple external references to those integers).

Interestingly if I use the @inbounds macro:

``````function foobar(n,a)
    res = zero(n)
    for i ∈ 1:n
        @inbounds res += a[i]
    end
    res
end
``````

I get:

``````julia> code_native(foobar,(Int64,Array{Int64,1}),:intel)
.text
Filename: REPL
pushq	%rbp
movq	%rsp, %rbp
xorl	%eax, %eax
Source line: 3
testq	%rdi, %rdi
jle	L156
Source line: 4
movq	(%rsi), %r8
xorl	%eax, %eax
movl	$1, %edx
Source line: 3
cmpq	$4, %rdi
jb	L130
movq	%rdi, %rsi
andq	$-4, %rsi
xorl	%eax, %eax
movq	%rdi, %rcx
andq	$-4, %rcx
movl	$1, %edx
je	L130
movq	%rsi, %rdx
orq	$1, %rdx
leaq	16(%r8), %rax
pxor	%xmm0, %xmm0
movq	%rcx, %rsi
pxor	%xmm1, %xmm1
nopl	(%rax)
Source line: 4
L80:
movdqu	-16(%rax), %xmm2
movdqu	(%rax), %xmm3
Source line: 3
jne	L80
Source line: 4
pshufd	$78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
movd	%xmm0, %rax
cmpq	%rdi, %rcx
Source line: 3
je	L156
L130:
incq	%rdi
subq	%rdx, %rdi
leaq	-8(%r8,%rdx,8), %rcx
nopl	(%rax)
Source line: 4
L144:
Source line: 3
decq	%rdi
jne	L144
Source line: 6
L156:
popq	%rbp
retq
nop
``````

No more calls to `jl_bounds_error_ints`.
I was expecting the same code as before, just without the two lines involving the RDX register (array size) and without the code following label L50, but the code is vastly different. There is much more code in total: it looks like LLVM is using the XMM registers and SIMD instructions to generate faster code, so you get more of a performance increase than just the removal of bounds checking.
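One way to confirm the vectorization without deciphering the assembly is to look at the LLVM IR instead, where the SIMD part shows up as a `vector.body` block with `<4 x i64>` operations (block names may vary with the LLVM version; the `code_llvm` call is left commented out since it just prints):

``````julia
function foobar_inb(n, a)
    res = zero(n)
    for i in 1:n
        @inbounds res += a[i]
    end
    res
end

# Uncomment to inspect the IR and look for a `vector.body` block:
# code_llvm(foobar_inb, (Int64, Vector{Int64}))

foobar_inb(4, [1, 2, 3, 4])   # 10, same result as the bounds-checked version
``````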


Loop example initializing an integer array:

``````function foobar(n,a)
    for i ∈ 1:n
        a[i] = i
    end
end
``````
``````julia> code_native(foobar,(Int64,Array{Int64,1}),:intel)
.text
Filename: In
Source line: 2
test	rdi, rdi
jle	L37
Source line: 3
mov	r8, qword ptr [rsi]
mov	rcx, qword ptr [rsi + 24]
xor	edx, edx
nop
L16:
lea	rax, [rdx + 1]
cmp	rdx, rcx
jae	L38
mov	qword ptr [r8 + 8*rdx], rax
Source line: 2
cmp	rdi, rax
mov	rdx, rax
jne	L16
Source line: 3
L37:
ret
L38:
push	rbp
mov	rbp, rsp
mov	rdx, rsp
lea	rcx, [rdx - 16]
mov	rsp, rcx
mov	qword ptr [rdx - 16], rax
movabs	rax, jl_bounds_error_ints
mov	edx, 1
mov	rdi, rsi
mov	rsi, rcx
call	rax
nop
``````

Very similar to the previous code.
RCX is now the array size.
The counter is shared between RDX and RAX.
R8 is the pointer to the raw Int64 array data.
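Behaviorally, this version just fills the first `n` slots with `1:n` (renamed `fillit!` here, a hypothetical name following the `!` convention for mutating functions):

``````julia
function fillit!(n, a)
    for i in 1:n
        a[i] = i
    end
end

a = zeros(Int64, 4)
fillit!(4, a)
a   # [1, 2, 3, 4]
``````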

Now with an integer struct:

``````mutable struct pair{T}
    x::T
    y::T
end

function setPair(p,x,y)
    p.x = x
    p.y = y
    return
end
``````
``````julia> code_native(setPair,(pair{Int64},Int64,Int64),:intel)
.text
Filename: In
push	rbp
mov	rbp, rsp
Source line: 2
mov	qword ptr [rdi], rsi
Source line: 3
mov	qword ptr [rdi + 8], rdx
Source line: 4
pop	rbp
ret
nop	dword ptr [rax]

``````

This code is so compact. I was expecting an integer allocation, since I thought there could be an immutable integer variable referencing the same address as `p.x` or `p.y`.

As in:

``````julia> p = pair(1,2)
pair{Int64}(1,2)
julia> ref = p.x
1
julia> setPair(p,0,0)
julia> ref
1
julia> p.x
0
``````

Now I realize that the allocation (plus copy) must happen when the immutable integer `ref` is created from the mutable struct `p`.

Thinking about it, it makes sense that the struct content should be mutated in place so that the struct can be packed optimally, like an array of Int64, instead of storing references to both x and y.
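This packing can actually be checked directly from Julia, without reading any assembly (a quick sketch):

``````julia
mutable struct pair{T}
    x::T
    y::T
end

sizeof(pair{Int64})           # 16 bytes: the two Int64s are stored inline, no boxes
fieldoffset(pair{Int64}, 2)   # 8: y sits right after x, like a C struct
``````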

This code_native method is really cool!

I’m not sure if you are asking any questions or just using this thread as a blog post but if you want any help, you should specify what you mean by “integer allocation”…

Sorry for the confusion. I am just writing down the results of my simple experiments in case they can be useful to someone else.

By integer allocation, I meant that my initial understanding was that when you do:

``````ref = p.x
``````

`ref` becomes an alias for the data at the same memory location as `p.x` (this is the case in Python).

So in routine setPair:

``````function setPair(p,x,y)
    p.x = x
    p.y = y
    return
end
``````

We assume x and y are Int64 and thus are passed by value (according to the Julia docs and also what can be seen from the experiments above) in registers (or on the stack, depending on argument position and calling convention).
If the memory location for the integer aliased by p.x and ref were shared, it would not be possible to change p.x in place, because ref is an alias for an immutable integer. Thus the only possibility would be to allocate a new memory location for the integer, copy the content of the register, and change the x reference in p to point to this new location.
But because Julia packs pair{Int64} compactly like a C struct (which is a good thing), p.x has to be mutated in place. Thus the copy must occur when doing:

``````ref = p.x
``````

This is different from Python:

``````In : class Pair:
...:     def __init__(self,x,y):
...:         self.x = x
...:         self.y = y
...:

In : p=Pair(1,2)

In : p.x = 32

In : import sys

In : sys.getrefcount(32)
Out: 178

In : ref = p.x

In : sys.getrefcount(32)
Out: 179

In : p.x=3

In : sys.getrefcount(32)
Out: 178

In : ref
Out: 32

``````

The Python way can only work when the struct p holds a reference to x instead of the value.
I am thinking it could also work if the struct were immutable, but this would make garbage collection really complex because ref would hold part of the memory of p…

In trying to understand the behavior, I came across this paragraph (from https://docs.julialang.org/en/stable/manual/types/#Mutable-Composite-Types-1):

To recap, two essential properties define immutability in Julia:

• An object with an immutable type is passed around (both in assignment statements and in function calls) by copying, whereas a mutable type is passed around by reference.

• It is not permitted to modify the fields of a composite immutable type.

While from the point of view of program behavior this may be true, the first point (in particular: "immutable types are passed by value in function calls") is wrong from a performance point of view:

``````julia> struct ImmutablePair{T}
           x::T
           y::T
       end

julia> import Base.sum

julia> function sum(p::ImmutablePair)
           return p.x + p.y
       end
sum (generic function with 17 methods)

julia> code_native(sum,(ImmutablePair{Int64},),:intel)
.text
Filename: REPL
push	rbp
mov	rbp, rsp
Source line: 2
mov	rax, qword ptr [rdi + 8]
pop	rbp
ret
nop	dword ptr [rax]
``````

Structs are passed by reference, which makes sense (at least for large structs).
This is even the case for structs that wrap a single primitive type:

``````julia> struct ImmutableWrapper{T}
           x::T
       end

julia> square(w::ImmutableWrapper)=w.x*w.x
square (generic function with 1 method)

julia> code_native(square,(ImmutableWrapper{Int64},),:intel)
.text
Filename: REPL
push	rbp
mov	rbp, rsp
Source line: 1
mov	rax, qword ptr [rdi]
imul	rax, rax
pop	rbp
ret
nop	dword ptr [rax]
``````

Wouldn’t it be better to write in the documentation:

• primitives are passed by value;
• immutable structs are passed by const reference;
• mutable structs are passed by reference?

I recently rewrote that bit of the documentation because it was misleading and not entirely correct. Julia guarantees reference semantics for all objects, regardless of mutability. This is known as “pass by sharing” for argument passing since “pass by reference” means that the callee can modify the binding of a variable in the caller, which is not possible in Julia (whereas it is possible in Fortran IIUC and for reference arguments in C++). For immutable objects, however, there is no way to tell the difference between value and reference semantics – mutation is the way you would tell the difference. So the compiler is free to copy immutable values instead of using a pointer to a single memory-allocated instance. But it may do either as it sees fit. For large immutable values, for example, passing them around by pointer is likely to be far more efficient than copying all the data around in registers and between stack frames. In the other direction, while the mental model for mutable objects should be that they are heap-allocated and passed around by pointer, if the compiler can figure out that this isn’t necessary, it may stack-allocate a mutable object or even skip allocating it entirely.
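A small sketch of that last point: for immutables, equal contents means identity, so there is nothing to observe that would distinguish a copy from a reference, whereas mutables have observable identity:

``````julia
struct Imm          # immutable: value semantics indistinguishable from reference semantics
    x::Int
end

mutable struct Mut  # mutable: identity is observable through mutation
    x::Int
end

p, q = Imm(1), Imm(1)
p === q             # true: two "instances" with equal contents are the same value

a = Mut(1)
b = a
b.x = 2
a.x                 # 2: a and b refer to the same object
``````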

As a general observation, people seem to want guarantees about how things are implemented. However, giving such guarantees – or worse, allowing the manner of implementation to leak into observable aspects of the language – is how one ends up with needlessly slow languages that cannot be optimized, because all the latitude the compiler has to optimize has been squandered. You really do not want us to make promises about how immutable or mutable objects are allocated and passed around because those promises are precisely what would prevent us from making code faster and more efficient in the future.

All that said, please carry on with the poking around at the internals here! It’s certainly a good thing to do to understand how things work.

No. `ref` becomes an alias for the object `p.x`. The memory address does not appear anywhere.

They actually aren’t. They are passed by value in memory.

@StefanKarpinski

Thank you for the detailed clarification.

To be clear, I do not need guarantees about how things are implemented, only that it works, is coherent, and is reasonably efficient. I agree that it is good to give the compiler more freedom to optimize. I just think that saying immutables are passed by value/copied is kind of misleading: a developer with a C++ background would then think this is inefficient for large structs and would use a mutable struct instead, when what he really wants is a const reference (this is actually what the Julia compiler is doing in my example: pass the pointer, but give a compilation error if the function tries to modify the data).

I am having some difficulty with this.
p.x does not exist by itself in the case of p being pair{Int64}, since p is then a compact structure holding two values rather than references.
Now when p.x is changed in setPair, it is modified in place in memory, but ref is not.
If ref were an alias for the object p.x, it would be modified too.

Now it is possible that what you describe is the Julia language model / the intermediate LLVM representation of the code (I didn't look into this) and that after optimization what actually happens is vastly different. Since the point of this exercise is to understand the efficiency of the generated code, I am more interested in figuring out where the actual copies and memory allocations (calls to malloc) happen than in the language model.

It’s referencing two objects, which are stored inline. Yes, these are the semantics, which aren’t about assembly code, but what you’ve stated above is actually a mix of semantics and implementation. If the semantics are already clear then you can ignore this part.

No. You assigned a new value to the field and didn’t mutate the integer object itself, so that can’t happen.

That’s not really true, and we ARE passing the struct by value; looking at the callee code like this really won’t tell you anything about copying. FWIW, there aren’t enough registers to hold an arbitrarily large struct, so passing by value for a big enough struct will always use memory; just because you see a pointer (or not) does not say anything about whether the object is passed by value.

Yes, you are right. I made a big mistake in my conclusion: I cannot draw conclusions about copying from the callee, since it is usually up to the caller to do the copying if needed (however, I would say the number of registers doesn’t matter for this, since the stack can store large structs).
Maybe I am reading you incorrectly, but it seems you are implying that an actual copy of the immutable struct is made and the routine receives a pointer to the copy (which I would think is placed on the stack frame of the caller, since it is local to the caller).
Considering that an immutable struct can contain mutable objects, it has to be a shallow copy.
But if there is an actual copy, then my question is: why is this copy even necessary? Or does the compiler’s optimization replace the copy by passing a reference to the original object?

I have noticed that there are many different usages of the term "pass by value" (e.g. Java "passes object references by value"). So to make it clearer, I am comparing here with C++.
In C++ you have the following possibilities for passing a variable (not exhaustive; I didn’t consider e.g. pointers and rvalue references):

pass by value: `function(Object x)`
This calls the copy constructor in the caller (or, if you are lucky enough to pass an rvalue and a move constructor is defined, the move constructor).
It is only recommended if you want a copy of the object to be modified locally in the callee while the original object in the caller is left untouched (obviously not what we want here, so this recommendation does not apply).
pass by reference: `function(Object& x)`
This is recommended if you want to modify the object.
pass by const reference: `function(const Object& x)`
This is recommended if you want to access the object in read-only mode.

C++ `const` is not the same as Julia immutability, because in Julia you can still mutate certain objects contained in an immutable struct.
For the case of an immutable struct, both a copy (shallow, so as not to copy the mutable content) and pass by reference would work, but pass by reference is cheaper.
Moreover, immutability is a property of the type, so it is still enforced in the callee; passing a reference to the actual object should not be a problem. I might be missing something important.

Thank you.

Correct. I’m not in a position to check the actual code, so what I’m stating here could be off, though I’m 98% sure it’s correct. What we do is pass a copy by pointer but tell LLVM that the copy won’t be mutated. This means that if LLVM thinks it’s OK, it can elide the copy. This should work reasonably well for local variables, but it will likely not be the case for fields. For something like `struct A; x::NTuple{10,Int}; end`, a function that does `f(a.x)` may not be able to elide the copy if it can’t prove that `a` is not mutated/accessed elsewhere. That’s why I would highly recommend you look into this from a higher level first, instead of looking only at the final assembly.
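To poke at this from the Julia side (a sketch; the names are just for illustration, and whether the copy is actually elided depends on the Julia/LLVM version):

``````julia
struct A
    x::NTuple{10,Int}
end

f(a::A) = sum(a.x)

# Uncomment to inspect whether a memcpy of the 80-byte tuple survives:
# code_llvm(f, (A,))

f(A(ntuple(identity, 10)))   # 55
``````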

For a single-threaded program, the only significance of a `const` type in C++ is dispatch (`const` storage of course has other significance). It is NOT undefined behavior to cast the `const` away.
