Understanding generated assembly for simple loop

llvm

#1

I am trying to understand how simple julia functions are converted into assembly.
In this case, I am interested in summing the integer serie up to some argument:

       function foo(n)
           res = zero(n)
           for i ∈ 1:n
                res+=i
           end
       end

Calling:

julia> code_native(foo,(Int64,),:intel)

the result is

	.text
Filename: REPL[12]
	push	rbp
	mov	rbp, rsp
	xor	eax, eax
Source line: 3
	test	rdi, rdi
	jle	L32
	lea	rax, [rdi - 1]
	lea	rcx, [rdi - 2]
	mul	rcx
	shld	rdx, rax, 63
	lea	rax, [rdx + 2*rdi - 1]
Source line: 6
L32:
	pop	rbp
	ret
	nop	word ptr cs:[rax + rax]

The cool thing is that LLVM has replaced the for loop by the closed form solution: foo(N)=N*(N+1)/2
But I am puzzled by the line:

	shld	rdx, rax, 63

Shifts rdx 63 bits to the left while filling the 63 opened bit positions with the 63 most significant bits of rax.
If the least significant bit of rdx is 0, this should do rdx=rax/2 which is good.
However rdx is unitialized and if its least significant bit is set to 1, the result of that operation should be:
rdx=rax/2 + (2^64-1)
which would be wrong.

I am probably missing something obvious.

Thank you.


#2

You explicitly asked for intel flavor so I would assume you know that rdx is the output…

Edit: Sorry I thought you were asking about the direction of the operation.


#3

According to http://www.felixcloutier.com/x86/MUL.html the mul instruction stores into rdx

That’s why I hate x86…


#4

Thank you.
This explains it.

I cannot figure out how to mark your post as the solution.


#5

For information, the purpose of the disassembly exercise was trying to figure out what the fact that julia integer are immutable meant for generated code (if it had a performance impact such as too many allocations).
In this particular example, I was interested in the loop counter.
However LLVM is smart and there is no loop counter in the generated code.

I get the same generated code for:


g(i)=i

function foo(n)
           res = zero(n)
           for i ∈ 1:n
                res+=g(i)
           end
end

ie function g get inlined.

Similarly if I sum the squares, cubes or 4th power (g(i)=i^4), LLVM builds the closed form solution and again there is no loop counter.

So instead I tried this function:

function foobar(n,a)
           res = zero(n)
           for i ∈ 1:n
             res+=a[i]
           end
           res
end
julia>code_native(foobar,(Int64,Array{Int64,1}),:intel)
	.text
Filename: REPL[101]
	xor	eax, eax
Source line: 3
	test	rdi, rdi
	jle	L49
Source line: 4
	mov	r8, qword ptr [rsi]
	mov	rdx, qword ptr [rsi + 24]
	xor	ecx, ecx
	xor	eax, eax
	nop	word ptr cs:[rax + rax]
L32:
	cmp	rcx, rdx
	jae	L50
	add	rax, qword ptr [r8 + 8*rcx]
Source line: 3
	inc	rcx
	cmp	rdi, rcx
	jne	L32
Source line: 6
L49:
	ret
L50:
	push	rbp
	mov	rbp, rsp
Source line: 4
	mov	rdx, rsp
	lea	rax, [rdx - 16]
	mov	rsp, rax
	inc	rcx
	mov	qword ptr [rdx - 16], rcx
	movabs	rcx, jl_bounds_error_ints
	mov	edx, 1
	mov	rdi, rsi
	mov	rsi, rax
	call	rcx
	nop

RAX stores the result res.
It is initialized to zero with XOR EAX, EAX (though I don’t know why this instruction is done twice)
RDI (1st argument) is the input n.
The array a is passed via RSI (2nd argument).
From the workflow, we can infer:
RSI must contain the address of the raw float64 array data which is stored in R8
address RSI+24 must contain the array size which is stored in RDX.
RCX is the counter i-1.
initialized to zero with XOR ECX, ECX
incremented with INC RCX
the computation is done by the line

add	rax, qword ptr [r8 + 8*rcx]

The counter is compared to both n for the loop (label L32) and RDX for the array out of bounds exception (label L50)
Label L49 is used for the fast exit in case n<=0.

I guess it should not be surprising that the loop counter are mutating registers since LLVM has full freedom to optimize the local variable but I wanted to understand the simple things first. Next time I will try to see what happens in function that try to modify integers stored in struct or arrays passed as arguments (there could be multiple external references to those integers).

Interestingly if I use the @inbounds macro:

function foobar(n,a)
           res = zero(n)
           for i ∈ 1:n
             @inbounds res+=a[i]
           end
           res
end

I get:

julia>code_native(foobar,(Int64,Array{Int64,1}),:intel)
	.text
Filename: REPL[107]
	pushq	%rbp
	movq	%rsp, %rbp
	xorl	%eax, %eax
Source line: 3
	testq	%rdi, %rdi
	jle	L156
Source line: 4
	movq	(%rsi), %r8
	xorl	%eax, %eax
	movl	$1, %edx
Source line: 3
	cmpq	$4, %rdi
	jb	L130
	movq	%rdi, %rsi
	andq	$-4, %rsi
	xorl	%eax, %eax
	movq	%rdi, %rcx
	andq	$-4, %rcx
	movl	$1, %edx
	je	L130
	movq	%rsi, %rdx
	orq	$1, %rdx
	leaq	16(%r8), %rax
	pxor	%xmm0, %xmm0
	movq	%rcx, %rsi
	pxor	%xmm1, %xmm1
	nopl	(%rax)
Source line: 4
L80:
	movdqu	-16(%rax), %xmm2
	movdqu	(%rax), %xmm3
	paddq	%xmm2, %xmm0
	paddq	%xmm3, %xmm1
Source line: 3
	addq	$32, %rax
	addq	$-4, %rsi
	jne	L80
Source line: 4
	paddq	%xmm0, %xmm1
	pshufd	$78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
	paddq	%xmm1, %xmm0
	movd	%xmm0, %rax
	cmpq	%rdi, %rcx
Source line: 3
	je	L156
L130:
	incq	%rdi
	subq	%rdx, %rdi
	leaq	-8(%r8,%rdx,8), %rcx
	nopl	(%rax)
Source line: 4
L144:
	addq	(%rcx), %rax
Source line: 3
	addq	$8, %rcx
	decq	%rdi
	jne	L144
Source line: 6
L156:
	popq	%rbp
	retq
	nop

No more calls to jl_bounds_error_ints
I was expecting the same code as before without the 2 lines with the RDX register (array size) and without the code following label L50 but the code is vastly different. There is much more code in total: looks like it is using the XMM registers and SIMD instructions to generate faster code so you get more performance increase than just removing bounds checking.


#6

Loop example with initializing integer array:

function foobar(n,a)
           for i ∈ 1:n
             a[i] = i
           end
end
julia> code_native(foobar,(Int64,Array{Int64,1}),:intel)
	.text
Filename: In[6]
Source line: 2
	test	rdi, rdi
	jle	L37
Source line: 3
	mov	r8, qword ptr [rsi]
	mov	rcx, qword ptr [rsi + 24]
	xor	edx, edx
	nop
L16:
	lea	rax, [rdx + 1]
	cmp	rdx, rcx
	jae	L38
	mov	qword ptr [r8 + 8*rdx], rax
Source line: 2
	cmp	rdi, rax
	mov	rdx, rax
	jne	L16
Source line: 3
L37:
	ret
L38:
	push	rbp
	mov	rbp, rsp
	mov	rdx, rsp
	lea	rcx, [rdx - 16]
	mov	rsp, rcx
	mov	qword ptr [rdx - 16], rax
	movabs	rax, jl_bounds_error_ints
	mov	edx, 1
	mov	rdi, rsi
	mov	rsi, rcx
	call	rax
	nop

Very similar to the previous code.
RCX is now array size.
Counter is shared between RDX and RAX
R8 is the pointer to the raw float64 array.

Now with an integer struct:

mutable struct pair{T}
       x::T
       y::T
end

function setPair(p,x,y)
    p.x=x
    p.y=y
    return
end   
julia>code_native(setPair,(pair{Int64},Int64,Int64),:intel)
	.text
Filename: In[27]
	push	rbp
	mov	rbp, rsp
Source line: 2
	mov	qword ptr [rdi], rsi
Source line: 3
	mov	qword ptr [rdi + 8], rdx
Source line: 4
	pop	rbp
	ret
	nop	dword ptr [rax]

This code is so compact. I was expecting an integer allocation since I thought there could be an immutable integer variable referencing the same address as pair.x or pair.y.

As in:

julia>p=Pair(1,2)
Pair{Int64}(1,2)
julia>ref=p.x
1
julia>setPair(p,0,0)
julia>ref
1
julia>p.x
0

Now I realize that the allocation (+ copy) must happen when the immutable integer ref is created from the mutable struct p.

Thinking about it it makes sense that the struct content should be mutated in place for the struct to be packed optimally like an array of Int64 instead of storing references to both x and y.

This code_native method is really cool!


#7

I’m not sure if you are asking any questions or just using this thread as a blog post but if you want any help, you should specify what you mean by “integer allocation”…


#8

Sorry for the confusion. I am just writing down the results of my simple experiments in case they can be useful to someone else.

By integer allocation, I meant that my initial understanding was that when you do:

ref = p.x

ref becomes an alias for the data at the same memory location as p.x (this is the case in python).

So in routine setPair:

function setPair(p,x,y)
    p.x=x
    p.y=y
    return
end 

We assume x and y are int64 and thus are passed by value (according to julia doc and also what can be seen from experiments above) in register (or stack depending on argument position and calling convention).
If the memory location for the integer aliased by p.x and ref were shared, it would not possible to change p.x in place because ref is an alias for an immutable integer. Thus the only possibility would be to allocate a new memory location for the integer and copy the content of the register and change the x reference in p to point to this new location.
But because julia is packing Pair{Int64} compactly like a C struct (which is a good thing), p.x has to be mutated in place. Thus the copy must occur when doing:

ref = p.x

This is different from python:

In [1]: class Pair:
   ...:     def __init__(self,x,y):
   ...:         self.x = x
   ...:         self.y = y
   ...:         

In [2]: p=Pair(1,2)

In [3]: p.x = 32

In [4]: import sys

In [5]: sys.getrefcount(32)
Out[5]: 178

In [6]: ref = p.x

In [7]: sys.getrefcount(32)
Out[7]: 179

In [8]: p.x=3

In [9]: sys.getrefcount(32)
Out[9]: 178

In [14]: ref
Out[14]: 32

The python way can only work when the struct p is holding a reference to x instead of the value.
I am thinking it could also work if the struct is immutable but this would make garbage collection really complex because ref would hold part of the memory of p…

In trying to understand the behavior, I came across this paragrah (from https://docs.julialang.org/en/stable/manual/types/#Mutable-Composite-Types-1 )

To recap, two essential properties define immutability in Julia:

  • An object with an immutable type is passed around (both in assignment statements and in function calls) by copying, whereas a mutable type is passed around by reference.

  • It is not permitted to modify the fields of a composite immutable type.

While from the point of view of program behavior this may be true, the first point (in particular: immutable type are passed by value in function calls is wrong (from the performance point of view):

julia> struct ImmutablePair{T}
              x::T
              y::T
       end

julia> import Base.sum

julia> function sum(p::ImmutablePair)
               return p.x+p.y
       end
sum (generic function with 17 methods)

julia> code_native(sum,(ImmutablePair{Int64},),:intel)
	.text
Filename: REPL[4]
	push	rbp
	mov	rbp, rsp
Source line: 2
	mov	rax, qword ptr [rdi + 8]
	add	rax, qword ptr [rdi]
	pop	rbp
	ret
	nop	dword ptr [rax]

Struct are passed by reference which makes sense (at least for large struct).
This is even the case for struct which are wrapping a single primitive type:

julia> struct ImmutableWrapper{T}
            x::T
       end

julia> square(w::ImmutableWrapper)=w.x*w.x
square (generic function with 1 method)

julia> code_native(square,(ImmutableWrapper{Int64},),:intel)
	.text
Filename: REPL[16]
	push	rbp
	mov	rbp, rsp
Source line: 1
	mov	rax, qword ptr [rdi]
	imul	rax, rax
	pop	rbp
	ret
	nop	dword ptr [rax]

Wouldn’t it be better to write in the documentation:

  • primitive are passed by values.
  • immutable struct are passed by const reference.
  • mutable struct are passed by reference.

#9

I recently rewrote that bit of the documentation because it was misleading and not entirely correct. Julia guarantees reference semantics for all objects, regardless of mutability. This is known as “pass by sharing” for argument passing since “pass by reference” means that the callee can modify the binding of a variable in the caller, which is not possible in Julia (whereas it is possible in Fortran IIUC and for reference arguments in C++). For immutable objects, however, there is no way to tell the difference between value and reference semantics – mutation is the way you would tell the difference. So the compiler is free to copy immutable values instead of using a pointer to a single memory-allocated instance. But it may do either as it sees fit. For large immutable values, for example, passing them around by pointer is likely to be far more efficient than copying all the data around in registers and between stack frames. In the other direction, while the mental model for mutable objects should be that they are heap-allocated and passed around by pointer, if the compiler can figure out that this isn’t necessary, it may stack-allocate a mutable object or even skip allocating it entirely.

As a general observation, people seem to want guarantees about how things are implemented. However, giving such guarantees – or worse, allowing the manner of implementation to leak into observable aspects or the language – is how one ends up with needlessly slow languages that cannot be optimized because all latitude the compiler has to optimize has been squandered. You really do not want us to make promises about how immutable or mutable objects are allocated and passed around because those promises are precisely what would prevent us from making code faster and more efficient in the future.

All that said, please carry on with the poking around at the internals here! It’s certainly a good thing to do to understand how things work :slight_smile:


#10

No. ref become an alias for the object p.x. Memory address does not appear anywhere.

They actually aren’t. They are passed by value in memory.


#11

@StefanKarpinski

Thank you for the detailed clarification.

To be clear I do not need guarantees about how things are implemented only that it works and is coherent and that it is somehow efficient. I agree that it is good to give more freedom for optimization. I just think that saying that immutable are passed by values/copied is kind of misleading: a developper with a C++ background would then think this is inefficient for large struct and would use a mutable struct instead but what he really wants is a const reference (this is actually what the julia compiler is doing in my example: pass the pointer but give a compilation error if the function tries to modify the data).


#12

I am having some difficulty with this.
p.x does not exist by itself in the case of p being Pair{Int64} since p is then a compact structure which is not holding any references but two values.
Now when p.x is changed in setPair, it is modified in place in memory, but ref is not.
If ref was an alias for the object p.x it would be modified too.

Now it is possible that what you describe is the julia language model/the intermediate LLVM representation of the code (I didn’t look into this) and that after optimisation what is actually happening is vastly different. Since the point of this exercise is actually to understand the efficiency of the generated code, I am actually interested in figuring out where the actual copies and memory allocations (calls to malloc) happen more than in the language model for this particular exercise.


#13

It’s referencing two objects, which are stored inline. Yes, this is the semantics which isn’t about assembly code but what you’ve stated above is actually a mix about the semantic and implementation. If the semantic part is already clear then you can ignore this part.

No. you assigned to the field and didn’t mutate the object so that can’t happen.

That’s not really true and we ARE passing the struct by value and looking at the callee code like this really won’t tell you anything about copying. FWIW, there isn’t enough registers to hold arbitrarily large struct so passing by value for big enough struct will always use memory, just because you see a pointer or not does not say anything about if the object is passed by value or not.


#14

Yes you are right. I made a big mistake in my conclusion: I cannot conclude on the copying from the callee since it is usually up to the caller to do the copying if needed (however I would say the number of registers doesn’t matter for this since the stack can store large struct).
Maybe I am reading you incorrectly but it seems you are implying that an actual copy of the immutable struct is made and the routine is receiving a pointer to the copy (which I would think is likely placed on the stack frame of the caller since it is local to the caller).
Considering the fact that the immutable struct can contain mutable objects, it has to be a shallow copy.
But if there is an actual copy, then my question is, why is this copy even necessary? or is the compiler optimization process replacing the copy by a passing a reference to the original object?

I have noticed that there is many different usage of the term pass by value (eg Java is “passing object reference by value”). So to make it clearer I am comparing here with C++.
In C++ you have the following possibilities to pass a variable (not exhaustive I didn’t consider eg pointers and rvalue references):

pass by value: function(Object x)
This calls the copy constructor in the caller (or if you are lucky to pass in a rvalue and you have a move constructor defined, this calls the move constructor).
It is only recommended if you want to get a copy of the object to be modified locally in the callee while the original object in the caller must be left untouched (this is obviously not what we want here so this recommendation does not apply).
pass by reference: function(Object & x)
This is recommended if you want to modify the object.
pass by const reference: function(const Object& x)
This is recommended if you want to access the object in read only mode.

C++ const is not the same as Julia immutable because in Julia you can still mutate certain objects contained in the immutable struct.
For the case of an immutable struct, both a copy (up to a certain level only not to copy the mutable content ) and pass by reference would work, but the pass by reference is cheaper.
Moreover the immutability is a property of the type so it is still enforced in the callee so passing the reference to the actual object should not be a problem. I might be missing something important.

Thank you.


#15

Correct. I’m not in a position to check the actual code so what I’m stating here could be off though I’m 98% sure it’s correct. What we do is to pass a copy by pointer but tell LLVM that the copy won’t be mutated. This means that if LLVM think it’s ok it can elide the copy. This should work reasonably well for local variables but it will likely not be the case for fields. For sth like struct A; x::NTuple{10,Int}; end a function that does f(a.x) may not be able to elide the copy if it can’t proof that a is not mutated/accessed elsewhere. That’s why I would highly recommend you to look into this from a higher level first instead of looking at only the final assembly.

For single thread program, the only significance of const type in C++ is dispatch. (const storage of course has other significance). It’s NOT a undefined behavior to cast the const out.