Julia's assignment behavior differs from Fortran?

Maybe this will help. Note that y1 is just a pointer to the M object, and so is y2. Consequently, when you modify the inside of y2 using the name y2, since the name y1 points to the same thing, y1 also looks like it's been modified.

In contrast, you can't modify x2.x because it's not a mutable struct. The "two names for the same object" logic doesn't work.

julia> mutable struct M
       x
       end

julia> struct P 
       x 
       end

julia> y1 = M(1);

julia> y2 = y1
M(1)

julia> y2.x = 5;

julia> y1
M(5)

julia> x1 = P(1)
P(1)

julia> x2 = x1;

julia> x2.x = 5
ERROR: setfield! immutable struct of type P cannot be changed

Ok thanks for the example pdeffebach, I will check it out.

Hunh? It definitely still works. You still have two names for the same object — but the object cannot change.

Mixing up mutation and naming is part of why I think this is so confusing to newcomers. They’re totally independent concepts. Naming works exactly the same, regardless of mutability.
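For example, a minimal REPL sketch reusing the M and P definitions from the example above — two names refer to the same object in both cases, only the ability to mutate differs:

julia> x1 = P(1); x2 = x1;

julia> x1 === x2   # same object, two names (immutable)
true

julia> y1 = M(1); y2 = y1;

julia> y1 === y2   # same object, two names (mutable)
true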

8 Likes

Yes, for sure. By "this logic" I only meant the thinking "if I change this variable, it shows up in y1".

This reminds me of something I read about Python years ago.

Someone compared a python object to a cat that travels between two houses. In one house the cat is called Tigger and in the other it’s called Fluffy. The cat doesn’t know its name.

10 Likes

I may have read this wrong, but I think some part of the confusion may be an expectation that code looks and works more like math. Let’s say this little bit here was just math on paper:

xNew = xOld + 1
xOld = xNew

This would indeed imply an impossible equation, xOld = xOld + 1. But = does not mean equality in Julia (or Python, as mentioned earlier); xOld = xNew is an assignment, not an equation. I don't know Fortran so I can't say for sure, but maybe Fortran was designed to look and work a lot more like math.
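In Julia the same two lines are perfectly fine, because = just rebinds a name to a new value (a minimal sketch):

julia> xOld = 1;

julia> xNew = xOld + 1;

julia> xOld = xNew;   # no equation is being solved; the name xOld now refers to 2

julia> xOld
2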
A lot of these concepts cropping up (assignment, references, mutability) may be confusing at first but are good to learn. And people here are very willing to help out.

In Fortran you would have declared xOld and xNew at the beginning of the code, giving the sensation that each is a variable with a specific position in memory, without any ambiguity. Therefore,

xNew = xOld

copies the value of xOld into xNew.
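For comparison, a small Julia sketch (my own toy example): with arrays, xNew = xOld does not copy anything, both names refer to the same array, and an explicit copy is needed to get the Fortran-like behavior:

julia> xOld = [1, 2, 3];

julia> xNew = xOld;           # no copy: both names now refer to the same array

julia> xNew[1] = 99;

julia> println(xOld)          # the change is visible through the other name
[99, 2, 3]

julia> xNew = copy(xOld);     # an explicit copy gives the Fortran-like behavior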

Now that I am immersed in this world of labels, etc., I understand that what we think is going on might not be what the program actually does when it is compiled and executed (the position in memory of xNew might not exist in practice), but that is not relevant for the user.

Concerning the mutable vs. immutable thing, from the perspective of a former Fortran user: in Fortran everything seems to have its place in memory, as I said, and everything seems to be mutable, although that might not be true in practice. Thus, there is an abstraction layer there that must be overcome. I hope what I say in what follows is not too wrong.

One learns in this process that values can "exist" in different ways. A variable may occupy a place in one type of memory, the type we understood to exist, which is called the "heap". Variables on the heap have an address for the position in memory they occupy and, thus, the value they hold can be changed by modifying the content of that memory position. This is where mutable variables are stored.

In Fortran, from a user's perspective, everything seems to be on the "heap" (although that might not be true; the compiler will decide), such that one can program as if every variable had an address in memory and its value could be modified by changing the content of that position in memory. Thus, everything seems to be mutable in Fortran. Additionally, labels are assigned forever to the same positions in memory.

Now we learn that some values might live in other types of memory, the "stack" and (I guess) the processor cache. These are much faster to work with than the "heap", and if a value can be placed there your code will be faster. However, values occupying these memory positions do not have an address in the usual sense. You cannot change the value associated with such a position because it is not persistent; it will be discarded as soon as possible. Even if the value will be used later, it might be copied somewhere else on the stack without our knowledge if that turns out to be the best strategy for performance. We do not control where these values are stored and, therefore, we cannot assign different values to these "variables", because that does not really make sense: they are only values stored somewhere in fast-access memory.

Thus we learn that in a loop like:

s = 0.
for i in 1:3
   x = 2*i
   s = s + x
end

x might not be allocated in memory at all. It might occupy a temporary place in the fast stack memory or even only in the processor cache. In general we don't know what is going on with x, and we should not care: the people who implemented the compilers are much smarter than us and most likely made the best choice. Perhaps it will be stored in the slow "heap" memory, with an address, particularly if it were a huge array instead of a simple scalar, but it doesn't matter. (In this case it is probably just inlined, but the idea is the same.)

A Fortran user is surprised that a loop like that does not allocate anything. We learned that everything has its place in memory, even the loop counter, so that code should at least allocate some values. Yet now we discover that these allocations "do not count", because they are fast allocations in these non-addressed types of memory.

But we have to learn that, for the compiler to have the freedom to choose what to do with x, the content of x cannot change. Thus, it must be immutable. In the loop above, it doesn't even make sense to call x the same variable at each loop iteration. It is just a temporary value assigned to some fast memory position that will be used and discarded.

Therefore, if we write

x = 1 
x = 2

the two x are probably just two completely independent values stored in these fast memories. Both the "first x" and the "second x" are immutable. Actually, what is immutable are the integer values 1 and 2, and x is only a temporary label for one or the other of these values. The first x will be discarded when convenient, without our knowing where it was stored, if it was stored at all.

Yet, if we write

x = Vector{Int}(undef,1)
x[1] = 1
x[1] = 2

we are saying that, for some reason, we want to access the same position in memory repeatedly, so this must be stored in the slower heap memory, where things have real addresses. This x is a mutable object; you can explicitly change the content associated with the position it occupies in memory.
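One can even ask Julia directly whether a value is mutable (a small check I find helpful; ismutable needs Julia ≥ 1.5, older versions have isimmutable instead):

julia> ismutable(1)                       # an Int is an immutable value
false

julia> ismutable(Vector{Int}(undef, 1))   # a Vector is a mutable, heap-backed container
true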

Later we learn that vectors can also be immutable (why not? If a number can be stored in these fast memories, why not a bunch of numbers?). We can use StaticArrays, whose small vectors behave the same as any other immutable value, like a number. This means that:

julia> using StaticArrays

julia> function f()
         s = 0.
         for i in 1:1000
           x = SVector{3,Float64}(i, sqrt(i), i^2)
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
f (generic function with 1 method)

julia> f()
3.343550974558874e8

julia> @allocated f()
0

Wait, that function, which generates 1000 vectors of dimension 3, does not allocate anything? Indeed, it doesn't, because these static arrays are immutable, so they only exist in the fast, temporary memory positions. Knowing this allows a bunch of very cool code optimizations, and a very pretty syntax if you are dealing with particle simulations. For instance, you can do:

julia> x = [ SVector{3,Float64}(1,1,1) for i in 1:3 ]; # positions of 3 particles

julia> function update_positions!(x)
         for i in 1:length(x)
           y = 2*x[i] # whatever needed
           x[i] = y 
         end
       end
update_positions! (generic function with 1 method)

julia> update_positions!(x)

julia> x
3-element Array{SArray{Tuple{3},Float64,1,3},1}:
 [2.0, 2.0, 2.0]
 [2.0, 2.0, 2.0]
 [2.0, 2.0, 2.0]

julia> @allocated update_positions!(x)
0

The update_positions! function is mutating the elements of x (x is mutable), but the elements of x are themselves immutable static vectors. That is, the line y = 2*x[i] is just creating a new static vector, and x[i] = y is not necessarily modifying the values in the memory positions of the elements of x as we would think (or it might be; that is just not your problem), and all of that does not involve any access to the slow heap memory of the computer. Thus you can deal with a small vector as fast as you can deal with a scalar.

The possibilities for improving the performance of a numerical code increase. I have been able to write faster code in Julia than in Fortran now, but that depends on adapting to this new way of thinking and to the new possibilities involved.

10 Likes

Just to add one thing. There is nothing mysterious about StaticArrays. They are convenient immutable structures, which you could have defined yourself, with the same allocation results:

julia> struct P
         x :: Float64
         y :: Float64
         z :: Float64
       end

julia> function update_positions!(x)
         for i in 1:length(x)
           y = P( 2*x[i].x, 2*x[i].y, 2*x[i].z )
           x[i] = y   
         end
       end
update_positions! (generic function with 1 method)

julia> x = [ P(1.0,1.0,1.0) for i in 1:100 ];

julia> update_positions!(x);

julia> @allocated update_positions!(x)
0

3 Likes

The fact that x = , x .= and x[i] = mean totally different things is quite unintuitive, and I remember being confused by this when I started Julia. This doesn't really seem to be covered in the docs, at least not in a place a beginner is going to look. Might it be a good idea to add a small section on arrays? The docs on "noteworthy differences from MATLAB" say "Julia arrays are not copied when assigned to another variable. After A = B, changing elements of B will modify A as well.", which is good but might be a bit more explicit? Also, should there be a "noteworthy differences from Fortran"?
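Something like this minimal sketch (my own example) is what I wish the docs spelled out:

A = [1, 2, 3]
B = A           # `=` only binds another name to the same array
A .= 0          # `.=` mutates the existing array in place, so B sees [0, 0, 0] too
A[1] = 7        # `A[i] =` mutates one element in place, so B sees [7, 0, 0]
A = [4, 5, 6]   # `=` rebinds the name A to a brand new array; B still sees [7, 0, 0]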

6 Likes

I think I understand what is going on under the hood based on this discussion. Can someone confirm that my comments are accurate?

x = [2]  # x points to memory location m1
y = x    # y points to memory location m1
x = [3]  # x points to memory location m2; y still points to m1
z = x    # z points to memory location m2
x[1] = 4 # m2 changes value in place, affecting all variables that point there

julia> println(x, y, z)
[4][2][4]
6 Likes

There is an attempt to build that list here: Noteworthy differences from Fortran

1 Like

= is not an equality test, but an assignment. Confusing, perhaps. But entirely consistent across many languages.

1 Like

When I learned that the syntax x[i] = v is converted to setindex!(x, v, i) in Julia, I didn't need to think too much about what is mutable or not; it is pretty clear that something is being set to a new value. The syntax x = v is a change of name, as others have already pointed out.
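For instance (a minimal sketch):

julia> x = [10, 20];

julia> setindex!(x, 99, 1);   # this is what x[1] = 99 calls under the hood

julia> println(x)
[99, 20]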

2 Likes

FWIW, Fortran compilers normally don’t have a problem putting all the arrays you’re mutating on the stack. For example, see the gfortran flag -fstack-arrays.
In Julia, StaticArrays.MArrays will also generally be placed on the stack, so long as they do not escape the function creating them or get passed as an argument to a function that isn’t inlined.

A feature request I have in Julia (maybe I’ll make a PR) is support for alloca, so it’s easier to stack allocate memory that’s freed when the function returns, but that you can pass as an argument to non-inlined functions. llvmcall doesn’t work for this, because the memory is immediately freed when the llvmcall returns, not when the Julia function calling llvmcall returns.

julia> using StaticArrays, BenchmarkTools

julia> function f()
         s = 0.
         for i in 1:1000
           x = SVector{3,Float64}(i, sqrt(i), i^2)
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
f (generic function with 1 method)

julia> function g()
         s = 0.
         for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
g (generic function with 1 method)

julia> @btime f()
  2.605 μs (0 allocations: 0 bytes)
3.343550974558874e8

julia> @btime g()
  2.605 μs (0 allocations: 0 bytes)
3.343550974558874e8

However, SVectors are easier for the compiler; using the MVector can prevent some optimizations; see what happens when we add @fastmath:

julia> @fastmath function f()
         s = 0.
         for i in 1:1000
           x = SVector{3,Float64}(i, sqrt(i), i^2)
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
f (generic function with 1 method)

julia> @fastmath function g()
         s = 0.
         for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
g (generic function with 1 method)

julia> @btime f()
  744.343 ns (0 allocations: 0 bytes)
3.3435509745588744e8

julia> @btime g()
  2.604 μs (0 allocations: 0 bytes)
3.343550974558874e8

Suddenly f (using SVectors) is much faster, but the speed of g() is almost unchanged.

But my point here is that just because the MVector is mutable, and getindex and setindex! are defined using loads and stores to/from its pointer, doesn't mean it has to be heap allocated.
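Here is a small sketch of the escape condition mentioned above (my own toy example; the exact allocation counts can depend on the Julia version and on inlining):

using StaticArrays

function mvec_no_escape()
    x = MVector{3,Float64}(undef)
    x[1] = 1.0; x[2] = 2.0; x[3] = 3.0
    x[1] + x[2] + x[3]        # only the sum leaves the function
end

function mvec_escapes()
    x = MVector{3,Float64}(undef)
    x[1] = 1.0; x[2] = 2.0; x[3] = 3.0
    x                         # the MVector itself is returned, so it escapes
end

# run each once first so compilation is not counted by @allocated
@allocated mvec_no_escape()   # expected 0: the MVector can stay on the stack
@allocated mvec_escapes()     # expected > 0: an escaping MVector must be heap allocated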

EDIT:
The simplest way to speed up g is to use @inbounds:

julia> @fastmath function h()
         s = 0.
         @inbounds for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
h (generic function with 1 method)

julia> @btime h()
  744.317 ns (0 allocations: 0 bytes)
3.3435509745588744e8

A less simple way is to manually unroll the inner loop:

julia> @fastmath function g()
         s = 0.
         for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           Base.Cartesian.@nexprs 3 j -> begin
             s = s + x[j]
           end
         end
         s
       end
g (generic function with 1 method)

julia> @btime g()
  744.324 ns (0 allocations: 0 bytes)
3.3435509745588744e8

I guess it’s close to a limit where the compiler gives up optimizing it.

EDIT:
It just occurred to me that AVX512 is probably needed for the @fastmath versions of this function to be fast, because AVX512 is needed for SIMD Int64 -> Float64 conversion.
If you don’t have AVX512, I’d recommend

julia> @fastmath function h2()
         s = 0.; i = 0.0
         @inbounds for _ in 1:1000
           x = MVector{3,Float64}(undef);
           i += 1
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
h2 (generic function with 1 method)

julia> @btime h2()
  711.417 ns (0 allocations: 0 bytes)
3.3435509745588744e8
5 Likes

I will post an example of the use of immutable types to compare with a Fortran routine. I do this to learn things myself, so why not share… I think this is an example of how the Julia syntax is nice once we get used to it:

Problem: Given the positions of N particles in 3D, update the positions of a subset of size M of these particles defined in a list.

Using Julia, I will first define an immutable struct to contain the coordinates, and a function that builds an updated particle with the desired transformation (in this case, multiplying the coordinates by 2):

struct P{T}
  x :: T
  y :: T
  z :: T
end
update_position( x :: P ) = P( 2*x.x, 2*x.y, 2*x.z )  

Let us generate a vector of these particles:

n = 1_000_000
x = [ P(1.,1.,1.) for i in 1:n ]

For the Fortran routine, the natural way will be to create a matrix, so we do:

y = ones(3,n)

We will create a list of indexes of the particles that should have their positions updated, with:

m = 1_000
list = rand(1:n,m)

Then, the Julia function that updates the positions of these m particles is:

function f1!(x,list)
  for i in list
    x[i] = update_position(x[i])
  end
end    

The Fortran routines to do the same job are:

subroutine update_position(i,n,x)
  implicit none
  integer :: j, n
  integer*8 i
  double precision :: x(3,n)
  do j = 1, 3
    x(j,i) = 2*x(j,i)
  end do
end subroutine update_position

subroutine f1(n,x,m,list)
  implicit none
  integer :: i, j, n, m
  integer*8 :: list(m)
  double precision :: x(3,n)
  do i = 1, m
    call update_position(list(i),n,x)
  end do
end subroutine f1  

I think it is pretty clear that the Julia code is much simpler and clearer. Bugs turn out to be easier to find in the end, even with the more permissive syntax of Julia. Of course we can suggest changes to both codes, but in particular if we want to keep the function that updates the position of a single particle separate from the one that updates the list, the Fortran code will not be much simpler than that.

Finally, the Julia function turns out to be even slightly faster than the Fortran routine:

 Julia with P:   1.869 μs (0 allocations: 0 bytes)
 Fortran:   2.292 μs (0 allocations: 0 bytes)

(although I know that benchmarks are always complicated, and a disclaimer: if we wanted to update the positions of all particles, which are contiguous in memory, then we should use a matrix of particles in row order, allowing loop vectorization. The case here is that the positions to be updated are not contiguous in memory).

The complete code is here:

Summary
using Test
using BenchmarkTools

# Using user defined immutable type 
struct P{T}
  x :: T
  y :: T
  z :: T
end
update_position( x :: P ) = P( 2*x.x, 2*x.y, 2*x.z )

function f1!(x,list)
  for i in list
    x[i] = update_position(x[i])
  end
end

# Calling Fortran routine:
"""
subroutine update_position(i,n,x)
  implicit none
  integer :: j, n
  integer*8 i
  double precision :: x(3,n)
  do j = 1, 3
    x(j,i) = 2*x(j,i)
  end do
end subroutine update_position

subroutine f1(n,x,m,list)
  implicit none
  integer :: i, j, n, m
  integer*8 :: list(m)
  double precision :: x(3,n)
  do i = 1, m
    call update_position(list(i),n,x)
  end do
end subroutine f1   

Compiled with:
gfortran -O3 mutate.f90 -shared -o mutate.so
"""
f2!(x,list) = ccall((:f1_,"mutate.so"),Nothing,
                    (Ref{Int32},         Ref{Float64}, Ref{Int32},          Ref{Int64}),
                     Int32(size(x,2)),   x,            Int32(size(list,1)), list)     


# Running

n = 1_000_000
m = 1_000
list = rand(1:n,m)

# With julia immutable struct
x = [ P(1.,1.,1.) for i in 1:n ]
f1!(x,list)

# Fortran Matrix
y = ones(3,n) 
f2!(y,list)

# Testing if the fortran code returns the same thing
@test begin
  t = true
  for i in 1:n
    if x[i].x != y[1,i] || x[i].y != y[2,i] || x[i].z != y[3,i] 
      t = false
    end
  end
  t
end

setup = ( x = [ P(1.,1.,1.) for i in 1:n ] )
print(" Julia with P: "); @btime f1!(x,$list) setup = setup evals = 1

setup = ( y = ones(3,n) )
print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1

                                                 
3 Likes

On computers with AVX, you can improve performance by padding out P to be 256 bits, i.e., make it 4D but just ignore the last component.
Additionally, gfortran generally needs the compiler flag -fno-semantic-interposition for performance. Without it, it will not inline functions.

I also added @inbounds to the Julia version to make it equivalent to the Fortran.

Code
module mutate

  use ISO_C_BINDING

  contains
    subroutine update_position(x)
      implicit none
      real(C_double), dimension(:), intent(inout) :: x
      x = 2*x
    end subroutine update_position

    subroutine f1(n,x,m,list) BIND(C, name="f1")
      implicit none
      integer(C_long) :: i, n, m
      integer(C_long) :: list(m)
      real(C_double) :: x(3,n)
      do i = 1, m
         call update_position(x(:,list(i)))
      end do
    end subroutine f1

    subroutine f2(n,x,m,list) BIND(C, name="f2")
      implicit none
      integer(C_long) :: i, n, m
      integer(C_long) :: list(m)
      real(C_double) :: x(4,n)
      do i = 1, m
         call update_position(x(:,list(i)))
      end do
    end subroutine f2

end module mutate

I compiled with
gfortran -Ofast -march=native -Wall -fno-semantic-interposition -shared -fPIC mutate.f90 -o libmutate.so

using Test, BenchmarkTools

struct P{T}
  x :: T
  y :: T
  z :: T
end
update_position( x :: P ) = P( 2*x.x, 2*x.y, 2*x.z )

struct Ppad{T}
  x :: T
  y :: T
  z :: T
  p :: T
end
function Ppad(_x, _y, _z)
    x, y, z = promote(_x, _y, _z)
    Ppad(x, y, z, zero(x))
end
update_position( x :: Ppad ) = Ppad( 2x.x, 2x.y, 2x.z, 2x.p )

function f1!(x,list)
  @inbounds for i in list
    x[i] = update_position(x[i])
  end
end

const LIBMUTATE = "libmutate.so"
f2_3!(x,list) = ccall((:f1, LIBMUTATE), Nothing,
                    (Ref{Clong},  Ref{Float64}, Ref{Clong}, Ref{Clong}),
                     size(x,2), x, size(list,1), list)     
f2_4!(x,list) = ccall((:f2, LIBMUTATE), Nothing,
                    (Ref{Clong},  Ref{Float64}, Ref{Clong}, Ref{Clong}),
                     size(x,2), x, size(list,1), list)     
f2!(x, list) = size(x,1) == 3 ? f2_3!(x, list) : f2_4!(x, list)

n = 1_000_000
m = 1_000
list = rand(1:n,m);

setup = ( x = [ P(1.,1.,1.) for i in 1:n ] );
print(" Julia with P: "); @btime f1!(x,$list) setup = setup evals = 1
setup = ( x = [ Ppad(1.,1.,1.) for i in 1:n ] );
print(" Julia with Ppad: "); @btime f1!(x,$list) setup = setup evals = 1

setup = ( y = ones(3,n) );
print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
setup = ( y = ones(4,n) );
print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1

This yields:

julia> setup = ( x = [ P(1.,1.,1.) for i in 1:n ] );

julia> print(" Julia with P: "); @btime f1!(x,$list) setup = setup evals = 1
 Julia with P:   978.000 ns (0 allocations: 0 bytes)

julia> setup = ( x = [ Ppad(1.,1.,1.) for i in 1:n ] );

julia> print(" Julia with Ppad: "); @btime f1!(x,$list) setup = setup evals = 1
 Julia with Ppad:   752.000 ns (0 allocations: 0 bytes)

julia> setup = ( y = ones(3,n) );

julia> print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
 Fortran:   983.000 ns (0 allocations: 0 bytes)

julia> setup = ( y = ones(4,n) );

julia> print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
 Fortran:   852.000 ns (0 allocations: 0 bytes)
2 Likes

Interesting, that padding business is advanced stuff for me :slight_smile:

Concerning the compiler flags, I get the best performance (the same as that with your flags) with

-O3 -march=native

Another thing: the -Ofast option, as I checked, includes fast math. Probably one should compare with the Julia loop using @fastmath, but I didn't see any important difference here.

By the way, thank you very much for the example of how to call Fortran routines from Julia. I struggled a lot to find good examples of that; the manual is quite short on examples showing both codes.

Does this mean you get better performance without -fno-semantic-interposition?
I guess it depends on compiler options. Godbolt shows the functions inlining regardless of -fno-semantic-interposition, but when I compile locally (checking asm with -S -masm=intel), the functions do not inline unless I add -fno-semantic-interposition.

Fast math makes no difference there; associativity isn’t needed for optimizations here.
I just didn’t bother editing it when jumping up in my history to the last gfortran invocation.

SIMD (Single Instruction Multiple Data) lets the CPU operate on multiple numbers per instruction.
With P,

julia> xp = [ Ppad(1.,1.,1.) for i in 1:n ];

julia> x = [ P(1.,1.,1.) for i in 1:n ];

julia> @code_llvm debuginfo=:none f1!(x, list)

julia> @code_llvm debuginfo=:none f1!(xp, list)

This yields, without padding:

L9:                                               ; preds = %L9, %L9.lr.ph
  %value_phi14 = phi i64 [ 0, %L9.lr.ph ], [ %21, %L9 ]
  %14 = getelementptr inbounds i64, i64* %12, i64 %value_phi14
  %15 = load i64, i64* %14, align 8
  %16 = add i64 %15, -1
  %.elt = getelementptr inbounds [3 x double], [3 x double]* %13, i64 %16, i64 0
  %17 = bitcast double* %.elt to <2 x double>*
  %18 = load <2 x double>, <2 x double>* %17, align 8
  %.elt7 = getelementptr inbounds [3 x double], [3 x double]* %13, i64 %16, i64 2
  %.unpack8 = load double, double* %.elt7, align 8
  %19 = fmul <2 x double> %18, <double 2.000000e+00, double 2.000000e+00>
  %20 = fmul double %.unpack8, 2.000000e+00
  store <2 x double> %19, <2 x double>* %17, align 8
  store double %20, double* %.elt7, align 8
  %21 = add nuw nsw i64 %value_phi14, 1
  %exitcond.not = icmp eq i64 %21, %8
  br i1 %exitcond.not, label %L28, label %L9

It performs 2 loads:

  %18 = load <2 x double>, <2 x double>* %17, align 8
  %.unpack8 = load double, double* %.elt7, align 8

Two multiplies (these later get turned into adds):

  %19 = fmul <2 x double> %18, <double 2.000000e+00, double 2.000000e+00>
  %20 = fmul double %.unpack8, 2.000000e+00

and two stores:

  store <2 x double> %19, <2 x double>* %17, align 8
  store double %20, double* %.elt7, align 8

One of these is of <2 x double>, and the other a single double.

With padding:

L9:                                               ; preds = %L9, %L9.lr.ph
  %value_phi19 = phi i64 [ 0, %L9.lr.ph ], [ %20, %L9 ]
  %14 = getelementptr inbounds i64, i64* %12, i64 %value_phi19
  %15 = load i64, i64* %14, align 8
  %16 = add i64 %15, -1
  %.elt = getelementptr inbounds [4 x double], [4 x double]* %13, i64 %16, i64 0
  %17 = bitcast double* %.elt to <4 x double>*
  %18 = load <4 x double>, <4 x double>* %17, align 8
  %19 = fmul <4 x double> %18, <double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00>
  store <4 x double> %19, <4 x double>* %17, align 8
  %20 = add nuw nsw i64 %value_phi19, 1
  %exitcond.not = icmp eq i64 %20, %8
  br i1 %exitcond.not, label %L31, label %L9

We now have just a single load:

  %18 = load <4 x double>, <4 x double>* %17, align 8

A single multiply:

  %19 = fmul <4 x double> %18, <double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00>

and a single store:

  store <4 x double> %19, <4 x double>* %17, align 8

So on CPUs with the AVX instruction set (which provides 256 bit vectors, letting them operate on <4 x double>s at a time), padding to add an extra operation will actually decrease the number of instructions that must be performed: instead of 2 + 1 = 3, we can do 4 at a time.
Of course, there are a lot of other instructions aside from just these, so the actual performance difference is smaller.

6 Likes

No, sorry, I get roughly the same performance. I could not see any difference. I didn’t know about that tool, very interesting.

ps. By chance, do you teach all this stuff somewhere? The way you approach these problems is a completely new world to me.

2 Likes

No, but I’ll work on some material eventually. I’ll probably write some for LoopVectorization’s eventual v1.0 release (which is months away).

9 Likes