Julia's assignment behavior differs from Fortran?

= is not an equality test, but an assignment. Confusing, perhaps. But entirely consistent across many languages.

1 Like

When I learned that the syntax x[i] = v is converted to setindex!(x, i, v) in Julia, I didn’t need to think too much about what is mutable or not, it is pretty clear that something is being set to a new value. The syntax x = v is a change of name as others have already pointed out.

2 Likes

FWIW, Fortran compilers normally don’t have a problem putting all the arrays you’re mutating on the stack. For example, see the gfortran flag -fstack-arrays.
In Julia, StaticArrays.MArrays will also generally be placed on the stack, so long as they do not escape the function creating them or get passed as an argument to a function that isn’t inlined.

A feature request I have in Julia (maybe I’ll make a PR) is support for alloca, so it’s easier to stack allocate memory that’s freed when the function returns, but that you can pass as an argument to non-inlined functions. llvmcall doesn’t work for this, because the memory is immediately freed when the llvmcall returns, not when the Julia function calling llvmcall returns.

julia> using StaticArrays, BenchmarkTools

julia> function f()
         s = 0.
         for i in 1:1000
           x = SVector{3,Float64}(i, sqrt(i), i^2)
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
f (generic function with 1 method)

julia> function g()
         s = 0.
         for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
g (generic function with 1 method)

julia> @btime f()
  2.605 μs (0 allocations: 0 bytes)
3.343550974558874e8

julia> @btime g()
  2.605 μs (0 allocations: 0 bytes)
3.343550974558874e8

However, SVectors are easier for the compiler; using the MVector can prevent some optimizations; see what happens when we add @fastmath:

julia> @fastmath function f()
         s = 0.
         for i in 1:1000
           x = SVector{3,Float64}(i, sqrt(i), i^2)
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
f (generic function with 1 method)

julia> @fastmath function g()
         s = 0.
         for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
g (generic function with 1 method)

julia> @btime f()
  744.343 ns (0 allocations: 0 bytes)
3.3435509745588744e8

julia> @btime g()
  2.604 μs (0 allocations: 0 bytes)
3.343550974558874e8

Suddenly f (using SVectors) is much faster, but the speed of g() is almost unchanged.

But my point here is that, just because the MVector is mutable and getindex and setindex! are defined using loads and stores to/from their pointers doesn’t mean it has to be heap allocated.

EDIT:
The simplest way to speed up g is to use @inbounds:

julia> @fastmath function h()
         s = 0.
         @inbounds for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
h (generic function with 1 method)

julia> @btime h()
  744.317 ns (0 allocations: 0 bytes)
3.3435509745588744e8

A less simple way is to manually unroll the inner loop:

julia> @fastmath function g()
         s = 0.
         for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           Base.Cartesian.@nexprs 3 j -> begin
             s = s + x[j]
           end
         end
         s
       end
g (generic function with 1 method)

julia> @btime g()
  744.324 ns (0 allocations: 0 bytes)
3.3435509745588744e8

I guess it’s close to a limit where the compiler gives up optimizing it.

EDIT:
It just occurred to me that AVX512 is probably needed for the @fastmath versions of this function to be fast, because AVX512 is needed for SIMD Int64 -> Float64 conversion.
If you don’t have AVX512, I’d recommend

julia> @fastmath function h2()
         s = 0.; i = 0.0
         @inbounds for _ in 1:1000
           x = MVector{3,Float64}(undef);
           i += 1
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
h2 (generic function with 1 method)

julia> @btime h2()
  711.417 ns (0 allocations: 0 bytes)
3.3435509745588744e8
4 Likes

I will post an example of the use of immutable types to compare with a Fortran routine. I do this to learn things myself, so why not share… I think this is an example on how the Julia syntax is nice when we get used to it:

Problem: Given the positions of N particles in 3D, update the positions of a subset of size M of these particles defined in a list.

Using Julia, I will first define an immutable struct to contain the coordinates, and a constructor that updated the particles with the transformation desired (in this case, multiply the coordinates by 2):

struct P{T}
  x :: T
  y :: T
  z :: T
end
update_position( x :: P ) = P( 2*x.x, 2*x.y, 2*x.z )  

Let us generate a vector of these particles:

n = 1_000_000
x = [ P(1.,1.,1.) for i in 1:n ]

For the Fortran routine, the natural way will be to create a matrix, so we do:

y = ones(3,n)

We will create a list of indexes of the particles that should have their positions updated, with:

m = 1_000
list = rand(1:n,m)

Then, the Julia function that updates the positions of these m particles is:

function f1!(x,list)
  for i in list
    x[i] = update_position(x[i])
  end
end    

The Fortran routines to do the same job are:

subroutine update_position(i,n,x)
  implicit none
  integer :: j, n
  integer*8 i
  double precision :: x(3,n)
  do j = 1, 3
    x(j,i) = 2*x(j,i)
  end do
end subroutine update_position

subroutine f1(n,x,m,list)
  implicit none
  integer :: i, j, n, m
  integer*8 :: list(m)
  double precision :: x(3,n)
  do i = 1, m
    call update_position(list(i),n,x)
  end do
end subroutine f1  

I think that it is pretty clear that the Julia code is much simpler and clear. Bugs turn out to be easier to find at the end, even with the more permissive syntax of Julia. Of course we can suggest changes in both codes, but in particular if we want to keep the function that updates the position of a single particle separated from the one that updates the lists, the Fortran code code will not be much simpler than that.

Finally, the Julia function turns out to be even slightly faster than the Fortran routine:

 Julia with P:   1.869 μs (0 allocations: 0 bytes)
 Fortran:   2.292 μs (0 allocations: 0 bytes)

(although I know that benchmarks are always complicated and a disclaimer: if we wanted to update the positions of all particles, which are contiguous in memory, than we should use a matrix of particles in row-order, allowing loop vectorizations. The case here is that the positions to be updated are not contiguous in memory).

The complete code is here:

Summary
using Test
using BenchmarkTools

# Using user defined immutable type 
struct P{T}
  x :: T
  y :: T
  z :: T
end
update_position( x :: P ) = P( 2*x.x, 2*x.y, 2*x.z )

function f1!(x,list)
  for i in list
    x[i] = update_position(x[i])
  end
end

# Calling Fortran routine:
"""
subroutine update_position(i,n,x)
  implicit none
  integer :: j, n
  integer*8 i
  double precision :: x(3,n)
  do j = 1, 3
    x(j,i) = 2*x(j,i)
  end do
end subroutine update_position

subroutine f1(n,x,m,list)
  implicit none
  integer :: i, j, n, m
  integer*8 :: list(m)
  double precision :: x(3,n)
  do i = 1, m
    call update_position(list(i),n,x)
  end do
end subroutine f1   

Compiled with:
gfortran -O3 mutate.f90 -shared -o mutate.so
"""
f2!(x,list) = ccall((:f1_,"mutate.so"),Nothing,
                    (Ref{Int32},         Ref{Float64}, Ref{Int32},          Ref{Int64}),
                     Int32(size(x,2)),   x,            Int32(size(list,1)), list)     


# Running

n = 1_000_000
m = 1_000
list = rand(1:n,m)

# With julia immutable struct
x = [ P(1.,1.,1.) for i in 1:n ]
f1!(x,list)

# Fortran Matrix
y = ones(3,n) 
f2!(y,list)

# Testing if the fortran code returns the same thing
@test begin
  t = true
  for i in 1:n
    if x[i].x != y[1,i] || x[i].y != y[2,i] || x[i].z != y[3,i] 
      t = false
    end
  end
  t
end

setup = ( x = [ P(1.,1.,1.) for i in 1:n ] )
print(" Julia with P: "); @btime f1!(x,$list) setup = setup evals = 1

setup = ( y = ones(3,n) )
print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1

                                                 
3 Likes

On computers with AVX, you can improve performance by padding out P to be 256 bits, i.e. make it 4D but just ignore the last.
Additionally, gfortran generally needs the compiler flag -fno-semantic-interposition for performance. Without it, it will not inline functions.

I also added @inbounds to the Julia version to make it equivalent to the Fortran.

Code
module mutate

  use ISO_C_BINDING

  contains
    subroutine update_position(x)
      implicit none
      real(C_double), dimension(:), intent(inout) :: x
      x = 2*x
    end subroutine update_position

    subroutine f1(n,x,m,list) BIND(C, name="f1")
      implicit none
      integer(C_long) :: i, n, m
      integer(C_long) :: list(m)
      real(C_double) :: x(3,n)
      do i = 1, m
         call update_position(x(:,list(i)))
      end do
    end subroutine f1

    subroutine f2(n,x,m,list) BIND(C, name="f2")
      implicit none
      integer(C_long) :: i, n, m
      integer(C_long) :: list(m)
      real(C_double) :: x(4,n)
      do i = 1, m
         call update_position(x(:,list(i)))
      end do
    end subroutine f2

end module mutate

I compiled with
gfortran -Ofast -march=native -Wall -fno-semantic-interposition -shared -fPIC mutate.f90 -o libmutate.so

using Test, BenchmarkTools

struct P{T}
  x :: T
  y :: T
  z :: T
end
update_position( x :: P ) = P( 2*x.x, 2*x.y, 2*x.z )

struct Ppad{T}
  x :: T
  y :: T
  z :: T
  p :: T
end
function Ppad(_x, _y, _z)
    x, y, z = promote(_x, _y, _z)
    Ppad(x, y, z, zero(x))
end
update_position( x :: Ppad ) = Ppad( 2x.x, 2x.y, 2x.z, 2x.p )

function f1!(x,list)
  @inbounds for i in list
    x[i] = update_position(x[i])
  end
end

const LIBMUTATE = "libmutate.so"
f2_3!(x,list) = ccall((:f1, LIBMUTATE), Nothing,
                    (Ref{Clong},  Ref{Float64}, Ref{Clong}, Ref{Clong}),
                     size(x,2), x, size(list,1), list)     
f2_4!(x,list) = ccall((:f2, LIBMUTATE), Nothing,
                    (Ref{Clong},  Ref{Float64}, Ref{Clong}, Ref{Clong}),
                     size(x,2), x, size(list,1), list)     
f2!(x, list) = size(x,1) == 3 ? f2_3!(x, list) : f2_4!(x, list)

n = 1_000_000
m = 1_000
list = rand(1:n,m);

setup = ( x = [ P(1.,1.,1.) for i in 1:n ] );
print(" Julia with P: "); @btime f1!(x,$list) setup = setup evals = 1
setup = ( x = [ Ppad(1.,1.,1.) for i in 1:n ] );
print(" Julia with Ppad: "); @btime f1!(x,$list) setup = setup evals = 1

setup = ( y = ones(3,n) );
print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
setup = ( y = ones(4,n) );
print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1

This yields:

julia> setup = ( x = [ P(1.,1.,1.) for i in 1:n ] );

julia> print(" Julia with P: "); @btime f1!(x,$list) setup = setup evals = 1
 Julia with P:   978.000 ns (0 allocations: 0 bytes)

julia> setup = ( x = [ Ppad(1.,1.,1.) for i in 1:n ] );

julia> print(" Julia with Ppad: "); @btime f1!(x,$list) setup = setup evals = 1
 Julia with Ppad:   752.000 ns (0 allocations: 0 bytes)

julia> setup = ( y = ones(3,n) );

julia> print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
 Fortran:   983.000 ns (0 allocations: 0 bytes)

julia> setup = ( y = ones(4,n) );

julia> print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
 Fortran:   852.000 ns (0 allocations: 0 bytes)
2 Likes

Interesting, that padding stuff is advanced stuff for me :slight_smile:

Concerning the compiler flags, I get the best performance (the same as that with your flags) with

-O3 -march=native

Another thing: the -Ofast option, as I checked, included fastmath. Probably one should compare with the loop in Julia using @fastmath, but I didn’t see any important difference here.

By the way, thank you very much for the example on how to call Fortran routines from Julia. I struggled a lot to find good examples of that, the manual is quite short in examples showing both codes.

Does this mean you get better performance without -fno-semantic-interposition?
I guess it depends on compiler options. Godbolt shows the functions inlining regardless of -fno-semantic-interposition, but when I compile locally (checking asm with -S -masm=intel), the functions do not inline unless I add -fno-semantic-interposition.

Fast math makes no difference there; associativity isn’t needed for optimizations here.
I just didn’t bother editing it when jumping up in my history to the last gfortran invocation.

SIMD (Single Instruction Multiple Data) lets the CPU operate on multiple numbers per instruction.
With P,

julia> xp = [ Ppad(1.,1.,1.) for i in 1:n ];

julia> x = [ P(1.,1.,1.) for i in 1:n ];

julia> @code_llvm debuginfo=:none f1!(x, list)

julia> @code_llvm debuginfo=:none f1!(xp, list)

This yields, without padding:

L9:                                               ; preds = %L9, %L9.lr.ph
  %value_phi14 = phi i64 [ 0, %L9.lr.ph ], [ %21, %L9 ]
  %14 = getelementptr inbounds i64, i64* %12, i64 %value_phi14
  %15 = load i64, i64* %14, align 8
  %16 = add i64 %15, -1
  %.elt = getelementptr inbounds [3 x double], [3 x double]* %13, i64 %16, i64 0
  %17 = bitcast double* %.elt to <2 x double>*
  %18 = load <2 x double>, <2 x double>* %17, align 8
  %.elt7 = getelementptr inbounds [3 x double], [3 x double]* %13, i64 %16, i64 2
  %.unpack8 = load double, double* %.elt7, align 8
  %19 = fmul <2 x double> %18, <double 2.000000e+00, double 2.000000e+00>
  %20 = fmul double %.unpack8, 2.000000e+00
  store <2 x double> %19, <2 x double>* %17, align 8
  store double %20, double* %.elt7, align 8
  %21 = add nuw nsw i64 %value_phi14, 1
  %exitcond.not = icmp eq i64 %21, %8
  br i1 %exitcond.not, label %L28, label %L9

It performs 2 loads:

  %18 = load <2 x double>, <2 x double>* %17, align 8
  %.unpack8 = load double, double* %.elt7, align 8

Two multiplies (these later get turned into adds):

  %19 = fmul <2 x double> %18, <double 2.000000e+00, double 2.000000e+00>
  %20 = fmul double %.unpack8, 2.000000e+00

and two stores:

  store <2 x double> %19, <2 x double>* %17, align 8
  store double %20, double* %.elt7, align 8

One of these is of <2 x double>, and the other a single double.

With padding:

L9:                                               ; preds = %L9, %L9.lr.ph
  %value_phi19 = phi i64 [ 0, %L9.lr.ph ], [ %20, %L9 ]
  %14 = getelementptr inbounds i64, i64* %12, i64 %value_phi19
  %15 = load i64, i64* %14, align 8
  %16 = add i64 %15, -1
  %.elt = getelementptr inbounds [4 x double], [4 x double]* %13, i64 %16, i64 0
  %17 = bitcast double* %.elt to <4 x double>*
  %18 = load <4 x double>, <4 x double>* %17, align 8
  %19 = fmul <4 x double> %18, <double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00>
  store <4 x double> %19, <4 x double>* %17, align 8
  %20 = add nuw nsw i64 %value_phi19, 1
  %exitcond.not = icmp eq i64 %20, %8
  br i1 %exitcond.not, label %L31, label %L9

We now have just a single load:

  %18 = load <4 x double>, <4 x double>* %17, align 8

A single multiply:

  %19 = fmul <4 x double> %18, <double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00>

and a single store:

  store <4 x double> %19, <4 x double>* %17, align 8

So on CPUs with the AVX instruction set (which provides 256 bit vectors, letting them operate on <4 x double>s at a time), padding to add an extra operation will actually decrease the number of instructions that must be performed: instead of 2 + 1 = 3, we can do 4 at a time.
Of course, there are a lot of other instructions aside from just these, so the actual performance difference is smaller.

6 Likes

No, sorry, I get roughly the same performance. I could not see any difference. I didn’t know about that tool, very interesting.

ps. By chance do you teach all this stuff somewhere? The way you approach these problems is a complete new world to me.

2 Likes

No, but I’ll work on some material eventually. I’ll probably write some for LoopVectorization’s eventual v1.0 release (which is months away).

9 Likes