= is not an equality test, but an assignment. Confusing, perhaps. But entirely consistent across many languages.
When I learned that the syntax x[i] = v is converted to setindex!(x, v, i) in Julia, I didn't need to think too much about what is mutable or not; it is pretty clear that something is being set to a new value. The syntax x = v is a rebinding of the name, as others have already pointed out.
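A minimal REPL sketch of the two meanings (the variable names here are just for illustration):

julia> v = [1, 2, 3];

julia> v[1] = 10;    # lowers to setindex!(v, 10, 1): mutates the existing array

julia> w = v;        # binds another name to the same array, no copy is made

julia> v = [4, 5];   # rebinds the name v to a new array; w still sees the old one

julia> w
3-element Vector{Int64}:
 10
  2
  3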
FWIW, Fortran compilers normally don't have a problem putting all the arrays you're mutating on the stack. For example, see the gfortran flag -fstack-arrays.
In Julia, StaticArrays.MArrays will also generally be placed on the stack, so long as they do not escape the function creating them or get passed as an argument to a function that isn't inlined.
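A rough sketch of that escape condition (the function names are hypothetical, and the exact behavior depends on the Julia version and on what the compiler manages to inline):

using StaticArrays

function sum_local()                       # the MVector never leaves this function,
    x = MVector{3,Float64}(1.0, 2.0, 3.0)  # so it can stay on the stack
    x[1] += 1.0
    return x[1] + x[2] + x[3]
end

function make_escaping()                   # returning the MVector lets it escape,
    x = MVector{3,Float64}(1.0, 2.0, 3.0)  # which generally forces a heap allocation
    return x
end

After warming both up, @allocated sum_local() should report 0 bytes, while @allocated make_escaping() should not.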
A feature request I have in Julia (maybe I'll make a PR) is support for alloca, so it's easier to stack-allocate memory that is freed when the function returns, but that you can pass as an argument to non-inlined functions. llvmcall doesn't work for this, because the memory is immediately freed when the llvmcall returns, not when the Julia function calling llvmcall returns.
julia> using StaticArrays, BenchmarkTools
julia> function f()
           s = 0.
           for i in 1:1000
               x = SVector{3,Float64}(i, sqrt(i), i^2)
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
f (generic function with 1 method)
julia> function g()
           s = 0.
           for i in 1:1000
               x = MVector{3,Float64}(undef)
               x[1] = i
               x[2] = sqrt(i)
               x[3] = i^2
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
g (generic function with 1 method)
julia> @btime f()
  2.605 μs (0 allocations: 0 bytes)
3.343550974558874e8
julia> @btime g()
  2.605 μs (0 allocations: 0 bytes)
3.343550974558874e8
However, SVectors are easier for the compiler; using the MVector can prevent some optimizations. See what happens when we add @fastmath:
julia> @fastmath function f()
           s = 0.
           for i in 1:1000
               x = SVector{3,Float64}(i, sqrt(i), i^2)
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
f (generic function with 1 method)
julia> @fastmath function g()
           s = 0.
           for i in 1:1000
               x = MVector{3,Float64}(undef)
               x[1] = i
               x[2] = sqrt(i)
               x[3] = i^2
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
g (generic function with 1 method)
julia> @btime f()
744.343 ns (0 allocations: 0 bytes)
3.3435509745588744e8
julia> @btime g()
  2.604 μs (0 allocations: 0 bytes)
3.343550974558874e8
Suddenly f (using SVectors) is much faster, but the speed of g() is almost unchanged.
But my point here is that, just because the MVector is mutable and getindex and setindex! are defined using loads and stores to/from their pointers doesn't mean it has to be heap allocated.
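A quick way to check this (a sketch assuming the g defined above and a warmed-up call, so compilation is excluded):

julia> g();            # warm up

julia> @allocated g()
0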
EDIT:
The simplest way to speed up g is to use @inbounds:
julia> @fastmath function h()
           s = 0.
           @inbounds for i in 1:1000
               x = MVector{3,Float64}(undef)
               x[1] = i
               x[2] = sqrt(i)
               x[3] = i^2
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
h (generic function with 1 method)
julia> @btime h()
744.317 ns (0 allocations: 0 bytes)
3.3435509745588744e8
A less simple way is to manually unroll the inner loop:
julia> @fastmath function g()
           s = 0.
           for i in 1:1000
               x = MVector{3,Float64}(undef)
               x[1] = i
               x[2] = sqrt(i)
               x[3] = i^2
               Base.Cartesian.@nexprs 3 j -> begin
                   s = s + x[j]
               end
           end
           s
       end
g (generic function with 1 method)
julia> @btime g()
744.324 ns (0 allocations: 0 bytes)
3.3435509745588744e8
I guess it's close to a limit where the compiler gives up optimizing it.
EDIT:
It just occurred to me that AVX512 is probably needed for the @fastmath versions of this function to be fast, because AVX512 is needed for SIMD Int64 -> Float64 conversion. If you don't have AVX512, I'd recommend:
julia> @fastmath function h2()
           s = 0.; i = 0.0
           @inbounds for _ in 1:1000
               x = MVector{3,Float64}(undef)
               i += 1
               x[1] = i
               x[2] = sqrt(i)
               x[3] = i^2
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
h2 (generic function with 1 method)
julia> @btime h2()
711.417 ns (0 allocations: 0 bytes)
3.3435509745588744e8
I will post an example of the use of immutable types to compare with a Fortran routine. I do this to learn things myself, so why not share… I think this is an example of how the Julia syntax is nice once we get used to it:
Problem: Given the positions of N particles in 3D, update the positions of a subset of size M of these particles defined in a list.
Using Julia, I will first define an immutable struct to contain the coordinates, and a constructor that updates a particle with the desired transformation (in this case, multiplying the coordinates by 2):
struct P{T}
    x :: T
    y :: T
    z :: T
end
update_position( x :: P ) = P( 2*x.x, 2*x.y, 2*x.z )
Let us generate a vector of these particles:
n = 1_000_000
x = [ P(1.,1.,1.) for i in 1:n ]
For the Fortran routine, the natural way will be to create a matrix, so we do:
y = ones(3,n)
We will create a list of indexes of the particles that should have their positions updated, with:
m = 1_000
list = rand(1:n,m)
Then, the Julia function that updates the positions of these m particles is:
function f1!(x,list)
    for i in list
        x[i] = update_position(x[i])
    end
end
The Fortran routines to do the same job are:
subroutine update_position(i,n,x)
    implicit none
    integer :: j, n
    integer*8 :: i
    double precision :: x(3,n)
    do j = 1, 3
        x(j,i) = 2*x(j,i)
    end do
end subroutine update_position

subroutine f1(n,x,m,list)
    implicit none
    integer :: i, j, n, m
    integer*8 :: list(m)
    double precision :: x(3,n)
    do i = 1, m
        call update_position(list(i),n,x)
    end do
end subroutine f1
I think it is pretty clear that the Julia code is much simpler and clearer. Bugs turn out to be easier to find in the end, even with the more permissive syntax of Julia. Of course we can suggest changes to both codes, but in particular, if we want to keep the function that updates the position of a single particle separate from the one that updates the list, the Fortran code will not be much simpler than this.
Finally, the Julia function turns out to be even slightly faster than the Fortran routine:
Julia with P: 1.869 μs (0 allocations: 0 bytes)
Fortran: 2.292 μs (0 allocations: 0 bytes)
(although I know that benchmarks are always complicated, and a disclaimer: if we wanted to update the positions of all particles, which are contiguous in memory, then we should use a matrix of particles in row-order, allowing loop vectorization. The case here is that the positions to be updated are not contiguous in memory).
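For contrast, a minimal sketch of that contiguous case (the name update_all! is just illustrative): updating every particle walks the 3×n matrix in memory order, so a plain loop vectorizes well.

function update_all!(y)
    # y is the 3×n coordinate matrix; linear indexing touches it contiguously
    @inbounds @simd for k in eachindex(y)
        y[k] = 2 * y[k]
    end
end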
The complete code is here:
using Test
using BenchmarkTools
# Using user defined immutable type
struct P{T}
    x :: T
    y :: T
    z :: T
end
update_position( x :: P ) = P( 2*x.x, 2*x.y, 2*x.z )
function f1!(x,list)
    for i in list
        x[i] = update_position(x[i])
    end
end
# Calling Fortran routine:
"""
subroutine update_position(i,n,x)
    implicit none
    integer :: j, n
    integer*8 :: i
    double precision :: x(3,n)
    do j = 1, 3
        x(j,i) = 2*x(j,i)
    end do
end subroutine update_position

subroutine f1(n,x,m,list)
    implicit none
    integer :: i, j, n, m
    integer*8 :: list(m)
    double precision :: x(3,n)
    do i = 1, m
        call update_position(list(i),n,x)
    end do
end subroutine f1
Compiled with:
gfortran -O3 mutate.f90 -shared -o mutate.so
"""
f2!(x,list) = ccall((:f1_,"mutate.so"), Nothing,
    (Ref{Int32}, Ref{Float64}, Ref{Int32}, Ref{Int64}),
    Int32(size(x,2)), x, Int32(size(list,1)), list)
# Running
n = 1_000_000
m = 1_000
list = rand(1:n,m)
# With julia immutable struct
x = [ P(1.,1.,1.) for i in 1:n ]
f1!(x,list)
# Fortran Matrix
y = ones(3,n)
f2!(y,list)
# Testing if the fortran code returns the same thing
@test begin
    t = true
    for i in 1:n
        if x[i].x != y[1,i] || x[i].y != y[2,i] || x[i].z != y[3,i]
            t = false
        end
    end
    t
end
setup = ( x = [ P(1.,1.,1.) for i in 1:n ] )
print(" Julia with P: "); @btime f1!(x,$list) setup = setup evals = 1
setup = ( y = ones(3,n) )
print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
On computers with AVX, you can improve performance by padding out P to be 256 bits, i.e. make it 4D but just ignore the last component.
Additionally, gfortran generally needs the compiler flag -fno-semantic-interposition for performance. Without it, it will not inline functions.
I also added @inbounds to the Julia version to make it equivalent to the Fortran.
Code
module mutate
    use ISO_C_BINDING
contains
    subroutine update_position(x)
        implicit none
        real(C_double), dimension(:), intent(inout) :: x
        x = 2*x
    end subroutine update_position

    subroutine f1(n,x,m,list) BIND(C, name="f1")
        implicit none
        integer(C_long) :: i, n, m
        integer(C_long) :: list(m)
        real(C_double) :: x(3,n)
        do i = 1, m
            call update_position(x(:,list(i)))
        end do
    end subroutine f1

    subroutine f2(n,x,m,list) BIND(C, name="f2")
        implicit none
        integer(C_long) :: i, n, m
        integer(C_long) :: list(m)
        real(C_double) :: x(4,n)
        do i = 1, m
            call update_position(x(:,list(i)))
        end do
    end subroutine f2
end module mutate
I compiled with
gfortran -Ofast -march=native -Wall -fno-semantic-interposition -shared -fPIC mutate.f90 -o libmutate.so
using Test, BenchmarkTools
struct P{T}
    x :: T
    y :: T
    z :: T
end
update_position( x :: P ) = P( 2*x.x, 2*x.y, 2*x.z )
struct Ppad{T}
    x :: T
    y :: T
    z :: T
    p :: T
end
function Ppad(_x, _y, _z)
    x, y, z = promote(_x, _y, _z)
    Ppad(x, y, z, zero(x))
end
update_position( x :: Ppad ) = Ppad( 2x.x, 2x.y, 2x.z, 2x.p )
function f1!(x,list)
    @inbounds for i in list
        x[i] = update_position(x[i])
    end
end
const LIBMUTATE = "libmutate.so"
f2_3!(x,list) = ccall((:f1, LIBMUTATE), Nothing,
    (Ref{Clong}, Ref{Float64}, Ref{Clong}, Ref{Clong}),
    size(x,2), x, size(list,1), list)
f2_4!(x,list) = ccall((:f2, LIBMUTATE), Nothing,
    (Ref{Clong}, Ref{Float64}, Ref{Clong}, Ref{Clong}),
    size(x,2), x, size(list,1), list)
f2!(x, list) = size(x,1) == 3 ? f2_3!(x, list) : f2_4!(x, list)
n = 1_000_000
m = 1_000
list = rand(1:n,m);
setup = ( x = [ P(1.,1.,1.) for i in 1:n ] );
print(" Julia with P: "); @btime f1!(x,$list) setup = setup evals = 1
setup = ( x = [ Ppad(1.,1.,1.) for i in 1:n ] );
print(" Julia with Ppad: "); @btime f1!(x,$list) setup = setup evals = 1
setup = ( y = ones(3,n) );
print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
setup = ( y = ones(4,n) );
print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
This yields:
julia> setup = ( x = [ P(1.,1.,1.) for i in 1:n ] );
julia> print(" Julia with P: "); @btime f1!(x,$list) setup = setup evals = 1
Julia with P: 978.000 ns (0 allocations: 0 bytes)
julia> setup = ( x = [ Ppad(1.,1.,1.) for i in 1:n ] );
julia> print(" Julia with Ppad: "); @btime f1!(x,$list) setup = setup evals = 1
Julia with Ppad: 752.000 ns (0 allocations: 0 bytes)
julia> setup = ( y = ones(3,n) );
julia> print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
Fortran: 983.000 ns (0 allocations: 0 bytes)
julia> setup = ( y = ones(4,n) );
julia> print(" Fortran: "); @btime f2!(y,$list) setup=setup evals=1
Fortran: 852.000 ns (0 allocations: 0 bytes)
Interesting, that padding stuff is quite advanced for me.
Concerning the compiler flags, I get the best performance (the same as with your flags) with
-O3 -march=native
Another thing: the -Ofast option, as I checked, includes -ffast-math. Probably one should compare with the loop in Julia using @fastmath, but I didn't see any important difference here.
By the way, thank you very much for the example of how to call Fortran routines from Julia. I struggled a lot to find good examples of that; the manual is quite short on examples showing both codes.
Does this mean you get better performance without -fno-semantic-interposition?
I guess it depends on compiler options. Godbolt shows the functions inlining regardless of -fno-semantic-interposition, but when I compile locally (checking asm with -S -masm=intel), the functions do not inline unless I add -fno-semantic-interposition.
Fast math makes no difference there; associativity isn't needed for the optimizations here.
I just didn't bother editing it when jumping up in my history to the last gfortran invocation.
SIMD (Single Instruction Multiple Data) lets the CPU operate on multiple numbers per instruction.
With P and Ppad defined as above:
julia> xp = [ Ppad(1.,1.,1.) for i in 1:n ];
julia> x = [ P(1.,1.,1.) for i in 1:n ];
julia> @code_llvm debuginfo=:none f1!(x, list)
julia> @code_llvm debuginfo=:none f1!(xp, list)
This yields, without padding:
L9: ; preds = %L9, %L9.lr.ph
%value_phi14 = phi i64 [ 0, %L9.lr.ph ], [ %21, %L9 ]
%14 = getelementptr inbounds i64, i64* %12, i64 %value_phi14
%15 = load i64, i64* %14, align 8
%16 = add i64 %15, -1
%.elt = getelementptr inbounds [3 x double], [3 x double]* %13, i64 %16, i64 0
%17 = bitcast double* %.elt to <2 x double>*
%18 = load <2 x double>, <2 x double>* %17, align 8
%.elt7 = getelementptr inbounds [3 x double], [3 x double]* %13, i64 %16, i64 2
%.unpack8 = load double, double* %.elt7, align 8
%19 = fmul <2 x double> %18, <double 2.000000e+00, double 2.000000e+00>
%20 = fmul double %.unpack8, 2.000000e+00
store <2 x double> %19, <2 x double>* %17, align 8
store double %20, double* %.elt7, align 8
%21 = add nuw nsw i64 %value_phi14, 1
%exitcond.not = icmp eq i64 %21, %8
br i1 %exitcond.not, label %L28, label %L9
It performs 2 loads:
%18 = load <2 x double>, <2 x double>* %17, align 8
%.unpack8 = load double, double* %.elt7, align 8
Two multiplies (these later get turned into adds):
%19 = fmul <2 x double> %18, <double 2.000000e+00, double 2.000000e+00>
%20 = fmul double %.unpack8, 2.000000e+00
and two stores:
store <2 x double> %19, <2 x double>* %17, align 8
store double %20, double* %.elt7, align 8
One of these is of <2 x double>, and the other is a single double.
With padding:
L9: ; preds = %L9, %L9.lr.ph
%value_phi19 = phi i64 [ 0, %L9.lr.ph ], [ %20, %L9 ]
%14 = getelementptr inbounds i64, i64* %12, i64 %value_phi19
%15 = load i64, i64* %14, align 8
%16 = add i64 %15, -1
%.elt = getelementptr inbounds [4 x double], [4 x double]* %13, i64 %16, i64 0
%17 = bitcast double* %.elt to <4 x double>*
%18 = load <4 x double>, <4 x double>* %17, align 8
%19 = fmul <4 x double> %18, <double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00>
store <4 x double> %19, <4 x double>* %17, align 8
%20 = add nuw nsw i64 %value_phi19, 1
%exitcond.not = icmp eq i64 %20, %8
br i1 %exitcond.not, label %L31, label %L9
We now have just a single load:
%18 = load <4 x double>, <4 x double>* %17, align 8
A single multiply:
%19 = fmul <4 x double> %18, <double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00>
and a single store:
store <4 x double> %19, <4 x double>* %17, align 8
So on CPUs with the AVX instruction set (which provides 256-bit vectors, letting them operate on <4 x double>s at a time), padding to add an extra element actually decreases the number of instructions that must be performed: instead of splitting the 3 numbers into a 2-wide operation plus a scalar one, we handle all 4 (the 3 real components plus the pad) in a single operation.
Of course, there are a lot of other instructions aside from just these, so the actual performance difference is smaller.
No, sorry, I get roughly the same performance. I could not see any difference. I didn't know about that tool, very interesting.
P.S. By chance, do you teach all this stuff somewhere? The way you approach these problems is a completely new world to me.
No, but I'll work on some material eventually. I'll probably write some for LoopVectorization's eventual v1.0 release (which is months away).