The cost of size() in for-loops

Hello everybody,

I have a performance question which I hope has not been asked to death before. I will provide a minimal working example below.

First, let me give some context. The type of code shown below gets called several tens of millions of times in the actual computation. The array work[] always has fixed size n×m = 3×11, at least for the current application; the size is never changed. In the actual computation the total runtime (@time on the calling function) is 20 s for the slowest variant discussed below and 15 s for the fastest one.

I know about StaticArrays, but I would like to stick to the stdlib for the moment, and the behaviour that surprised me occurs with the standard dynamic arrays (I did not test StaticArrays). I am using Julia 0.7.0-beta2.

So, now the example:

using BenchmarkTools

  @inline function setup_k_no_if(k::Int,xalt,work,coef,factor=1.0)

    work[:,k].=0.0
    
    for jj=1:size(coef,1)
      for ii=1:size(work,1)
        @inbounds work[ii,k]=work[ii,k]+coef[jj]*work[ii,jj]
      end
    end

    for ii=1:size(work,1)
      @inbounds work[ii,k]=work[ii,k]/factor+xalt[ii]
    end

    nothing

  end

  @inline function setup_k_with_if(k::Int,xalt,work,coef,factor=1.0)

    work[:,k].=0.0
    
    if (size(work,1)==3)
      for jj=1:size(coef,1)
        for ii=1:3
          @inbounds work[ii,k]=work[ii,k]+coef[jj]*work[ii,jj]
        end
      end
    else
      for jj=1:size(coef,1)
        for ii=1:size(work,1)
          @inbounds work[ii,k]=work[ii,k]+coef[jj]*work[ii,jj]
        end
      end
    end

    nothing

  end

  @inline function setup_k_fixed_size(k::Int,xalt,work,coef,factor=1.0)

    work[:,k].=0.0
    
    for jj=1:size(coef,1)
      for ii=1:3
        @inbounds work[ii,k]=work[ii,k]+coef[jj]*work[ii,jj]
      end
    end

    for ii=1:3
      @inbounds work[ii,k]=work[ii,k]/factor+xalt[ii]
    end

    nothing

  end

  work=zeros(3,11)
  work[:,1]=[2.0,1.0,4.0]
  xalt=[1.0,2.0,3.0]
  s21=sqrt(21.0)
  coef11=(0.0,0.0,0.0,0.0,(-42.0+7.0*s21)*20.0,(-18.0+28.0*s21)*8.0,
      (-273.0-53.0*s21)*5.0,(301.0+53.0*s21)*5.0,(28.0-28.0*s21)*8.0,
      (49.0-7.0*s21)*20.0)
  @btime setup_k_no_if(11,xalt,work,coef11,17640.0)
  @btime setup_k_with_if(11,xalt,work,coef11,17640.0)
  @btime setup_k_fixed_size(11,xalt,work,coef11,17640.0)

Calling this gives the following results:

setup_k_no_if:      50.825 ns (0 allocations: 0 bytes)
setup_k_with_if:    35.021 ns (0 allocations: 0 bytes)
setup_k_fixed_size: 38.538 ns (0 allocations: 0 bytes)

In the real-world code the setup_k_no_if() variant takes 20 s, setup_k_with_if() 15 s, and setup_k_fixed_size() perhaps 14 s.

Now I am wondering:

  • The compiler is apparently not able to infer size(work,1) at compile time and specialize the code so that the size(work,1) calls are no longer needed. That looks like a missed optimization opportunity, but I believe getting this case right is very hard. Is there a way to help the compiler?
  • What happens in the setup_k_with_if() version? Apparently the code is somehow specialized to the case size(work,1)==3? EDIT: I should rather ask: the if somehow has negligible performance impact, which implies that size(work,1) is not executed every time the function is called? My previous experience with Fortran would suggest never putting an if-else in a loop like this, because it would be a performance killer. So this was a big surprise. Can someone explain?
  • IMHO the clean way would of course be to use multiple dispatch, with a version specialized to size(work,1)==3 and one for the general case. But I believe it is not possible to dispatch on the actual size of array dimensions?

Best Regards

Christof

The compiler is apparently not able to infer size(work,1) at compile time and specialize the code so that the size(work,1) calls are no longer needed. That looks like a missed optimization opportunity, but I believe getting this case right is very hard. Is there a way to help the compiler?

How would the compiler know its size?
It is possible if you have a function that creates work as a 3×11 matrix (3 and 11 being constants within the function), so that constant propagation can propagate 3 and 11. I think these optimizations are all implemented? Others know this better than I do.
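
If you want to hand the compiler the size explicitly while staying in the stdlib, one option (a sketch of mine with a hypothetical name, not something benchmarked in this thread) is to lift the size into the type domain with Val, so that a specialized method gets compiled per size:

@inline function setup_k_val(k::Int, xalt, work, coef, factor, ::Val{N}) where {N}
    # N is a compile-time constant here, so the loop bounds are fixed
    work[:,k] .= 0.0
    for jj = 1:size(coef,1)
        for ii = 1:N
            @inbounds work[ii,k] += coef[jj]*work[ii,jj]
        end
    end
    for ii = 1:N
        @inbounds work[ii,k] = work[ii,k]/factor + xalt[ii]
    end
    nothing
end

# called as, e.g.: setup_k_val(11, xalt, work, coef11, 17640.0, Val(3))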

IMHO the clean way would of course be to use multiple dispatch, with a version specialized to size(work,1)==3 and one for the general case. But I believe it is not possible to dispatch on the actual size of array dimensions?

That is exactly what StaticArrays does.
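
For illustration, a minimal sketch (hypothetical function name): because the dimensions are type parameters of the array, ordinary dispatch can pick a size-specialized method:

using StaticArrays

rows3(::MMatrix{3,N}) where {N} = "specialized for 3 rows"
rows3(::AbstractMatrix)         = "general fallback"

rows3(MMatrix{3,11}(zeros(3,11)))   # -> "specialized for 3 rows"
rows3(zeros(3,11))                  # -> "general fallback"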

My previous experience with Fortran would suggest never putting an if-else in a loop like this, because it would be a performance killer. So this was a big surprise. Can someone explain?

Unless I’m missing something, the “if” function is fastest because the other two are doing an extra

for ii=1:3
    @inbounds work[ii,k]=work[ii,k]/factor+xalt[ii]
end

on top of what the “if” version is doing.
EDIT:
That only made up a small part of the difference. I think the compiler may be skipping some checks if it can confirm the size is 3?
Anyway, if you need convincing to actually give static arrays a try:

julia> using StaticArrays

julia> mwork = MMatrix{3,11}(work);

julia> @btime setup_k_fixed_size(11,$xalt,$mwork,$coef11,17640.0)
  4.718 ns (0 allocations: 0 bytes)

If you really can’t afford the dependency, and all you want is the statically sized work buffer (i.e., it doesn’t need matrix or linear algebra operations), it’s only a few lines of code to implement that.

Also, you should interpolate non-constant arguments while benchmarking:

julia> @btime setup_k_no_if(11,$xalt,$work,$coef11,17640.0)
  30.409 ns (0 allocations: 0 bytes)

julia> @btime setup_k_with_if(11,$xalt,$work,$coef11,17640.0)
  13.230 ns (0 allocations: 0 bytes)

julia> @btime setup_k_fixed_size(11,$xalt,$work,$coef11,17640.0)
  20.218 ns (0 allocations: 0 bytes)

If you don’t interpolate, you are also measuring the cost of dynamic dispatch:

julia> @btime setup_k_no_if(11,xalt,work,coef11,17640.0)
  46.640 ns (0 allocations: 0 bytes)

julia> @btime setup_k_with_if(11,xalt,work,coef11,17640.0)
  43.131 ns (0 allocations: 0 bytes)

julia> @btime setup_k_fixed_size(11,xalt,work,coef11,17640.0)
  43.263 ns (0 allocations: 0 bytes)

Assuming your function that takes 15-20 seconds is type stable, it won’t be doing dynamic dispatches (specifically, just this one dispatch has to be stable), so the first example (with interpolation) is more representative of what is going on.
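
A quick way to check that (a sketch): run @code_warntype on the call and look for non-concrete types (Any, Union{...}) in the output:

julia> @code_warntype setup_k_no_if(11, xalt, work, coef11, 17640.0)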


Hello,

Yes, when preparing the example I deleted those loops by accident. Here is a corrected version, which still shows the speed difference:

using BenchmarkTools

  @inline function setup_k_no_if(k::Int,xalt,work,coef,factor=1.0)

    work[:,k].=0.0
    
    for jj=1:size(coef,1)
      for ii=1:size(work,1)
        @inbounds work[ii,k]=work[ii,k]+coef[jj]*work[ii,jj]
      end
    end

    for ii=1:size(work,1)
      @inbounds work[ii,k]=work[ii,k]/factor+xalt[ii]
    end

    nothing

  end

  @inline function setup_k_with_if(k::Int,xalt,work,coef,factor=1.0)

    work[:,k].=0.0
    
    if (size(work,1)==3)
      for jj=1:size(coef,1)
        for ii=1:3
          @inbounds work[ii,k]=work[ii,k]+coef[jj]*work[ii,jj]
        end
      end

      for ii=1:3
        @inbounds work[ii,k]=work[ii,k]/factor+xalt[ii]
      end

    else
      for jj=1:size(coef,1)
        for ii=1:size(work,1)
          @inbounds work[ii,k]=work[ii,k]+coef[jj]*work[ii,jj]
        end
      end

      for ii=1:size(work,1)
        @inbounds work[ii,k]=work[ii,k]/factor+xalt[ii]
      end
    end

    nothing

  end

  @inline function setup_k_fixed_size(k::Int,xalt,work,coef,factor=1.0)

    work[:,k].=0.0
    
    for jj=1:size(coef,1)
      for ii=1:3
        @inbounds work[ii,k]=work[ii,k]+coef[jj]*work[ii,jj]
      end
    end
    for ii=1:3
      @inbounds work[ii,k]=work[ii,k]/factor+xalt[ii]
    end

    nothing

  end

  work=zeros(3,11)
  work[:,1]=[2.0,1.0,4.0]
  xalt=[1.0,2.0,3.0]
  s21=sqrt(21.0)
  coef11=(0.0,0.0,0.0,0.0,(-42.0+7.0*s21)*20.0,(-18.0+28.0*s21)*8.0,
      (-273.0-53.0*s21)*5.0,(301.0+53.0*s21)*5.0,(28.0-28.0*s21)*8.0,
      (49.0-7.0*s21)*20.0)
  @btime setup_k_no_if(11,$xalt,$work,$coef11,17640.0)
  @btime setup_k_with_if(11,$xalt,$work,$coef11,17640.0)
  @btime setup_k_fixed_size(11,$xalt,$work,$coef11,17640.0)

This does not change the qualitative result

  35.353 ns (0 allocations: 0 bytes)
  16.283 ns (0 allocations: 0 bytes)
  21.681 ns (0 allocations: 0 bytes)

and the actual code (20 s to 15 s runtime) does not contain the bug.

In the calling function (actually two levels up) it is created as work=zeros(3,11). So in principle it could know the size, if it could decide whether work[] ever changes size or not. And it never changes size; it is always accessed in for-loops using work[ii,jj] syntax. That is why I was really surprised that it suddenly could do something when using the if in the function (which gets called in a for-loop).

It is an 8th-order Runge–Kutta solver; only Float64 goes in and only Float64 comes out, as far as I can tell, and @code_warntype is happy on the functions it can handle. At the moment the right-hand side of the ODE has dimension 3, and that is probably the only case I am ever going to use it for. But of course it would be more satisfying to have a nice performant general case and not specialize everything to 3, because, you know, one never knows :slight_smile:

I am aware of the DifferentialEquations package, however it does not work on 0.7.0 yet and the application is pedagogical (but not in computer science).

I do not know what interpolate means in this context. Where could I look this up? Edit: Of course I tried to get the corrected example above right and used the $ form.

Best Regards

Christof

The compiler does many optimizations, and more are being worked on. Apparently this is not one of them. The only guarantee Julia gives is that functions will be specialized on the type, and as Array does not carry its size in the type, no specialized method is compiled for different sizes. (Hence StaticArrays.)
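
You can see the difference directly in the types (a sketch; the exact printed form depends on the version):

julia> typeof(zeros(3,11))          # no size information in the type
Array{Float64,2}

julia> using StaticArrays

julia> typeof(MMatrix{3,11}(zeros(3,11)))   # the sizes are type parameters
MArray{Tuple{3,11},Float64,2,33}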

Edit: You did the right thing: profile/benchmark the code. Now you know and can work around compiler limitations.


Hello,

Yes, I understand this. Now I apparently know how to give it a hint. As I wrote, it was primarily surprise on my side that the cost of the if (size(work,1)==3) check was so low.

Yes, this is of course sensible. It is just me trying to avoid a dependency for not very good reasons.

Best Regards

Christof


Again, I’d really recommend just using StaticArrays; your updated code:

julia> @btime setup_k_no_if(11,$xalt,$work,$coef11,17640.0)
  30.732 ns (0 allocations: 0 bytes)

julia> @btime setup_k_with_if(11,$xalt,$work,$coef11,17640.0)
  14.908 ns (0 allocations: 0 bytes)

julia> @btime setup_k_fixed_size(11,$xalt,$work,$coef11,17640.0)
  20.460 ns (0 allocations: 0 bytes)

Versus with StaticArrays:

julia> using StaticArrays

julia> mwork = MMatrix{3,11}(work);

julia> @btime setup_k_no_if(11,$xalt,$mwork,$coef11,17640.0)
  15.078 ns (0 allocations: 0 bytes)

julia> @btime setup_k_with_if(11,$xalt,$mwork,$coef11,17640.0)
  6.823 ns (0 allocations: 0 bytes)

julia> @btime setup_k_fixed_size(11,$xalt,$mwork,$coef11,17640.0)
  6.763 ns (0 allocations: 0 bytes)

At least 2x faster.

In the calling function (actually two levels up) it is created as work=zeros(3,11). So in principle it could know the size, if it could decide whether work[] ever changes size or not. And it never changes size; it is always accessed in for-loops using work[ii,jj] syntax. That is why I was really surprised that it suddenly could do something when using the if in the function (which gets called in a for-loop).

Correct me if I’m wrong, but I think Fortran arrays automatically “switch” (at compile time) between being static/dynamic arrays, passing size information if it’s available, following the “assumed shape” convention.
That does sound like a convenient optimization…

I am aware of the DifferentialEquations package, however it does not work on 0.7.0 yet and the application is pedagogical (but not in computer science).

I really want to study DifferentialEquations one day. The primary author behind it is brilliant, and many consider that ecosystem one of the highlights of Julia.

I do not know what interpolate means in this context. Where could I look this up? Edit: Of course I tried to get the corrected example above right and used the $ form.

Yes, your updated benchmarks are good.

Interpolating most commonly means inserting into a string, an expression, or a command; I might be missing more.
In this case, it means interpolating into the expression. A macro “sees” the expression that follows, can do arbitrary manipulations, and then spits something out. What it spits out gets compiled and run. It can make a lot of APIs more convenient.
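
A small sketch of the three kinds of interpolation just mentioned:

x = 42
s  = "value is $x"   # string interpolation: "value is 42"
ex = :(1 + $x)       # expression interpolation: :(1 + 42)
c  = `echo $x`       # command interpolation: `echo 42`

With @btime, the $ splices the current value into the benchmarked expression, so the variable is not treated as an untyped global.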

It is just me trying to avoid a dependency for not very good reasons.

Okay…



mutable struct WorkBuffer{M,N,T,L} <: AbstractMatrix{T}
    data::NTuple{L,T}
    WorkBuffer{M,N,T,L}() where {T,M,N,L} = new()
    @generated WorkBuffer{M,N,T}() where {M,N,T} = :(WorkBuffer{$M,$N,$T,$(M*N)}())
    @generated function WorkBuffer{M,N}(data::AbstractMatrix{T}) where {T,M,N}
        quote
            @boundscheck size(data) == ($M,$N) || throw(BoundsError())
            out = WorkBuffer{$M,$N,$T,$(M*N)}()
            @inbounds for i ∈ 1:$(M*N)
                out[i] = data[i]
            end
            out
        end
    end
end

Base.size(::WorkBuffer{M,N}) where {M,N} = (M,N)
Base.length(::WorkBuffer{M,N,T,L}) where {M,N,T,L} = L

@inline function Base.pointer(w::WorkBuffer{M,N,T}) where {M,N,T}
    Base.unsafe_convert(Ptr{T}, pointer_from_objref(w))
end
Base.@propagate_inbounds function Base.getindex(w::WorkBuffer{M,N,T,L}, i) where {M,N,T,L}
    @boundscheck i > L && throw(BoundsError())
    unsafe_load(pointer(w), i)
end
Base.@propagate_inbounds function Base.getindex(w::WorkBuffer{M,N}, i, j) where {M,N}
    @boundscheck (i > M || j > N) && throw(BoundsError())
    unsafe_load(pointer(w), i + M*(j-1))
end
Base.@propagate_inbounds function Base.setindex!(w::WorkBuffer{M,N,T,L}, v, i) where {M,N,T,L}
    @boundscheck i > L && throw(BoundsError())
    unsafe_store!(pointer(w), v, i)
    v
end
Base.@propagate_inbounds function Base.setindex!(w::WorkBuffer{M,N}, v, i, j) where {M,N}
    @boundscheck (i > M || j > N) && throw(BoundsError())
    unsafe_store!(pointer(w), v, i + M*(j-1))
    v
end

You’d probably want to add more convenience functions (and make sure they’re type stable).
But, this is enough for:

julia> workb = WorkBuffer{3,11}(work);

julia> @btime setup_k_no_if(11,$xalt,$workb,$coef11,17640.0)
  6.863 ns (0 allocations: 0 bytes)

julia> @btime setup_k_with_if(11,$xalt,$workb,$coef11,17640.0)
  6.833 ns (0 allocations: 0 bytes)

julia> @btime setup_k_fixed_size(11,$xalt,$workb,$coef11,17640.0)
  6.863 ns (0 allocations: 0 bytes)

Hello,

I would not have expected less :slight_smile: Joking aside, I agree I should really do it (as soon as 0.7 is released and everything in the ecosystem stabilizes, anyway).

I only used Fortran without understanding much about the compiler internals. I strongly suspect the right compiler can do this in principle (the f77 legacy is still there, and without capabilities like that f90 would still be at a severe disadvantage, I guess).

The goal we have in homebrewing the RK solvers is to show once what is really used behind the scenes (i.e. demonstrate the application of the numerical approach to a given problem in a simple way; performance is nice). So the code tries to use only general imperative language basics (I am already straying from that). Therefore it is certainly a terrible example of Julia.
But of course showing how much more convenient the DifferentialEquations routines are is a must, and will hopefully make up for our use of the language. Also, it will be interesting to see the performance a general implementation can achieve when written by people with a deep understanding of the language.

Ah, so it enables passing the value to the macro, which then determines and fixes the types somehow to avoid running the full dispatch code, I guess.

I see your example with the updated routines. I will try to understand it (and of course modify it so that it fits with the rest of the program). I have seen the building blocks before and roughly know what they do, but for the whole picture I will need to learn more Julia. I thought macros like @boundscheck would incur a large performance penalty, but obviously I am mistaken (otherwise DifferentialEquations would not be possible). Your benchmark data is proof of that.

Thank you very much for your effort!

Best regards

Christof

I only used Fortran without understanding much about the compiler internals. I strongly suspect the right compiler can do this in principle (the f77 legacy is still there, and without capabilities like that f90 would still be at a severe disadvantage, I guess).

In the case of assumed-shape arrays, those were introduced in f90. I don’t know much about f77, nor about compiler internals. But I think Fortran compilers benefit tremendously from sharing a backend with much more popular – and highly optimized – C and C++ compilers (e.g., gcc and g++ in the case of gfortran).

The goal we have in homebrewing the RK solvers is to show once what is really used behind the scenes (i.e. demonstrate the application of the numerical approach to a given problem in a simple way; performance is nice). So the code tries to use only general imperative language basics (I am already straying from that). Therefore it is certainly a terrible example of Julia.

The “WorkBuffer” code only uses base Julia, to get around that restriction. Defining new types is standard practice, although the code might not be “basic” in the sense of it being something people starting with the language are likely to write (unless they come from C/C++?).

I see your example with the updated routines. I will try to understand it (and of course modify it so that it fits with the rest of the program).

The only update was defining a matrix type parameterized by size, so that you don’t need to add a StaticArrays dependency.
To understand the example, the first thing I’d recommend reading about is interfaces (in particular, for AbstractArrays):
https://docs.julialang.org/en/latest/manual/interfaces/#man-interface-array-1
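
As a minimal illustration of that interface (a sketch of mine with hypothetical names): defining size and getindex is already enough for a lot of generic code to work on a new array type.

struct Ones{T} <: AbstractVector{T}
    len::Int
end
Base.size(o::Ones) = (o.len,)
Base.getindex(o::Ones{T}, i::Int) where {T} = one(T)

sum(Ones{Float64}(5))   # 5.0, via the generic AbstractArray fallbacks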

Some more of the interface could probably be implemented, for example:

Base.IndexStyle(::Type{<:WorkBuffer}) = IndexLinear()

but if you’re only indexing it with work[ii,jj], that’s unnecessary. It would probably increase performance if your code used for i in eachindex(work).
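
For example (a sketch), a generic loop written like this would then iterate over the WorkBuffer with fast linear indices:

function sumall(work::AbstractMatrix)
    s = zero(eltype(work))
    for i in eachindex(work)   # linear indices when IndexStyle is IndexLinear
        @inbounds s += work[i]
    end
    s
end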

I thought macros like @boundscheck would incur a large performance penalty, but obviously I am mistaken

Macros don’t incur performance penalties. They’re basically equivalent to having written whatever it is they transform your code to, instead.
The @boundscheck macro is used to say that part of your code is performing a bounds check. That way, when you use @inbounds, it “knows” to skip that part of your code.
Looking at the getindex function:

Base.@propagate_inbounds function Base.getindex(w::WorkBuffer{M,N}, i, j) where {M,N}
    @boundscheck (i > M || j > N) && throw(BoundsError())
    unsafe_load(pointer(w), i + M*(j-1))
end

By default, the code would check if i > M, or if j > N, and throw a BoundsError if either of those are true. But, if you have an @inbounds before the getindex, it’ll ignore everything behind the bounds check.
The @propagate_inbounds is unnecessary whenever the function gets inlined, which it practically always should be. But it doesn’t hurt. You could replace it with an @inline, or leave the macro off altogether.
i + M*(j-1) converts Cartesian coordinates into a linear coordinate, assuming column-major order, which is the standard in Julia (and Fortran); for example, element (2,4) of a 3-row matrix sits at linear position 2 + 3*(4-1) = 11. It then loads that element directly from the pointer (or stores directly, in the case of setindex!).

Also, all three versions of the function are equally fast, because they are now equivalent. The definition of size:

Base.size(::WorkBuffer{M,N}) where {M,N} = (M,N)

is a constant, based only on the type of the WorkBuffer. Therefore, the “if” statement gets resolved at compile time, and all the for loops have fixed bounds.

The assembly (seen via @code_native setup_k_fixed_size(11,xalt,workb,coef11,17640.0)) was identical for all three cases on my computer, when called on the workbuffer.


Modern CPUs execute past branches speculatively and use associative arrays (the hardware version) to remember which way specific branches went in the past. So if you’re benchmarking this with the size always equal to 3, the cost of that branch is effectively nothing.
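
A classic way to see the predictor at work (a sketch, not from this thread; note the compiler may turn the branch into a branch-free select, which would hide the effect):

using BenchmarkTools

function signedsum(v)        # data-dependent branch inside the loop
    s = 0.0
    for x in v
        if x > 0.5
            s += x
        else
            s -= x
        end
    end
    s
end

v = rand(10^6)
@btime signedsum($v)           # branch outcome is effectively random
@btime signedsum($(sort(v)))   # same work; the predictor can learn sorted data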

The Julia compiler could become more aggressive about specializing code on specific Array sizes. In that case, the difference between using a dynamic-size array (Array) and a static-sized array (SArray) would get smaller. Honestly, though, I’m not sure we even want to do that. The way things work now is really simple and predictable: if you want to specialize on array dimensions, use SArrays and specialization is guaranteed; if you expect lots of different array sizes and don’t want to specialize on the array size, then use normal Arrays and non-specialization is guaranteed. There are cases where either one is better, and the end user is generally in the best position to know which case they’re in. As long as libraries are appropriately generic and support general AbstractArrays, this is a great way for things to work. And fortunately, Julia has a long-standing tradition of people writing appropriately generic array code in libraries. Moreover, one of the lovely things about this approach is that all code everywhere, no matter how deep the call, that loops over the dimensions of a generic AbstractArray gets specialized when you use an SArray, even if the author never thought about static arrays or specializing on array sizes.
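
A small sketch of that last point (hypothetical function name): generic AbstractArray code written with no thought of static sizes still gets size-specialized when an SArray is passed in.

using StaticArrays

function colsum(A::AbstractMatrix)   # fully generic; no static sizes mentioned
    s = zero(eltype(A))
    for j = 1:size(A,2), i = 1:size(A,1)
        s += A[i,j]
    end
    s
end

colsum(rand(3,11))            # sizes are runtime values
colsum(@SMatrix rand(3,11))   # sizes are compile-time constants; loops can unroll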


Hello everybody,

I have to defend Fortran here. Sorry for being off-topic, but I will take the bait.

I wrote my first scientific codes with the (DEC) f77 compiler, so I know f77/f90/f03 quite well from a user perspective. I never noticed that the allocatable/assumed-shape arrays in f90 are slower than static-size arrays/assumed-size arrays in f77; otherwise f90 would never have happened, IMO. That is what I was referring to. Also, back then number crunching in C was more of a curiosity (well-written C was and is competitive) and no one took C++ seriously; it was painfully slow. So highly optimizing C/C++ compilers as a base for performant Fortran was not always a given (I suspect they shared backends even then).

Today we switch languages by choosing the frontend, reusing the middle end and backend, making life easier for the compiler developers. But the pointer-aliasing problem in C/C++ is still a potential issue which Fortran avoids. So it is more about code reuse in the compiler; the language performance happens at the higher level in my experience (and is nowadays mostly comparable).

Well, the target audience is third-year physics students (introduction to computational physics), not computer science students. So we want/need to stick to basic programming, but want something which can easily be made faster than average interpreted Python for some of the problems. Julia already does a great job for this, I think. But I am overshooting the target and trying to learn how to get really fast code, because I cannot resist.

Exactly, eventually more complex code to execute :slight_smile:

The @generated in the matrix type is what I do not really understand yet; I need to learn this from the documentation. The extension of the Base getindex/setindex! methods is something I mostly understand, but I will not fully understand it without having understood the type.

OK. So I am not quite up to date, it seems. I have not written production numeric code for quite some years now, and avoiding ifs like that was always possible (and I believe it is reasonable style).

Yes, I can understand that view. As remarked earlier, it was my surprise that the if somehow triggered optimization which led to this thread. I mean, the compiler apparently cannot figure out the fixed size. And then you try size(work,1)==3 (with an f90 code this would be absurd) and suddenly the compiler gets it? That is strange (to me at least).

Now, I fear of course that this is a side effect specific to a given Julia release. Perhaps it would be useful to be able to give the compiler a hint in a standardized way? I do not really know.

Edit: I am not thinking about more aggressive automatic optimization here. But if the compiler can and does optimize (see the example above), being able to give it a hint somehow that you would prefer the code to be optimized might be useful (or not). I understand the point about reserving optimization on size for static arrays or another specialized type. This makes the rules clear and avoids confusion and complication. But currently it does not play by those rules (yet).

Best Regards

Christof

I never noticed that the allocatable/assumed-shape arrays in f90 are slower than static-size arrays/assumed-size arrays in f77

Are they? I’m surprised; I’d have thought dimensions would be checked at compile time, when possible.
Maybe this was once true, but not anymore?

That is what I was referring to. Also, back then number crunching in C was more of a curiosity (well-written C was and is competitive) and no one took C++ seriously; it was painfully slow. So highly optimizing C/C++ compilers as a base for performant Fortran was not always a given (I suspect they shared backends even then).

Today we switch languages by choosing the frontend, reusing the middle end and backend, making life easier for the compiler developers. But the pointer-aliasing problem in C/C++ is still a potential issue which Fortran avoids. So it is more about code reuse in the compiler; the language performance happens at the higher level in my experience (and is nowadays mostly comparable).

C and C++ are normally vectorized fine, despite the lack of aliasing guarantees.
On the other hand, I also noticed an example where the Flang Fortran compiler acted as though there were possible aliasing, and generated bad code: flang vs. GCC (gfortran) 7.3 performance · Issue #504 · flang-compiler/flang · GitHub. Maybe that’s immaturity in the front end, though.
F/Clang(++) vs gcc makes a much bigger difference than C vs C++ vs Fortran in my experimenting, when code is written the same way. I would have said LLVM vs gcc, but I’ve had a much easier time getting Julia to do what I want.
There’s bias there, in that I’m most familiar with Julia so my mental models of what works are based mostly on experiences with Julia, but I think Julia’s predictability helps.

In my experimenting with matrix multiplication (where I did get single-threaded performance in Julia to roughly tie or beat OpenBLAS, see “@inbounds: is the compiler now so smart that this is no longer necessary? - #32 by Elrod”, even though I haven’t gotten around to implementing some basic memory optimizations [namely, recursion]), I wanted to try Fortran to get an idea of the behavior of more than one compiler. Through iterations of the compute kernels, it was striking that sometimes Flang failed to optimize correctly, and sometimes gfortran did. But Julia always generated the optimal assembly (under the constraint of how that kernel was written).
My suspicion is that Fortran’s “everything is a reference” risks tripping up the compiler into not eliminating what should be temporary variables, and transferring memory in and out of CPU registers more than necessary.
Julia’s clear distinction between static, immutable structs and heap-allocated arrays helps.

Although in some other cases, the fact that Fortran’s arrays are generally stack-allocated helps Fortran.

Maybe C and C++ would be more like Julia. I chose Fortran as the comparison because I didn’t want to deal with row-major, 0-indexed arrays, and it wasn’t hard to write a Julia → Fortran transpiler for simple kernel expressions, while C/C++ syntax is much less compatible.
But I’m curious enough to manually translate a few kernels to see how C and C++ behave.

Anyway, the gcc manual lists the optimizations it can perform:

That is a massive list of optimizations – applied to each of the languages. (-fstack-arrays is Fortran-specific, but statically sized arrays in C/C++ are stored on the stack regardless of flag).

This was my point. A huge number of optimizations, related to vectorization among other things, almost all shared by the languages. Because many are recent, and newer versions of gcc tend to out-benchmark old ones (on Phoronix), I think it’s safe to say that any language not enjoying those improvements would fall behind. And that those improvements are probably mostly motivated by the popularity of C and C++.

The case is similar for LLVM, where it’s probably mostly interest in Clang and Clang++.
But I’m glad! All the work they’ve done to make LLVM great has made Julia possible.

Exactly, eventually more complex code to execute

Or simpler! If you’re doing things at compile time, you aren’t doing them at runtime ;).
Like the @boundscheck and everything following it completely disappearing when you use @inbounds (which also disappears, after taking the boundschecks with it).
Similarly, the generated functions multiply M and N at compile time.

The @generated in the matrix type is what I do not really understand yet; I need to learn this from the documentation. The extension of the Base getindex/setindex! methods is something I mostly understand, but I will not fully understand it without having understood the type.

For the static matrix, the total number of elements has to be known. I guess I could have done it as data::NTuple{N,NTuple{M,T}}, and then I wouldn’t have needed the L. That said, it isn’t nice to specify redundant info, and you do want things type stable. So I used a generated function to “generate” a function giving the correct L (total number of elements), given M (number of rows) and N (number of columns).
When you write a generated function, you normally return an expression. That expression then gets treated as though it were the actual function, and is then compiled and run.
So you can then take advantage of whatever compile-time info you’d like in building the expression you want to actually run. In this case, uncreatively, I just used the M and N known at compile time to insert (M*N) into the expression, rather than calculating it at run time. That is something the compiler may have done anyway on 0.7, but because L is part of the type, I wanted to guarantee type stability.
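
A stripped-down sketch of the same idea (hypothetical names):

struct Sized{M,N} end

# The body runs at "generation time", where only the types (and thus M and N)
# are known; M*N is computed once and spliced into the returned expression.
@generated total_len(::Sized{M,N}) where {M,N} = :( $(M * N) )

total_len(Sized{3,11}())   # 33, with no multiplication at run time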

I’d recommend the documentation on meta-programming if you want to learn more about expressions, macros, and generated functions:
https://docs.julialang.org/en/latest/manual/metaprogramming/

Hello,

OK, after sleeping on this I see that I did not fully understand what you wrote. The compiler optimization does not kick in suddenly; the branch in the code gets created by the if anyway, and branching simply does not cost anything anymore. It is in some sense a coincidence: two different mechanisms at two separate levels of the execution work well together. Stupid me!

Yes, well-written C (and C++, once the language, the libraries and the programmers matured) was and is competitive. There is no question about that.

Compared to current gfortran and ifort, flang is very buggy and incomplete in our limited experience; we could not get our in-house science code to compile with it. On a big installation you have either icc/ifort or, I suspect, the IBM compilers (xlf) anyway, which optimize aggressively and are paired with nicely optimized math libraries, and you have MPIs for the high-performance interconnect etc. which you can link against with these compilers. You probably find a gcc/gfortran or Python module too, but why bother with flang if you have a compute job to run?

I read that thread, but the language usage (@generated) is a bit beyond my capabilities right now.

gfortran (or any other language listed on gcc.gnu.org) is just a frontend for the same middle end and backend? The backend improvements are motivated by the popularity of the GNU Compiler Collection (gcc) as a whole, regardless of frontend language, IMO. The frontends produce intermediate code which contains optimization hints for the later stages, in my understanding. A badly written frontend might fail to pass hints. But after the frontend has done its job, “language” in the sense of C/C++/Fortran rapidly loses its meaning? So that is why I object to the exact wording of your statement.

I already have it open, thank you. These are new concepts for me, and I now understand that I will have to learn them to be efficient.

Best Regards

Christof