The cost of size() in for-loops

I never noticed that the allocatable/assumed-shape arrays in F90 are slower than static-size/assumed-size arrays in F77

Are they? I’m surprised; I’d have thought dimensions would be checked at compile time, when possible?
Maybe this was once true, but not anymore?

That is what I was referring to. Also, back then number crunching in C was more of a curiosity (well-written C was and is competitive) and no one took C++ seriously; it was painfully slow. So highly optimizing C/C++ compilers as the base for performant Fortran was not always a given (I suspect they shared backends even then).

Today we are switching languages by choosing the frontend, re-using the middle and back ends, which makes life easier for compiler developers. But the pointer-aliasing problem in C/C++ is still a potential issue, one that Fortran avoids. So it is more about code reuse in the compiler; in my experience the language performance is decided at the higher levels (and is nowadays mostly comparable).

C and C++ are normally vectorized fine, despite the lack of aliasing guarantees.
On the other hand, I also noticed an example where the Flang Fortran compiler acted as though there were possible aliasing and generated bad code: flang vs. GCC (gfortran) 7.3 performance · Issue #504 · flang-compiler/flang · GitHub. Maybe that’s immaturity in the front end, though.
Flang/Clang(++) vs gcc makes a much bigger difference than C vs C++ vs Fortran in my experimenting, when the code is written the same way. I would have said LLVM vs gcc, but I’ve had a much easier time getting Julia to do what I want.
There’s bias there, in that I’m most familiar with Julia so my mental models of what works are based mostly on experiences with Julia, but I think Julia’s predictability helps.

In my experimenting with matrix multiplication (where I did get single-threaded performance in Julia to roughly tie or beat OpenBLAS: @inbounds: is the compiler now so smart that this is no longer necessary? - #32 by Elrod, even though I haven’t gotten around to implementing some basic memory optimizations [namely, recursion]), I wanted to try Fortran to get an idea of the behavior of more than one compiler. Through iterations of the compute kernels, it was striking that sometimes Flang failed to optimize correctly, and sometimes gfortran did. But Julia always generated the optimal assembly (under the constraint of how that kernel was written).
My suspicion is that Fortran’s “everything is a reference” risks tripping up the compiler into not eliminating what should be temporary variables, and into moving data in and out of CPU registers more than necessary.
Julia’s clear distinction between static, immutable structs and heap-allocated arrays helps.
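To make that concrete, here is a minimal sketch (made-up names, not code from the thread) of the distinction: an immutable struct is a plain value the compiler can keep entirely in registers, while a Vector is heap allocated and reached through a pointer.

```julia
# Hypothetical example: an immutable struct is a value, so the compiler can
# keep it in registers and never needs to touch memory for it.
struct Vec2
    x::Float64
    y::Float64
end

add(a::Vec2, b::Vec2) = Vec2(a.x + b.x, a.y + b.y)  # pure register arithmetic

# A Vector is heap allocated and mutable; element access goes through a pointer,
# so the compiler has to reason about when values may stay in registers.
v = [1.0, 2.0]
v[1] += 3.0
```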

Although in some other cases, the fact that Fortran’s arrays are generally stack allocated helps Fortran.

Maybe C and C++ would be more like Julia. I chose Fortran as the comparison because I didn’t want to deal with row-major, 0-indexed arrays, and it wasn’t hard to write a Julia → Fortran transpiler for simple kernel expressions, while C/C++ syntax is much less compatible.
But I’m curious enough to manually translate a few kernels to see how C and C++ behave.

Anyway, the gcc manual lists the optimizations it can perform:

That is a massive list of optimizations, applied to each of the languages. (-fstack-arrays is Fortran-specific, but statically sized arrays in C/C++ are stored on the stack without any flag.)

This was my point: a huge number of optimizations, related to vectorization among other things, and almost all shared by the languages. Because many are recent, and newer versions of gcc tend to out-benchmark older ones (on Phoronix), I think it’s safe to say that any language not enjoying those improvements would fall behind, and that those improvements are probably mostly motivated by the popularity of C and C++.

The case is similar for LLVM, where it’s probably mostly interest in Clang and Clang++ driving the work.
But I’m glad! All the work they’ve done to make LLVM great has made Julia possible.

Exactly, eventually more complex code to execute.

Or simpler! If you’re doing things at compile time, you aren’t doing them at runtime ;).
Like the @boundscheck and everything following it completely disappearing when you use @inbounds (which also disappears, after taking the bounds checks with it).
Similarly, the generated functions multiply M and N at compile time.
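Roughly, the pattern looks like this (a sketch with hypothetical names, not the code under discussion): the check lives in a @boundscheck block inside getindex, and a caller’s @inbounds removes it entirely.

```julia
struct MyVec{T}
    data::Vector{T}
end

# @propagate_inbounds lets the caller's @inbounds reach into this method,
# at which point the whole @boundscheck block is compiled away.
Base.@propagate_inbounds function Base.getindex(v::MyVec, i::Int)
    @boundscheck checkbounds(v.data, i)
    return @inbounds v.data[i]
end

function mysum(v::MyVec{Float64})
    s = 0.0
    for i in eachindex(v.data)
        s += @inbounds v[i]   # no bounds check left in the compiled loop
    end
    return s
end
```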

The @generated in the matrix type is what I do not really understand yet; I need to learn this from the documentation. The extensions of the Base getindex/setindex! methods are something I mostly understand, but will not fully understand without having understood the type.

For the static matrix, the total number of elements has to be known. I guess I could have done it as data::NTuple{N,NTuple{M,T}}, and then I wouldn’t have needed the L. That said, it isn’t nice to specify redundant info, and you do want things type stable. So I used a generated function to “generate” a function giving the correct L (total elements), given M (number of rows) and N (number of columns).
When you write a generated function, you normally return an expression. That expression then gets treated as though it were the body of the actual function, and is then compiled and run.
So you can take advantage of whatever compile-time info you’d like in building the expression you want to actually run. In this case, uncreatively, I just used the M and N known at compile time to insert (M*N) into the expression, rather than calculating it at runtime. That’s something the compiler may have done anyway on 0.7, but because L is part of the type, I wanted to guarantee type stability.
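A minimal sketch of what that could look like (the names and layout here are my guesses, not the actual code from the thread): L is carried as a type parameter so the NTuple length is concrete, and the generated function splices the literal M*N into the expression it returns instead of multiplying at runtime.

```julia
struct StaticMatrix{M,N,T,L}
    data::NTuple{L,T}   # column-major storage, L == M*N
end

# The body runs with M and N known as compile-time values; the expression it
# returns already contains the literal product, so nothing is computed at runtime.
@generated function num_elements(::Type{<:StaticMatrix{M,N}}) where {M,N}
    return :($(M * N))
end

# Column-major linear indexing into the tuple.
Base.getindex(A::StaticMatrix{M,N}, i::Int, j::Int) where {M,N} = A.data[(j - 1) * M + i]
```

With something like that, a value such as StaticMatrix{2,3,Float64,6}((1.0, 2.0, 3.0, 4.0, 5.0, 6.0)) stays fully inferable, and num_elements(typeof(A)) is a compile-time constant.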

I’d recommend the documentation on meta-programming if you want to learn more about expressions, macros, and generated functions:
https://docs.julialang.org/en/latest/manual/metaprogramming/