Realize also that the whole vectorized style is probably leading you astray here.

If I’m not mistaken, nearly the whole body of the `while t < end_t` loop can be written as a single loop, and several of the temporary arrays (like `surface_gradient_s` and `new_thick`) can be eliminated entirely. Not only will the code probably be significantly faster, I’m betting it will also be clearer.
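To illustrate the idea (a minimal sketch only — the variable names `thick`, `surface`, `dx`, `dt` are placeholders standing in for the original script’s arrays, and the physics here is a generic 1-D update, not the actual model): instead of building `surface_gradient_s` and `new_thick` as separate broadcast expressions, the whole update can be fused into one explicit loop that writes in place.

```julia
# Hypothetical devectorized update. In the vectorized style this might be:
#   surface_gradient_s = diff(surface) ./ dx        # temporary array
#   new_thick = thick .+ dt .* surface_gradient_s   # another temporary
#   thick = new_thick
# Fused into a single in-place loop, no temporaries are allocated:
function step!(thick::Vector{Float64}, surface::Vector{Float64}, dx, dt)
    @inbounds for i in 2:length(thick)-1
        grad = (surface[i+1] - surface[i-1]) / (2dx)  # local gradient, no temp array
        thick[i] += dt * grad                         # update in place, no new_thick
    end
    return thick
end
```

Each iteration computes everything it needs for one grid point, so the data stays in registers instead of round-tripping through freshly allocated arrays.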

Matlab/Numpy really trains people in a contorted style of programming.

LoopVectorization.jl targets SIMD registers (e.g. Intel’s AVX), and with Float32 (i.e. lower precision) you get more operations in practice; I think that’s the point of @turbo. You execute the same number of instructions, but each register holds twice as many operands, so in effect you do twice the work.

And even if not, when you’re memory-bound, not CPU-bound, the memory usage is important.
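The register-width arithmetic can be made concrete (assuming 256-bit AVX registers; AVX-512 doubles these lane counts, SSE halves them):

```julia
# Lanes per 256-bit SIMD register: same instruction count, twice the operands.
lanes_f64 = 256 ÷ (8 * sizeof(Float64))   # 4 Float64 values per register
lanes_f32 = 256 ÷ (8 * sizeof(Float32))   # 8 Float32 values per register

# And on the memory-bound side: the same data in Float32 moves half the bytes.
a64 = rand(Float64, 1_000_000)
a32 = Float32.(a64)
sizeof(a64), sizeof(a32)                  # 8_000_000 vs 4_000_000 bytes
```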

And you really should look towards the future, where there is no Float64 (EDIT: I mean in these new optimized instruction sets; it would still exist as a fallback in the default set), as in Intel’s AMX instructions (which don’t even have Float32, except for the accumulator).

Intel has an x64 instruction set extension called AMX, meanwhile Apple has a totally different aarch64 instruction set extension also called AMX.
[…]
Intel AMX: Multiplicands are always 32-bit, either i8[4] or u8[4] or bf16[2], combined via dot product. Accumulators are always 32-bit, either i32 or u32 or f32.

Apple AMX: Multiplicands are scalars, any of i8 / u8 / i16 / u16 / f16 / f32 / f64. Accumulators are any of i16 / u16 / i32 / u32 / f16 / f32 / f64. Note f16 (i.e. IEEE 754 half-precision with 5-bit exponent and 10-bit fraction) rather than bf16 (8-bit exponent, 7-bit fraction).

A bit off-topic here, but my “At best Float32 will give you 2x speedup on CPUs” might be wrong; I at least see the following (more than a 2x speedup from halving the size is likely an artifact of neural networks):

> INT4 Precision Can Bring an Additional 59% Speedup Compared to INT8
>
> INT8 provides better performance with comparable precision than floating point for AI inference. But when INT8 is unable to meet the desired performance with limited resources, INT4 optimization is the answer. This INT4 optimization achieves up to a 77% performance boost on real hardware in comparison with the current INT8 solution.

Float32 can also help with cache locality - at various array sizes Float64 arrays may no longer fit in a cache, moving access from L1 to L2, or L2 to L3 caches, while the Float32 version of the same array will still fit in the smaller more localized cache.
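A quick back-of-the-envelope version of that cache argument (illustrative only — cache sizes vary by CPU; 32 KiB is a common L1d figure):

```julia
# For an array of 6000 elements, the Float64 version overflows a 32 KiB L1d
# cache while the Float32 version of the same data still fits in it.
n = 6_000
l1d = 32 * 1024                    # 32 KiB, a typical L1 data cache size
bytes64 = n * sizeof(Float64)      # 48_000 bytes: spills to L2
bytes32 = n * sizeof(Float32)      # 24_000 bytes: stays in L1
bytes64 > l1d, bytes32 <= l1d
```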

Float64 isn’t going anywhere. That said, HPC is starting to augment Float64 with lower precisions in some places. One example of this is iterative methods. Traditionally, all the computation would be done in full Float64 precision, but recently there have been good results demonstrated from doing initial steps in lower precision and moving to higher precision as the error decreases.
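A classic instance of that mixed-precision pattern is iterative refinement for linear systems: factorize once in Float32, then correct the solution with residuals computed in Float64. A minimal sketch (not anyone’s production code — just the textbook scheme; it assumes `A` is well-enough conditioned for the Float32 factorization to be useful):

```julia
using LinearAlgebra

# Mixed-precision iterative refinement: the expensive LU factorization is done
# in Float32; each refinement step computes the residual in full Float64
# precision and applies a cheap low-precision correction.
function refine_solve(A::Matrix{Float64}, b::Vector{Float64}; iters::Int = 3)
    F = lu(Float32.(A))                  # O(n^3) work, done in low precision
    x = Float64.(F \ Float32.(b))        # initial low-precision solve
    for _ in 1:iters
        r = b - A * x                    # residual in full precision
        x += Float64.(F \ Float32.(r))   # low-precision correction step
    end
    return x
end
```

The error shrinks with each correction, so after a few iterations the result is accurate to (near) Float64 precision while most of the floating-point work ran at Float32 speed.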

> Matlab/Numpy really trains people in a contorted style of programming.

Just for clarification, this little experiment started as a translation from a Python script, just for fun. So this explains why many parts still look quite a lot like Python.

I already managed to get a 30x speedup, so I’m more than happy with that for my needs. Readability is also important in my context (i.e. for other Earth sciences users), so some of the suggested improvements are just a bit too convoluted to make sense for this case.

Anyways, thanks everybody for the feedback. I really learnt quite a lot with this.