Moving from `Float64` to `Float32` not improving performance

Realize also that the whole vectorized style is probably leading you astray here.

If I’m not mistaken, it looks to me like nearly the whole body of the `while t < end_t` loop can be written as a single loop, and that several of the temporary arrays (like `surface_gradient_s` and `new_thick`) can be eliminated entirely. Not only will the code probably be significantly faster, but I’m betting it will also be clearer.
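For example, here is a rough sketch of what I mean — the names (`thick`, `surface`, `dx`, `dt`, `D`) and the update rule are placeholders I made up, not your actual script:

```julia
# Vectorized style: two temporary arrays allocated every time step.
# surface_gradient_s = (surface[2:end] .- surface[1:end-1]) ./ dx
# new_thick          = thick[1:end-1] .+ dt .* D .* surface_gradient_s
# thick[1:end-1]    .= new_thick

# Fused version: one loop, updated in place, no temporaries.
function step!(thick, surface, dx, dt, D)
    @inbounds for i in 1:length(thick)-1
        s = (surface[i+1] - surface[i]) / dx   # was surface_gradient_s[i]
        thick[i] += dt * D * s                 # was new_thick[i]
    end
    return thick
end
```

The same idea applies to the rest of the `while t < end_t` body: every expression that reads and writes the same index can live in the same loop.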

Matlab/Numpy really trains people in a contorted style of programming.

6 Likes

LoopVectorization.jl targets Intel’s AVX, i.e. SIMD registers, and with Float32 (i.e. lower precision) you effectively get more operations, which I think is the point of @turbo. You execute just as many instructions, but each register holds twice as many operands, so in effect you do twice as many operations.

And even if not, when you’re memory-bound rather than CPU-bound, the halved memory usage is what matters.
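To make that concrete, here is a minimal sketch (the kernel and array size are mine, not from the original code). With 256-bit AVX2 registers, @turbo processes 8 Float32 lanes per vector instruction versus 4 Float64 lanes, and the Float32 arrays also move half the bytes through memory:

```julia
using LoopVectorization

# Simple memory-bound kernel, y .= a .* x .+ y, written as an explicit loop.
function axpy!(y, a, x)
    @turbo for i in eachindex(x)
        y[i] = a * x[i] + y[i]
    end
    return y
end

n = 10^6
x32, y32 = rand(Float32, n), rand(Float32, n)   # half the bytes of the Float64 arrays
x64, y64 = rand(Float64, n), rand(Float64, n)

axpy!(y32, 2.0f0, x32)   # 8 Float32 lanes per 256-bit register
axpy!(y64, 2.0, x64)     # 4 Float64 lanes per 256-bit register
```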

And you really should look towards a future where there is no Float64, as in Intel’s AMX instructions (EDIT: I mean in this new optimized instruction set; Float64 would still exist as a fallback in the default instruction set), and not even Float32 except for the accumulators:

https://www.corsix.org/content/contrasting-intel-amx-and-apple-amx

Intel has an x64 instruction set extension called AMX, meanwhile Apple has a totally different aarch64 instruction set extension also called AMX.
[…]
Intel AMX: Multiplicands are always 32-bit, either i8[4] or u8[4] or bf16[2], combined via dot product. Accumulators are always 32-bit, either i32 or u32 or f32.

Apple AMX: Multiplicands are scalars, any of i8 / u8 / i16 / u16 / f16 / f32 / f64. Accumulators are any of i16 / u16 / i32 / u32 / f16 / f32 / f64. Note f16 (i.e. IEEE 754 half-precision with 5-bit exponent and 10-bit fraction) rather than bf16 (8-bit exponent, 7-bit fraction).

A bit off-topic here, but my earlier “At best Float32 will give you a 2x speedup on CPUs” might be wrong. At least I see the following (more than 2x faster for half the size is likely an artifact of neural networks):

INT4 Precision Can Bring an Additional 59% Speedup Compared to INT8

INT8 provides better performance with comparable precision than floating point for AI inference. But when INT8 is unable to meet the desired performance with limited resources, INT4 optimization is the answer. This INT4 optimization achieves up to a 77% performance boost on real hardware in comparison with the current INT8 solution.

Float32 can also help with cache locality: at certain array sizes a Float64 array no longer fits in a given cache level, pushing accesses from L1 to L2, or from L2 to L3, while the Float32 version of the same array still fits in the smaller, faster cache.
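As a rough illustration (the array length and the 512 KiB L2 figure below are assumptions about typical hardware, not measurements):

```julia
n = 100_000
println("Float64: ", n * sizeof(Float64) ÷ 1024, " KiB")  # ≈ 781 KiB: likely spills a 512 KiB L2
println("Float32: ", n * sizeof(Float32) ÷ 1024, " KiB")  # ≈ 390 KiB: may still fit in that L2
```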

3 Likes

Float64 isn’t going anywhere. That said, HPC is starting to augment Float64 with lower precisions in some places. One example of this is iterative methods. Traditionally, all the computation would be done in full Float64 precision, but recently there have been good results demonstrated from doing initial steps in lower precision and moving to higher precision as the error decreases.
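A minimal sketch of that idea (iterative refinement on a dense solve; the names, sizes, and tolerance are mine, not from any particular package):

```julia
using LinearAlgebra

# Factorize once in Float32, then accumulate Float64 corrections.
function mixed_precision_solve(A::Matrix{Float64}, b::Vector{Float64}; maxiter=10, tol=1e-12)
    F = lu(Float32.(A))                  # cheap low-precision factorization
    x = Float64.(F \ Float32.(b))        # low-precision initial guess
    for _ in 1:maxiter
        r = b - A * x                    # residual in full Float64 precision
        norm(r) <= tol * norm(b) && break
        x += Float64.(F \ Float32.(r))   # low-precision correction step
    end
    return x
end

A = rand(500, 500) + 50I                 # well-conditioned test matrix
b = rand(500)
x = mixed_precision_solve(A, b)
```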

2 Likes

Matlab/Numpy really trains people in a contorted style of programming.

Just for clarification, this little experiment started as a translation of a Python script, just for fun 🙂, which explains why many parts still look quite a lot like Python.

I already managed to get a 30x speedup, so I’m more than happy with that for my needs. Readability is also important in my context (i.e. for other Earth sciences users), so some of the suggested improvements are just a bit too convoluted to make sense for this case.

Anyway, thanks everybody for the feedback. I really learned quite a lot from this.

3 Likes

For me, AMX is not the future until it supports Float64 or some type with comparable properties.