# Will decreasing the precision of intermediate variables improve performance of code?

I found from a Youtube video that half-precision is significantly faster than single and double precision. [https://www.youtube.com/watch?v=Q9OLOqEhc64&t=64s](2018 | Keynote - Tricks and Tips in Numerical Computing | Nick Higham)

Does it make sense to convert all parameters used in code to the datatype which occupies the least amount of memory? Should I convert the type of any variable to the one that occupies the lowest space to improve performance? (`Int64 -> Int32 -> Int16` or `Float64 -> Float32`)

For example, Consider a code which creates a dataset. If my code accepts floating-point numbers which can be of type Float32, but the final dataset created is of type `Float64`, is it a good idea to keep the format of those integers to `Float32` or `Float64`?

[EDIT]
The origin of this question comes from the non-linear decrease in performance seen in my experiments with making an array of random numbers.

``````@benchmark rand(Int8,10000000)
BenchmarkTools.Trial:
memory estimate:  9.54 MiB
allocs estimate:  2
--------------
minimum time:     2.326 ms (0.00% GC)
median time:      2.510 ms (0.00% GC)
mean time:        2.660 ms (4.29% GC)
maximum time:     15.080 ms (3.74% GC)
--------------
samples:          1872
evals/sample:     1
rand
@benchmark rand(Int16,10000000)
BenchmarkTools.Trial:
memory estimate:  19.07 MiB
allocs estimate:  2
--------------
minimum time:     4.163 ms (0.00% GC)
median time:      7.763 ms (0.00% GC)
mean time:        13.422 ms (8.55% GC)
maximum time:     88.896 ms (92.07% GC)
--------------
samples:          374
evals/sample:     1
rand
@benchmark rand(Int32,10000000)
BenchmarkTools.Trial:
memory estimate:  38.15 MiB
allocs estimate:  2
--------------
minimum time:     47.230 ms (0.00% GC)
median time:      53.894 ms (0.00% GC)
mean time:        54.947 ms (8.60% GC)
maximum time:     126.817 ms (55.10% GC)
--------------
samples:          91
evals/sample:     1
rand
@benchmark rand(Int64,10000000)
BenchmarkTools.Trial:
memory estimate:  76.29 MiB
allocs estimate:  2
--------------
minimum time:     93.666 ms (0.00% GC)
median time:      108.686 ms (11.71% GC)
mean time:        112.493 ms (9.26% GC)
maximum time:     229.811 ms (41.48% GC)
--------------
samples:          45
evals/sample:     1
``````

Smaller precision can improve performance, but can also decrease performance. Not all processors have the same hardware support for all sorts of floats, and some iterstive algorithms need to run longer if the precision is different, so itâs hard to say something in general. What can be said is that for lower-precision numbers, more of them fit into any given level of cache, which is a good thing. If a program spends significant amounts of time doing SIMD operations, those will also be sped up by going to Float32 instead of Float64, but Iâm not sure that holds for float16.

1 Like

I did part of my dissertation with `Float32`, because the data was stored and given to me in that format, and because I could ensure all the most computationaly intense parts were SIMD.

I regret not immediately promoting to `Float64` at the start of the analysis. An endless stream of bugs caused be the low precision, and then a lot of checks needed to be added to look for and address them. A solution that always worked was to reanalyze the same data as `Float64` . It would have definitely saved a lot of human time to do that from the start. Probably would have saved a lot of computer time too, in the end.
(Edit: Cholesky factorization of 3x3 matrices was the most problematic part of the code. Positive definite matrices would routinely fail in single precision.)
I tried to make it work for as long as I did because I was curious precisely about this idea. I donât regret trying, but the lesson from the experience was abundantly clear: use `Float64`.

Of course, this is very work-load dependent. Statistics, solving of equations, factorization, pdes, etc require high precision. Many machine learning algorithms donât, and rather than suffering from convergence issues or critical failures, some even call it âfree regularizationâ.

6 Likes

In my experience the only valid reason so far to switch from `Float64` to `Float32` in very large datasets is to halve the required space (most importantly that of the video memory in GPU-based applications).

1 Like

Data storage: maybe, if storage space is a concern, either in memory or the disk.

Intermediate calculations: only if you have the skills and the time to invest into thinking the error analysis through. This may make sense for climate calculations running on supercomputer clusters, but for your typical numerical calculation that runs on a few cores, it is usually not worth it.

1 Like

Hereâs an example of performances getting worse when performing `Float32` math involving an `Int64` loop variable.

The solution in that case was to explicitly use an `Int32` loop variable:

``````for i in Int32(1):N
...
end
``````

However that `Int32(1)` is rather unpleasant to look at and, notably, very easy to forget with disastrous consequences.

1 Like

Is it possible that positive definite matrices would âroutinely failâ in single precision because the function (in this case, Cholesky factorisation) used in the code chooses tolerence and other parameters w.r.t. a `Float64` type input? I have a feeling that if we type the entire code using `Float32` values, then the performance may improve. For example, since machine learning algorithms typically perform better in `Float32` because the entire training process is written accordingly.

Is the decrease in performance in your code because of implicit type conversion in the code? Also, what are the possible âdisastrous consequencesâ?

No, it just happens that CPUs with a certain instruction sets have a native instruction to convert multiple Int32 into floats in one instruction while they donât have one for Int64.

The disastrous consequences is just a slowdown due to not using SIMD anymore.

1 Like

Here is an example of code that would routinely fail with `Float32` but not `Float64`. The input is the matrix `BPP`. Each row is a separate data point, while columns 1-3 correspend to a vector, and columns 5-10 the upper triangle fo a symmetric matrix `S`. The function solves for `x` in `Lx = y`, where `L * L' = S`:

``````function process_big_prop_points!(X::AbstractMatrix{T}, BPP::AbstractMatrix{T}) where {T}
@fastmath @inbounds for i â 1:size(BPP,1)
y1 = BPP[i,1]
y2 = BPP[i,2]
y3 = BPP[i,3]

R11 = one(T) / sqrt(BPP[i,5])
x1 = R11 * y1
L21 = R11 * BPP[i,6]
L31 = R11 * BPP[i,8]
R22 = one(T) / sqrt(BPP[i,7] - L21*L21)
x2 = R22 * (y2 - L21*x1)
L32 = R22 * (BPP[i,9] - L21 * L31)
R33 = one(T) / sqrt(BPP[i,10] - L31*L31 - L32*L32)

X[i,1] = x1
X[i,2] = x2
X[i,3] = R33 * (y3 - L31*x1 - L32*x2)
end
end
``````

You can see the vectorization easily by looking at the LLVM:

code_llvm

`Float64`:

``````; julia> code_llvm(process_big_prop_points!, Tuple{Matrix{Float64},Matrix{Float64}}, debuginfo=:none)
define nonnull %jl_value_t* @"japi1_process_big_prop_points!_5934"(%jl_value_t* %0, %jl_value_t** %1, i32 %2) #0 {
top:
%3 = alloca %jl_value_t**, align 8
store volatile %jl_value_t** %1, %jl_value_t*** %3, align 8
%4 = getelementptr inbounds %jl_value_t*, %jl_value_t** %1, i64 1
%5 = load %jl_value_t*, %jl_value_t** %4, align 8
%6 = bitcast %jl_value_t* %5 to %jl_value_t**
%7 = getelementptr inbounds %jl_value_t*, %jl_value_t** %6, i64 3
%8 = bitcast %jl_value_t** %7 to i64*
%9 = load i64, i64* %8, align 8
%.inv = icmp sgt i64 %9, 0
%10 = select i1 %.inv, i64 %9, i64 0
br i1 %.inv, label %L13.preheader, label %L64

%11 = load %jl_value_t*, %jl_value_t** %1, align 8
%12 = bitcast %jl_value_t* %5 to double**
%13 = load double*, double** %12, align 8
%14 = shl i64 %9, 1
%15 = shl i64 %9, 2
%16 = mul i64 %9, 5
%17 = mul i64 %9, 7
%18 = mul i64 %9, 6
%19 = shl i64 %9, 3
%20 = mul i64 %9, 9
%21 = bitcast %jl_value_t* %11 to double**
%22 = load double*, double** %21, align 8
%23 = bitcast %jl_value_t* %11 to %jl_value_t**
%24 = getelementptr inbounds %jl_value_t*, %jl_value_t** %23, i64 3
%25 = bitcast %jl_value_t** %24 to i64*
%26 = load i64, i64* %25, align 8
%27 = shl i64 %26, 1
br label %L13

L13:                                              ; preds = %L13, %L13.preheader
%value_phi3 = phi i64 [ %80, %L13 ], [ 1, %L13.preheader ]
%28 = add nsw i64 %value_phi3, -1
%29 = getelementptr inbounds double, double* %13, i64 %28
%30 = load double, double* %29, align 8
%31 = add i64 %28, %9
%32 = getelementptr inbounds double, double* %13, i64 %31
%33 = load double, double* %32, align 8
%34 = add i64 %28, %14
%35 = getelementptr inbounds double, double* %13, i64 %34
%36 = load double, double* %35, align 8
%37 = add i64 %28, %15
%38 = getelementptr inbounds double, double* %13, i64 %37
%39 = load double, double* %38, align 8
%40 = call fast double @llvm.sqrt.f64(double %39)
%41 = fdiv fast double 1.000000e+00, %40
%42 = fmul fast double %41, %30
%43 = add i64 %28, %16
%44 = getelementptr inbounds double, double* %13, i64 %43
%45 = load double, double* %44, align 8
%46 = fmul fast double %45, %41
%47 = add i64 %28, %17
%48 = getelementptr inbounds double, double* %13, i64 %47
%49 = load double, double* %48, align 8
%50 = fmul fast double %49, %41
%51 = add i64 %28, %18
%52 = getelementptr inbounds double, double* %13, i64 %51
%53 = load double, double* %52, align 8
%54 = fmul fast double %46, %46
%55 = fsub fast double %53, %54
%56 = call fast double @llvm.sqrt.f64(double %55)
%57 = fdiv fast double 1.000000e+00, %56
%58 = fmul fast double %46, %42
%59 = fsub fast double %33, %58
%60 = fmul fast double %57, %59
%61 = add i64 %28, %19
%62 = getelementptr inbounds double, double* %13, i64 %61
%63 = load double, double* %62, align 8
%64 = fmul fast double %50, %46
%65 = fsub fast double %63, %64
%66 = fmul fast double %65, %57
%67 = add i64 %28, %20
%68 = getelementptr inbounds double, double* %13, i64 %67
%69 = load double, double* %68, align 8
%.neg = fmul fast double %50, %50
%.neg9 = fmul fast double %66, %66
%70 = fsub fast double %69, %reass.add
%71 = call fast double @llvm.sqrt.f64(double %70)
%72 = getelementptr inbounds double, double* %22, i64 %28
store double %42, double* %72, align 8
%73 = add i64 %28, %26
%74 = getelementptr inbounds double, double* %22, i64 %73
store double %60, double* %74, align 8
%.neg10 = fmul fast double %50, %42
%.neg11 = fmul fast double %66, %60
%75 = fsub fast double %36, %reass.add12
%76 = fdiv fast double %75, %71
%77 = add i64 %28, %27
%78 = getelementptr inbounds double, double* %22, i64 %77
store double %76, double* %78, align 8
%79 = icmp eq i64 %value_phi3, %10
%80 = add nuw i64 %value_phi3, 1
br i1 %79, label %L64, label %L13

L64:                                              ; preds = %L13, %top
ret %jl_value_t* inttoptr (i64 139708209337840 to %jl_value_t*)
}
``````

`Float32`:

``````; define nonnull %jl_value_t* @"japi1_process_big_prop_points!_5935"(%jl_value_t* %0, %jl_value_t** %1, i32 %2) #0 {
top:
%3 = alloca %jl_value_t**, align 8
store volatile %jl_value_t** %1, %jl_value_t*** %3, align 8
%4 = getelementptr inbounds %jl_value_t*, %jl_value_t** %1, i64 1
%5 = load %jl_value_t*, %jl_value_t** %4, align 8
%6 = bitcast %jl_value_t* %5 to %jl_value_t**
%7 = getelementptr inbounds %jl_value_t*, %jl_value_t** %6, i64 3
%8 = bitcast %jl_value_t** %7 to i64*
%9 = load i64, i64* %8, align 8
%.inv = icmp sgt i64 %9, 0
%10 = select i1 %.inv, i64 %9, i64 0
br i1 %.inv, label %L13.preheader, label %L64

%11 = load %jl_value_t*, %jl_value_t** %1, align 8
%12 = bitcast %jl_value_t* %5 to float**
%13 = load float*, float** %12, align 8
%14 = shl i64 %9, 1
%15 = shl i64 %9, 2
%16 = mul i64 %9, 5
%17 = mul i64 %9, 7
%18 = mul i64 %9, 6
%19 = shl i64 %9, 3
%20 = mul i64 %9, 9
%21 = bitcast %jl_value_t* %11 to float**
%22 = load float*, float** %21, align 8
%23 = bitcast %jl_value_t* %11 to %jl_value_t**
%24 = getelementptr inbounds %jl_value_t*, %jl_value_t** %23, i64 3
%25 = bitcast %jl_value_t** %24 to i64*
%26 = load i64, i64* %25, align 8
%27 = shl i64 %26, 1
br label %L13

L13:                                              ; preds = %L13, %L13.preheader
%value_phi3 = phi i64 [ %80, %L13 ], [ 1, %L13.preheader ]
%28 = add nsw i64 %value_phi3, -1
%29 = getelementptr inbounds float, float* %13, i64 %28
%30 = load float, float* %29, align 4
%31 = add i64 %28, %9
%32 = getelementptr inbounds float, float* %13, i64 %31
%33 = load float, float* %32, align 4
%34 = add i64 %28, %14
%35 = getelementptr inbounds float, float* %13, i64 %34
%36 = load float, float* %35, align 4
%37 = add i64 %28, %15
%38 = getelementptr inbounds float, float* %13, i64 %37
%39 = load float, float* %38, align 4
%40 = call fast float @llvm.sqrt.f32(float %39)
%41 = fdiv fast float 1.000000e+00, %40
%42 = fmul fast float %41, %30
%43 = add i64 %28, %16
%44 = getelementptr inbounds float, float* %13, i64 %43
%45 = load float, float* %44, align 4
%46 = fmul fast float %45, %41
%47 = add i64 %28, %17
%48 = getelementptr inbounds float, float* %13, i64 %47
%49 = load float, float* %48, align 4
%50 = fmul fast float %49, %41
%51 = add i64 %28, %18
%52 = getelementptr inbounds float, float* %13, i64 %51
%53 = load float, float* %52, align 4
%54 = fmul fast float %46, %46
%55 = fsub fast float %53, %54
%56 = call fast float @llvm.sqrt.f32(float %55)
%57 = fdiv fast float 1.000000e+00, %56
%58 = fmul fast float %46, %42
%59 = fsub fast float %33, %58
%60 = fmul fast float %57, %59
%61 = add i64 %28, %19
%62 = getelementptr inbounds float, float* %13, i64 %61
%63 = load float, float* %62, align 4
%64 = fmul fast float %50, %46
%65 = fsub fast float %63, %64
%66 = fmul fast float %65, %57
%67 = add i64 %28, %20
%68 = getelementptr inbounds float, float* %13, i64 %67
%69 = load float, float* %68, align 4
%.neg = fmul fast float %50, %50
%.neg9 = fmul fast float %66, %66
%70 = fsub fast float %69, %reass.add
%71 = call fast float @llvm.sqrt.f32(float %70)
%72 = getelementptr inbounds float, float* %22, i64 %28
store float %42, float* %72, align 4
%73 = add i64 %28, %26
%74 = getelementptr inbounds float, float* %22, i64 %73
store float %60, float* %74, align 4
%.neg10 = fmul fast float %50, %42
%.neg11 = fmul fast float %66, %60
%75 = fsub fast float %36, %reass.add12
%76 = fdiv fast float %75, %71
%77 = add i64 %28, %27
%78 = getelementptr inbounds float, float* %22, i64 %77
store float %76, float* %78, align 4
%79 = icmp eq i64 %value_phi3, %10
%80 = add nuw i64 %value_phi3, 1
br i1 %79, label %L64, label %L13

L64:                                              ; preds = %L13, %top
ret %jl_value_t* inttoptr (i64 139708209337840 to %jl_value_t*)
}
``````

Notice all the `<8 x double>` with `Float64` and `<16 x float>` with `Float32`.

The code was written to be SIMD across iterations of the loop. This means that while what I wrote says `for i â 1:size(BPP,1)`, what actually happens with `Float32` is `for _i â 1:15:size(BPP,1); i = _i:_i+15`, and for `Float64` is `for _i â 1:7:size(BPP,1); i = _i:_i+7`. That is, instead of calculating 1 iteration of a loop at a time, it calculates 16 or 8 of the iterations of the loop in parallel using SIMD instructions.
That is the major advantage of `Float32`: twice as many of them fit in a register, so IF it is using SIMD instructions, itâll be able to calculate twice as much at a time.
This, and less cache pressure/memory bandwidth needed (for many problems, thatâs a much bigger concern). They are the two main reasons why `Float32` is faster for the above example.

A second reason for greater performance in that particular example isnât generally applicable, but discussed below for the sake of thoroughness.

code_native

Sometimes, specialized CPU instructions exist for certain data types but not others. This is a less extreme example than gcalderoneâs.

For example, my CPU has a âapproximate inverse square rootâ instruction for `Float32`, but not for `Float64`. So for the `Float32` case, the compiler can use this approximate instruction, and then add a Newton-Raphson iteration to make it more accurate. However, for `Float64`, the compiler is forced to use the square root instruction and then divide.
I did try square root and division instructions with `Float32` as well. This took twice as long IIRC, and wasnt really more accurate.
`Float64`:

``````# julia> code_native(process_big_prop_points!, Tuple{Matrix{Float64},Matrix{Float64}}, debuginfo=:none)
.text
pushq   %rbp
pushq   %r15
pushq   %r14
pushq   %r13
pushq   %r12
pushq   %rbx
movq    %rsi, -8(%rsp)
movq    8(%rsi), %rax
movq    24(%rax), %rdi
testq   %rdi, %rdi
jle     L300
movq    %rdi, %rcx
sarq    \$63, %rcx
andnq   %rdi, %rcx, %r8
movq    (%rsi), %rcx
movq    (%rax), %r15
movq    (%rcx), %r9
movq    24(%rcx), %r14
leaq    (%r9,%r14,8), %r10
shlq    \$4, %r14
leaq    (%rdi,%rdi,8), %rax
leaq    (%r15,%rax,8), %r11
movq    %rdi, %r12
shlq    \$6, %r12
leaq    (%rdi,%rdi,2), %rsi
shlq    \$4, %rsi
imulq   \$56, %rdi, %rbx
leaq    (%rdi,%rdi,4), %rax
leaq    (%r15,%rax,8), %r13
movq    %rdi, %rax
shlq    \$5, %rax
leaq    (%r15,%rdi,8), %rdx
shlq    \$4, %rdi
xorl    %ecx, %ecx
movabsq \$.rodata.cst8, %rbp
vmovsd  (%rbp), %xmm0           # xmm0 = mem[0],zero
nopw    %cs:(%rax,%rax)
L160:
vmovsd  (%rax,%rcx,8), %xmm1    # xmm1 = mem[0],zero
vsqrtsd %xmm1, %xmm1, %xmm1
vdivsd  %xmm1, %xmm0, %xmm1
vmulsd  (%r15,%rcx,8), %xmm1, %xmm2
vmulsd  (%r13,%rcx,8), %xmm1, %xmm3
vmulsd  (%rbx,%rcx,8), %xmm1, %xmm1
vmovsd  (%rsi,%rcx,8), %xmm4    # xmm4 = mem[0],zero
vfnmadd231sd    %xmm3, %xmm3, %xmm4 # xmm4 = -(xmm3 * xmm3) + xmm4
vsqrtsd %xmm4, %xmm4, %xmm4
vdivsd  %xmm4, %xmm0, %xmm4
vmovsd  (%rdx,%rcx,8), %xmm5    # xmm5 = mem[0],zero
vfnmadd231sd    %xmm2, %xmm3, %xmm5 # xmm5 = -(xmm3 * xmm2) + xmm5
vmulsd  %xmm5, %xmm4, %xmm5
vfnmadd213sd    (%r12,%rcx,8), %xmm1, %xmm3 # xmm3 = -(xmm1 * xmm3) + mem
vmulsd  %xmm4, %xmm3, %xmm3
vmulsd  %xmm1, %xmm1, %xmm4
vfnmsub231sd    %xmm3, %xmm3, %xmm4 # xmm4 = -(xmm3 * xmm3) - xmm4
vsqrtsd %xmm4, %xmm4, %xmm4
vmulsd  %xmm2, %xmm1, %xmm1
vfnmsub213sd    %xmm1, %xmm5, %xmm3 # xmm3 = -(xmm5 * xmm3) - xmm1
vdivsd  %xmm4, %xmm1, %xmm1
vmovsd  %xmm2, (%r9,%rcx,8)
vmovsd  %xmm5, (%r10,%rcx,8)
vmovsd  %xmm1, (%r14,%rcx,8)
incq    %rcx
cmpq    %rcx, %r8
jne     L160
L300:
movabsq \$jl_system_image_data, %rax
popq    %rbx
popq    %r12
popq    %r13
popq    %r14
popq    %r15
popq    %rbp
retq
``````

`Float32`

``````        .text
pushq   %rbp
pushq   %r15
pushq   %r14
pushq   %r13
pushq   %r12
pushq   %rbx
movq    %rsi, -8(%rsp)
movq    8(%rsi), %rax
movq    24(%rax), %rbx
testq   %rbx, %rbx
jle     L343
movq    %rbx, %rcx
sarq    \$63, %rcx
andnq   %rbx, %rcx, %r8
movq    (%rsi), %rdx
movq    (%rax), %rcx
movq    (%rdx), %r14
movq    24(%rdx), %rax
leaq    (%r14,%rax,8), %r9
leaq    (%r14,%rax,4), %r10
leaq    (%rbx,%rbx,8), %rax
leaq    (%rcx,%rax,4), %r11
movq    %rbx, %r13
shlq    \$5, %r13
leaq    (%rbx,%rbx,2), %rdx
leaq    (%rcx,%rdx,8), %r15
leaq    (%rax,%rax,2), %rax
leaq    (%rbx,%rbx,4), %rdx
leaq    (%rcx,%rdx,4), %r12
movq    %rbx, %rdx
shlq    \$4, %rdx
leaq    (%rcx,%rbx,8), %rdi
leaq    (%rcx,%rbx,4), %rbx
xorl    %esi, %esi
movabsq \$.rodata.cst4, %rbp
vmovss  (%rbp), %xmm0           # xmm0 = mem[0],zero,zero,zero
movabsq \$139707740336728, %rbp  # imm = 0x7F103E3C3258
vmovss  (%rbp), %xmm1           # xmm1 = mem[0],zero,zero,zero
nop
L160:
vmovss  (%rdx,%rsi,4), %xmm2    # xmm2 = mem[0],zero,zero,zero
vrsqrtss        %xmm2, %xmm2, %xmm3
vmulss  %xmm3, %xmm2, %xmm2
vfmadd213ss     %xmm0, %xmm3, %xmm2 # xmm2 = (xmm3 * xmm2) + xmm0
vmulss  %xmm1, %xmm3, %xmm3
vmulss  %xmm2, %xmm3, %xmm2
vmulss  (%rcx,%rsi,4), %xmm2, %xmm3
vmulss  (%r12,%rsi,4), %xmm2, %xmm4
vmulss  (%rax,%rsi,4), %xmm2, %xmm2
vmovss  (%r15,%rsi,4), %xmm5    # xmm5 = mem[0],zero,zero,zero
vfnmadd231ss    %xmm4, %xmm4, %xmm5 # xmm5 = -(xmm4 * xmm4) + xmm5
vrsqrtss        %xmm5, %xmm5, %xmm6
vmulss  %xmm6, %xmm5, %xmm5
vfmadd213ss     %xmm0, %xmm6, %xmm5 # xmm5 = (xmm6 * xmm5) + xmm0
vmulss  %xmm1, %xmm6, %xmm6
vmulss  %xmm5, %xmm6, %xmm5
vmovss  (%rbx,%rsi,4), %xmm6    # xmm6 = mem[0],zero,zero,zero
vfnmadd231ss    %xmm3, %xmm4, %xmm6 # xmm6 = -(xmm4 * xmm3) + xmm6
vfnmadd213ss    (%r13,%rsi,4), %xmm2, %xmm4 # xmm4 = -(xmm2 * xmm4) + mem
vmulss  %xmm6, %xmm5, %xmm6
vmulss  %xmm5, %xmm4, %xmm4
vmulss  %xmm2, %xmm2, %xmm5
vfnmsub231ss    %xmm4, %xmm4, %xmm5 # xmm5 = -(xmm4 * xmm4) - xmm5
vrsqrtss        %xmm5, %xmm5, %xmm7
vmulss  %xmm7, %xmm5, %xmm5
vfmadd213ss     %xmm0, %xmm7, %xmm5 # xmm5 = (xmm7 * xmm5) + xmm0
vmulss  %xmm1, %xmm7, %xmm7
vmulss  %xmm5, %xmm7, %xmm5
vmulss  %xmm3, %xmm2, %xmm2
vfnmsub213ss    %xmm2, %xmm6, %xmm4 # xmm4 = -(xmm6 * xmm4) - xmm2
vmulss  %xmm5, %xmm2, %xmm2
vmovss  %xmm3, (%r14,%rsi,4)
vmovss  %xmm6, (%r10,%rsi,4)
vmovss  %xmm2, (%r9,%rsi,4)
incq    %rsi
cmpq    %rsi, %r8
jne     L160
L343:
movabsq \$jl_system_image_data, %rax
popq    %rbx
popq    %r12
popq    %r13
popq    %r14
popq    %r15
popq    %rbp
retq
``````

I have two points with that example:

1. To actually benefit from SIMD, you must actually make sure that the code is getting vectorized. That will require you to choose data layouts accordingly, and make sure the autovectorizer is actually triggering â itâs easy to make changes to the above code and no longer get vectorization (or, at least, it was at the time â Juliaâs gotten a better since then). And it isnât just Julia â gfortran, for example, had two associated performance bugs, one of which resulted in, starting with the newly released gcc10, gfortran now inlining packing by default, and providing a flag `-fno-inline-arg-packing` to control that behavior. If you donât pay careful attention to this, it is easy to lose any potential performance benefit. Serial `Float64` instructions are just as fast as serial `Float32` instructions. Same story if youâre calling library routines that arenât paying attention either.
1. That there were no tolerance parameters, it was just arithmetic. The problem is that `BPP[i,10] - L31*L31 - L32*L32` often goes from being greater than zero with `Float64` to less than zero with `Float32`. Many of the matrices werenât especially well conditioned. Plus, there was a Gibbs Sampler involved.
2 Likes

Isnât that simply because gpus are extremely much faster on Float32, while this is not the case for cpus?

2 Likes

Even when comparing 2 GPUs which are compatible with `Float32` and `Float64` types, we would find the former faster than latter. Of course, we cannot compare GPU to CPU for performance gains in machine learning. I further think that CPU may be faster using `Float32` rather than `Float64`, even when GPUs are not used.
[Reference]