I was just playing around with the best way to compute the mean and standard deviation of a vector of values.
Obviously, sum(x), std(x)
is the cleanest but it isn’t the fastest, as I assume that approach needs to scan through the vector x
twice. We can easily just scan through it once and then calculate it.
From my testing, the function ms
is the fastest as it uses tricks available for indexing (something about avx instructions on CPUs). This trick is NOT available for the functional programming paradigm using reduce
. I tried the VectorisedReduction.jl package but it didn’t work (see ms5
).
It’s kinda sad that for Julia to hit top performance, we need to use “vectorized” versions of the code. This reminds me of Python/numpy and R. Also echoing the 1.5 language problem talk in the 2023 JuliaCon (sorry, I don’t know who the presenter is here), optimised Julia code is often very alien-looking vs “regular” Julia code to the point where you might as well call it a descendent language. For the record, when I tested the perform of Python vs Julia (in the past), I could get Numba to hit the same speed as optimised Julia. Now that could be because my problem was too “toyish” so I didn’t reveal the sharp edges of Numba but writing Numba in Python is akin to writing highly optimised code in Julia it’s kinda in the 1.5 language problem arena.
With Julia, you can use the GPU much more easily due to the usage of CUDA.jl but I think it’s still true that to get the most out of CUDA you would need to go into the CUDA C++ extensions and can’t rely on Julia’s just yet. I might be wrong as I haven’t been closely watching but that’s the kind of gist I get.
Are there even faster ways to optimise this?
My code for `sum(x) and std(x)`
using LoopVectorization
using VectorizedReduction: vvreduce
using VectorizedStatistics: vsum, vstd
function ms(x)
m = 0.0
s = 0.0
@turbo for i in eachindex(x)
m+=x[i]
s+=x[i]^2
end
m,s
n = length(x)
m, (s - n*m^2)/(n-1)
end
function ms2(x)
m, s = reduce(x; init=(0.0,0.0)) do (m, s), val
(m+=val, s += val^2)
end
n = length(x)
m, (s - m^2/n) / (n - 1)
end
function ms3(x)
sum(x), sum(x .^ 2)
end
function ms4(x)
sum(x), reduce((x,y)->x+y^2, x; init=0.0)
end
function ms5(x)
vvreduce(x; init=(0.0, 0.0)) do (m, s), val
(m += val, s += val^2)
end
end
function ms6(x)
vsum(x), vstd(x)
end
using BenchmarkTools
x = rand(1_000_000);
@benchmark ms($x)
@benchmark ms2($x)
@benchmark ms3($x)
@benchmark ms4($x)
@benchmark ms5($x)
@benchmark ms6($x)
ms(x)
ms2(x)
ms3(x)
ms4(x)
@code_warntype ms(x)
@code_warntype ms2(x)
@code_warntype ms3(x)
@code_warntype ms4(x)
edit fixed bug