Any suggestions for speeding this simple scalar function up?


I’ve been playing around calculating this expression:

P = b * ((rho/rho0)^gamma - 1)

The value of b is calculated as shown in the code (see the spoiler at the bottom). Running my different variants of the code, the minimum times I get are:

Minimum Time EOS1:
Trial(62.999 μs)
Minimum Time EOS2:
Trial(68.600 μs)
Minimum Time EOS3:
Trial(69.401 μs)
Minimum Time EOS1AVX:
Trial(61.799 μs)
Minimum Time EOS1SIMILAR:
Trial(63.399 μs)

EOS1, which is written in the most “naive” way, seems to perform the best. For fun and out of curiosity: would anyone know how to speed this up further?

Thanks for reading.

using BenchmarkTools
using LoopVectorization
rho   = rand(990:rand():1010,10^3)

function EOS1(rho; c0=100, gamma=7, rho0=1000)
    b = (c0^2 * rho0) / gamma
    P = b * ((rho / rho0).^gamma .- 1)
    return P
end

function EOS1SIMILAR(rho; c0=100, gamma=7, rho0=1000)
    b = (c0^2 * rho0) / gamma
    P = similar(rho)
    P .= b * ((rho / rho0).^gamma .- 1)
    return P
end

function EOS2(rho; c0=100, gamma=7, rho0=1000)
    b = (c0^2 * rho0) / gamma
    rho0_denominator = 1 / (rho0^gamma)
    P = b * (rho.^gamma * rho0_denominator .- 1)
    return P
end

function EOS3(rho; c0=100, gamma=7, rho0=1000)
    b = (c0^2 * rho0) / gamma
    rho0_denominator = 1 / (rho0^gamma)
    P = rho.^gamma * rho0_denominator * b .- b
    return P
end

bEOS1  = @benchmark EOS1($rho)
println("Minimum Time EOS1:\n",bEOS1)
bEOS2  = @benchmark EOS2($rho)
println("Minimum Time EOS2:\n",bEOS2)
bEOS3  = @benchmark EOS3($rho)
println("Minimum Time EOS3:\n",bEOS3)

bEOS1AVX  = @benchmark @avx EOS1($rho)
println("Minimum Time EOS1AVX:\n",bEOS1AVX)
bEOS1SIMILAR  = @benchmark EOS1SIMILAR($rho)
println("Minimum Time EOS1SIMILAR:\n",bEOS1SIMILAR)

Use more dots. In all your functions, only rho is a vector; everything else is a scalar. Dots in the same expression are only fused into a single loop if every operation they touch is also dotted. If they are separated, the broadcast has to materialize an intermediate array first, allocating and running more loops than necessary.

Another way would be to use @..


function EOS4(rho; c0=100, gamma=7, rho0=1000)
    b = (c0^2 * rho0) / gamma
    rho0_denominator = 1 / (rho0^gamma)
    return rho .^ gamma .* rho0_denominator .* b .- b
end
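To see where the extra allocations come from, one can compare the unfused and fused spellings directly; a small sketch, where b is just the scalar from the functions above (c0^2 * rho0 / gamma with c0 = 100):

```julia
rho = rand(10^3)
b, gamma, rho0 = 1e7 / 7, 7, 1000

# Unfused: rho / rho0 materializes a temporary vector, then the dotted
# part runs as a second loop over that temporary.
unfused(rho) = b * ((rho / rho0) .^ gamma .- 1)

# Fully dotted: everything fuses into one loop; only the result is allocated.
fused(rho) = b .* ((rho ./ rho0) .^ gamma .- 1)

unfused(rho) ≈ fused(rho)  # same values; the fused version allocates less
```

Both return identical results; only the number of intermediate arrays differs, which is what the timings below pick up.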

Thanks for the suggestion.

When I try on my system with your function:

bEOS4  = @benchmark EOS4($rho)
  memory estimate:  7.94 KiB
  allocs estimate:  1
  minimum time:     67.399 μs (0.00% GC)
  median time:      69.799 μs (0.00% GC)
  mean time:        70.641 μs (0.00% GC)
  maximum time:     130.999 μs (0.00% GC)
  samples:          10000
  evals/sample:     1

Which is still ~10% slower than EOS1. I tried adding more “.” in EOS1 too, but it did not affect performance.

Kind regards

You seem to have tried some @avx variant, but is it actually doing anything there? Applying the macro at the call site (as in @benchmark @avx EOS1($rho)) does not transform the loops inside the function. Here @turbo makes a huge difference:

julia> using LoopVectorization

julia> function EOS1(rho; c0=100, gamma=7, rho0=1000)
           b = (c0^2 * rho0) / gamma
           P = b*((rho/rho0).^gamma .- 1)
       end
EOS1 (generic function with 1 method)

julia> function EOS1_turbo(rho; c0=100, gamma=7, rho0=1000)
           b = (c0^2 * rho0) / gamma
           P = @turbo b*((rho/rho0).^gamma .- 1)
       end
EOS1_turbo (generic function with 1 method)

julia> EOS1(rho) ≈ EOS1_turbo(rho)
true

julia> @btime EOS1($rho);
  68.043 μs (3 allocations: 23.81 KiB)

julia> @btime EOS1_turbo($rho);
  1.809 μs (3 allocations: 23.81 KiB)

If this will be called from within a hot loop, you probably want to preallocate P.


If you use @. to have better chances of not forgetting some loop fusion, that gets even better:

julia> function EOS1_turbo(rho; c0=100, gamma=7, rho0=1000)
           b = (c0^2 * rho0) / gamma
           P = @turbo @. b*((rho/rho0)^gamma - 1)
       end
EOS1_turbo (generic function with 1 method)

julia> @btime EOS1_turbo($rho);
  701.667 ns (1 allocation: 7.94 KiB)

(The way one could notice this was needed: there were 3 allocations in the previous version, where only P should be a new allocation. So some dot was missing and an intermediate vector was being generated; the previous version should have been P = @turbo b .* ((rho ./ rho0).^gamma .- 1).)


Wow, you are indeed right, it has a huge impact! I also agree on using @., a lot less thinking of putting dots and gives better looking equation.

But some questions:

What is a hot loop?

A simple explanation of what @turbo does to make this much faster?

If you have time and want to.


@turbo has a lot of tricks up its sleeve, but faster integer exponentiation makes the biggest difference:

julia> pow(x, n) = x^n
pow (generic function with 1 method)

julia> @btime map(x -> pow(x, 7), x) setup=(x=rand(1000));
  30.900 μs (1 allocation: 7.94 KiB)

julia> @btime LoopVectorization.vmap(x -> pow(x, 7), x) setup=(x=rand(1000));
  897.872 ns (1 allocation: 7.94 KiB)
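The core of that speedup for x^7 is addition-chain exponentiation (the technique FastPow.jl automates); hand-rolled, it is just three multiplications. A sketch, with pow7 being my own name:

```julia
# x^7 via an addition chain: x², then x⁴ = (x²)², then x⁷ = x⁴ * x² * x.
# Three multiplies instead of a generic pow call.
function pow7(x)
    x2 = x * x
    x4 = x2 * x2
    return x4 * x2 * x
end

pow7(2.0)  # 128.0
```

The result can differ from x^7 by a few ulps, which is why the `≈` comparison above is the right check.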

If you have to call that same function thousands of times inside a loop, you will want to avoid creating a new P every time. Thus, create a variant of the function that receives a preallocated P and updates it in place.
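A sketch of what such a variant could look like (EOS1! is my own naming, following the trailing-! convention; prepend @turbo to the broadcast if LoopVectorization is loaded):

```julia
# In-place variant: the caller allocates P once and reuses it on every call.
function EOS1!(P, rho; c0=100, gamma=7, rho0=1000)
    b = (c0^2 * rho0) / gamma
    @. P = b * ((rho / rho0)^gamma - 1)  # fused broadcast writes into P
    return P
end

rho = rand(990.0:0.1:1010.0, 10^3)
P = similar(rho)
EOS1!(P, rho)  # subsequent calls reuse P instead of allocating a new vector
```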

@turbo loop-vectorizes the operation, meaning it uses the capacity of the processor to perform more than one operation at once (search for SIMD or loop vectorization on Google). When the operations are amenable to that, it is very good (sometimes this happens automatically, even without the @turbo flag, and sometimes you get a good speedup with @simd).

Now, how exactly in this specific case the speedup is that great, I think some expert should kick in. Normally I would expect something of the order of 4x to 8x speedup.
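As a toy illustration of the “independent iterations” requirement (my own example, not from the code above):

```julia
# Each iteration only touches x[i], so the processor is free to handle
# several elements per instruction (SIMD); @simd gives the compiler
# permission to reorder the floating-point accumulation accordingly.
function scaled_sum(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += 2.0 * x[i]
    end
    return s
end

scaled_sum(ones(100))  # 200.0
```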


Ah, well noticed. Then you should probably take a look at FastPow.jl (JuliaMath/FastPow.jl on GitHub: optimal addition-chain exponentiation for Julia).

(for some reason I don’t know, it is not affecting the result here:)

julia> function EOS1_fastpow(rho; c0=100, gamma=7, rho0=1000)
           b = (c0^2 * rho0) / gamma
           @fastpow P = @. b*((rho/rho0)^gamma - 1)
       end
EOS1_fastpow (generic function with 1 method)

julia> @btime EOS1_fastpow($rho);
  67.417 μs (1 allocation: 7.94 KiB)

(tried moving the macros to see if the result changes, with no success)


Thanks for the observation @stillyslalom !

I also tried using FastPow and pow(x, n), but unfortunately I could not get the first one to work, as you mentioned, Leandro, and the second one turned out slower with @turbo @.

Thanks for the explanation of hot loop too !

Kind regards

Sorry for interrupting with a stupid question: if I do not use the @turbo trick for those .+ .- .* ./ .^ etc. operations, will things be slowed down a lot?
And if I do not use these tricks, what is Julia's default optimization level? Something equivalent to Fortran's -O2?

Julia and LLVM both do loop unrolling and SIMD vectorization to a certain extent, but @turbo by default takes more knowledge about the current CPU into account. It’s also much more aggressive and (with wrong use) can be unsafe. How much @turbo will help in that regard is not really possible to pin down to a specific number (it depends on the loop in question) - it’s just that through its aggressive nature, you can usually see a large speedup compared to base julia and its LLVM passes (as far as I know, mostly because @turbo requires independent loop iterations, whereas both LLVM and julia would have to prove that on their own and can’t just assume it).

While julia itself does have “optimization levels” as a command line flag, they pretty much don’t do anything nowadays as far as I’m aware.


You can shave off a bit of extra time (5%-15%) by removing one division:

function EOS2_turbo(ρ; c0=100, γ=7, ρ0=1000)
    b1 = (c0^2 * ρ0) / γ
    b2 = b1 / ρ0^γ
    P = @turbo b2 .* ρ.^γ .- b1
    return P
end
jl> @btime EOS1_turbo($ρ);
  638.608 ns (1 allocation: 7.94 KiB)

jl> @btime EOS2_turbo($ρ);
  601.775 ns (1 allocation: 7.94 KiB)

(Sorry about changing your symbols, but I can no longer stand seeing variable names like gamma :wink: )

Edit: There’s an error in EOS2_turbo, but I can’t find what it is. The values are extremely wrong, though.

Update: My suggestion is apparently numerically abhorrent:

jl> ρ = 990.1

jl> ρ^γ / ρ0^γ

jl> (ρ/ρ0)^γ

So don’t use it!

What if gamma and rho0 are floats instead of integers that can overflow?

Haha no problem - I also like to use unicode symbols, just not for posting questions on Discourse generally :sweat_smile:

Interesting speedup, but weird behaviour. It seems like division and ^ take the same precedence, so it goes from left to right:

julia> 990.1^7 / 1000^7

julia> (990.1/1000)^7

julia> (990.1^7) / (1000^7)

So the error is that it is calculating:

(ρ^γ / ρ0)^γ

But that gives:


So I am a bit confused myself now too :slight_smile:

In this particular case I would bet that @turbo is much more aggressive than any Fortran compiler. You can try it. AFAIK you can get more aggressive loop optimizations with --march=native in Fortran. (You have asked that in the Fortran forum too, by the way.)


rho0 can be a float value, I have changed it to 1000.0

gamma can too in theory, but most often in my cases it would just be “7”. I found using “7” instead of “7.0” sped up the power calculation.

Kind regards

It’s not about precedence, but a numerical issue. It works ok up to 6, and fails for 7.


Probably integer overflow for 1000^7


julia> Base.Checked.checked_mul(1000^6, 1000)                             
ERROR: OverflowError: 1000000000000000000 * 1000 overflowed for type Int64
 [1] throw_overflowerr_binaryop(op::Symbol, x::Int64, y::Int64)           
   @ Base.Checked ./checked.jl:154                                        
 [2] checked_mul(x::Int64, y::Int64)                                      
   @ Base.Checked ./checked.jl:288                                        
 [3] top-level scope                                                      
   @ REPL[14]:1                                                           
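To make the overflow concrete: 10^21 does not fit in an Int64, while the same value is exactly representable as a Float64 (or computable with BigInt):

```julia
# Int64 wraps silently: 10^21 > typemax(Int64) ≈ 9.22e18
1000^7                # 3875820019684212736, not 10^21
1000.0^7              # 1.0e21 (every intermediate power of 10 is exact in Float64)
Float64(big(1000)^7)  # 1.0e21 via BigInt, where overflow cannot happen
```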

And @DNF, does that mean I should always use floats for this, i.e. 1000.0^7?

Or how should I interpret this result?

EDIT: Actually it is not such a problem for my equation, since it just does (rho/rho0)^7, so there is no overflow. But it means I cannot get the speed-up from calculating 1/rho0^7 first, I suppose.

Kind regards