Any suggestions for speeding this simple scalar function up?

You sure?

julia> @btime ρ0 ^ γ setup = (ρ0 = 1000.0; γ = 7)
  85.486 ns (0 allocations: 0 bytes)
1.0e21

julia> @btime ρ0 ^ γ setup = (ρ0 = 1000.0; γ = 7.0)
  85.442 ns (0 allocations: 0 bytes)
1.0e21

Maybe that’s true when the exponent is a literal, or a small (< 4) integer, but if you’re using a variable exponent larger than 3, it is converted to a float anyway.
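A quick sketch of my own to illustrate the literal special-casing (the function names are just made up for the example):

# Literal integer exponents are lowered to Base.literal_pow, which has unrolled
# methods for small powers; a runtime exponent goes through the generic ^(x, n).
f_literal(x) = x^3       # lowers to Base.literal_pow(^, x, Val{3}())
f_variable(x, n) = x^n   # generic method, no literal unrolling
# Compare e.g. @code_lowered f_literal(1.0) with @code_lowered f_variable(1.0, 3)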

Ehr…

2 Likes

Is it with @turbo?

I’m talking about Base Julia, in the example above there is no @turbo

1 Like

Just because the beginning of this discussion was about a 50x speedup from using it. Here

Yes, you can do

(1/rho0)^7

but I don’t recommend this anymore; I think it’s bad numerically, even if it doesn’t fail catastrophically.
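A quick sketch (mine, not from the thread) to quantify that concern, comparing both forms against a BigFloat reference:

ρ0 = 1000.0
ref = Float64(1 / big(ρ0)^7)          # well-rounded reference for ρ0^-7
rel_err(x) = abs(x - ref) / abs(ref)  # relative error vs. the reference
rel_err((1 / ρ0)^7)                   # reciprocal first, then the power
rel_err(1 / ρ0^7)                     # power first, then the reciprocal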

1 Like

In my example above there was no @turbo. With @turbo:

julia> @btime @turbo(ρ0 ^ γ) setup = (ρ0 = 1000.0; γ = 7.0)
  85.443 ns (0 allocations: 0 bytes)
1.0e21

julia> @btime @turbo(ρ0 ^ γ) setup = (ρ0 = 1000.0; γ = 7)
  85.474 ns (0 allocations: 0 bytes)
1.0e21

No difference at all between the two versions, and exactly the same timing as without @turbo (not sure @turbo does anything at all here, though).

It makes a huge difference for vectors, just look up thread.

Here’s another example:

2 Likes

I’m on my phone, and can’t check. It’s weird that accuracy crashed and burned exactly for ^7, and was suddenly off by 8 orders of magnitude, but was ok for ^6, so I assumed overflow.
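For completeness, a minimal illustration of that overflow hypothesis (my sketch, assuming an Int64 base rather than the Float64 base used above):

# With an Int64 base, 1000^7 == 10^21 exceeds typemax(Int64) ≈ 9.2e18 and
# silently wraps around modulo 2^64; the Float64 version is fine here,
# since 1.0e21 is exactly representable.
1000^7      # Int64: wraps, giving a meaningless value
1000.0^7    # Float64: 1.0e21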

I believe we’re getting lost in translation. I was talking about difference between a ^ b when b is an integer or a float (and b is not an integer literal). @lmiq asked about LoopVectorization when I was talking about the Base function ^. Now you’re comparing pow(x, 7) (which is different from x ^ 7) for vectors, between Base.map and LoopVectorization.vmap, which doesn’t have anything to do with what I was talking about above?

But then I think it’s you who are a bit off-topic.

2 Likes

How so? :confused: I pointed out that ρ0 ^ γ overflows when ρ0 = 1000 and γ = 7, and there is no performance benefit in using integer values in this case (at least in Base)

While off topic, why is this threshold so low? Looks like it was added in #20889 but I don’t quite see where 3 was argued for.

julia> @btime y .= x .^ 3  setup=(x=rand(1000); y=similar(x));
  176.471 ns (0 allocations: 0 bytes)

julia> @btime y .= x .^ 4  setup=(x=rand(1000); y=similar(x));
  16.958 μs (0 allocations: 0 bytes)

julia> pow4(x) = begin z = x * x; z * z end

julia> @btime y .= pow4.(x)  setup=(x=rand(1000); y=similar(x));
  157.401 ns (0 allocations: 0 bytes)

Edit with some LV comparison:

julia> @btime @turbo(y .= x .^ 4)  setup=(x=rand(1000); y=similar(x));
  327.248 ns (0 allocations: 0 bytes)

julia> @btime @turbo(y .= x .^ 4.0)  setup=(x=rand(1000); y=similar(x));
  8.875 μs (0 allocations: 0 bytes)

# and x^7, similar

julia> @btime @turbo(y .= x .^ 7)  setup=(x=rand(1000); y=similar(x));
  240.423 ns (0 allocations: 0 bytes)

julia> pow7(x) = begin x3 = x^3; x * x3^2 end;

julia> @btime y .= pow7.(x)  setup=(x=rand(1000); y=similar(x));
  213.394 ns (0 allocations: 0 bytes)

I said so because everyone else was talking about LoopVectorization, and you came in saying that it was not relevant, and insisted on talking about the Base versions.

The threshold is low because of numerical accuracy. x^3 is the last point at which the error of repeated multiplication stays below 1.5 ULP.
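One way to check that kind of bound empirically (my own sketch, not part of the original post) is to measure the error of repeated multiplication in ULPs against a BigFloat reference:

# Error (in ULPs) of a Float64 result `approx` for x^n, measured against the
# correctly rounded BigFloat reference.
ulp_err(approx, x, n) = abs(approx - Float64(big(x)^n)) / eps(Float64(big(x)^n))
xs = rand(10_000) .+ 0.5
maximum(ulp_err(x * x * x, x, 3) for x in xs)          # x^3 by repeated multiplication
maximum(ulp_err((x * x) * (x * x), x, 4) for x in xs)  # x^4 by two squarings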

3 Likes

If you do this in a loop, LoopVectorization will automatically apply that transformation for you:

using LoopVectorization, BenchmarkTools
function pow7turbo!(y, x)
    @turbo for i in eachindex(y,x)
        y[i] = x[i]^7
    end
end
pow7(x) = begin x3 = x^3; x * x3^2 end;
@btime pow7turbo!(y, x)  setup=(x=rand(1000); y=similar(x));
@btime y .= pow7.(x)  setup=(x=rand(1000); y=similar(x));

I get:

julia> @btime pow7turbo!(y, x)  setup=(x=rand(1000); y=similar(x));
  111.771 ns (0 allocations: 0 bytes)

julia> @btime y .= pow7.(x)  setup=(x=rand(1000); y=similar(x));
  117.757 ns (0 allocations: 0 bytes)

I guess I could apply the same transform earlier, at a point where it’d also apply to broadcast statements.
This happens at macro-expand time given literals; otherwise LoopVectorization currently can’t read the constant values.

1 Like

Ah thanks, that makes sense.

After Oscar’s comment I realised that @fastmath also seems to make this transformation? But it doesn’t seem to digest the broadcast:

julia> @btime y .= Base.FastMath.pow_fast.(x, Val(7))  setup=(x=rand(1000); y=similar(x));
  163.589 ns (0 allocations: 0 bytes)

julia> @macroexpand @fastmath(y .= x .^ 7)
:(y .= x .^ 7)

julia> @macroexpand @fastmath(y = x ^ 7)
:(y = Base.FastMath.pow_fast(x, Val{7}()))
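
One workaround (my own suggestion, not something proposed in the thread) is to wrap the scalar expression in a helper so @fastmath sees the literal ^, and broadcast the helper instead:

# @fastmath rewrites the literal x^7 inside the body to
# Base.FastMath.pow_fast(x, Val{7}()); broadcasting the wrapper then applies
# it elementwise, which `@fastmath y .= x .^ 7` by itself does not.
pow7_fast(x) = @fastmath x^7
x = rand(1000); y = similar(x)
y .= pow7_fast.(x)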

For expressions like this, the first thing I would strive for (before speed) is numerical stability, e.g. by transforming to logs, such as

logP = logb + LogExpFunctions.logsubexp(gamma * (logrho - logrho0), 0)

and if possible staying in log space for the whole calculation. This occasionally improves speed too, but I did not benchmark it for this particular problem.
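
To make that concrete, here is a small sketch; the Tait-like form P = b * ((ρ/ρ0)^γ - 1) and the names logb, gamma, logrho, logrho0 are my assumptions about the expression being discussed:

using LogExpFunctions  # logsubexp(a, b) == log(abs(exp(a) - exp(b)))

# Sketch, assuming P = b * ((ρ/ρ0)^γ - 1) with ρ > ρ0, so that
# log(P) = log(b) + log(exp(γ * (log(ρ) - log(ρ0))) - 1).
function logpressure(logrho, logrho0, logb, gamma)
    return logb + logsubexp(gamma * (logrho - logrho0), 0.0)
end

logP = logpressure(log(1001.0), log(1000.0), log(1.0e5), 7.0)  # stay in log space…
P = exp(logP)                                                  # …exponentiate only at the very end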

3 Likes