Any suggestions for speeding this simple scalar function up?

You sure?

julia> @btime ρ0 ^ γ setup = (ρ0 = 1000.0; γ = 7)
  85.486 ns (0 allocations: 0 bytes)
1.0e21

julia> @btime ρ0 ^ γ setup = (ρ0 = 1000.0; γ = 7.0)
  85.442 ns (0 allocations: 0 bytes)
1.0e21

Maybe that’s true when the exponent is a literal, or a small (< 4) integer, but if you’re using a variable exponent larger than 3, it is converted to a float anyway.
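A quick sketch of my own to illustrate the literal special-casing (the function names are just made up for the example):

# Literal integer exponents are lowered to Base.literal_pow, which has unrolled
# methods for small powers; a runtime exponent goes through the generic ^(x, n).
f_literal(x) = x^3       # lowers to Base.literal_pow(^, x, Val{3}())
f_variable(x, n) = x^n   # generic method, no literal unrolling
# Compare e.g. @code_lowered f_literal(1.0) with @code_lowered f_variable(1.0, 3)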

Ehr…

2 Likes

Is it with @turbo?

I’m talking about Base Julia, in the example above there is no @turbo

1 Like

Just because the beginning of this discussion was about a 50x speedup from using it. Here

Yes, you can do

(1/rho0)^7

but I don’t recommend this anymore; I think it’s bad numerically, even if it doesn’t fail catastrophically.
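A quick sketch (mine, not from the thread) to quantify that concern, comparing both forms against a BigFloat reference:

ρ0 = 1000.0
ref = Float64(1 / big(ρ0)^7)          # well-rounded reference for ρ0^-7
rel_err(x) = abs(x - ref) / abs(ref)  # relative error vs. the reference
rel_err((1 / ρ0)^7)                   # reciprocal first, then the power
rel_err(1 / ρ0^7)                     # power first, then the reciprocal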

1 Like

In my example above there was no @turbo. With @turbo:

julia> @btime @turbo(ρ0 ^ γ) setup = (ρ0 = 1000.0; γ = 7.0)
  85.443 ns (0 allocations: 0 bytes)
1.0e21

julia> @btime @turbo(ρ0 ^ γ) setup = (ρ0 = 1000.0; γ = 7)
  85.474 ns (0 allocations: 0 bytes)
1.0e21

No difference at all between the two versions, and exactly the same timing as without @turbo (not sure @turbo does anything at all here, though).

It makes a huge difference for vectors, just look up thread.

Here’s another example:

2 Likes

I’m on my phone, and can’t check. It’s weird that accuracy crashed and burned exactly for ^7, and was suddenly off by 8 orders of magnitude, but was ok for ^6, so I assumed overflow.
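For completeness, a minimal illustration of that overflow hypothesis (my sketch, assuming an Int64 base rather than the Float64 base used above):

# With an Int64 base, 1000^7 == 10^21 exceeds typemax(Int64) ≈ 9.2e18 and
# silently wraps around modulo 2^64; the Float64 version is fine here,
# since 1.0e21 is exactly representable.
1000^7      # Int64: wraps, giving a meaningless value
1000.0^7    # Float64: 1.0e21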

I believe we’re getting lost in translation. I was talking about difference between a ^ b when b is an integer or a float (and b is not an integer literal). @lmiq asked about LoopVectorization when I was talking about the Base function ^. Now you’re comparing pow(x, 7) (which is different from x ^ 7) for vectors, between Base.map and LoopVectorization.vmap, which doesn’t have anything to do with what I was talking about above?

But then I think it’s you who are a bit off-topic.

2 Likes

How so? :confused: I pointed out that ρ0 ^ γ overflows when ρ0 = 1000 and γ = 7, and there is no performance benefit in using integer values in this case (at least in Base)

While off topic, why is this threshold so low? Looks like it was added in #20889 but I don’t quite see where 3 was argued for.

julia> @btime y .= x .^ 3  setup=(x=rand(1000); y=similar(x));
  176.471 ns (0 allocations: 0 bytes)

julia> @btime y .= x .^ 4  setup=(x=rand(1000); y=similar(x));
  16.958 μs (0 allocations: 0 bytes)

julia> pow4(x) = begin z = x * x; z * z end

julia> @btime y .= pow4.(x)  setup=(x=rand(1000); y=similar(x));
  157.401 ns (0 allocations: 0 bytes)

Edit with some LV comparison:

julia> @btime @turbo(y .= x .^ 4)  setup=(x=rand(1000); y=similar(x));
  327.248 ns (0 allocations: 0 bytes)

julia> @btime @turbo(y .= x .^ 4.0)  setup=(x=rand(1000); y=similar(x));
  8.875 μs (0 allocations: 0 bytes)

# and x^7, similar

julia> @btime @turbo(y .= x .^ 7)  setup=(x=rand(1000); y=similar(x));
  240.423 ns (0 allocations: 0 bytes)

julia> pow7(x) = begin x3 = x^3; x * x3^2 end;

julia> @btime y .= pow7.(x)  setup=(x=rand(1000); y=similar(x));
  213.394 ns (0 allocations: 0 bytes)

I said so because everyone else was talking about LoopVectorization, and you came in saying that it was not relevant, and insisted on talking about the Base versions.

The threshold is low because of numerical accuracy. x^3 is the last point at which the error of repeated multiplication stays below 1.5 ULP.
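One way to check that kind of bound empirically (my own sketch, not part of the original post) is to measure the error of repeated multiplication in ULPs against a BigFloat reference:

# Error (in ULPs) of a Float64 result `approx` for x^n, measured against the
# correctly rounded BigFloat reference.
ulp_err(approx, x, n) = abs(approx - Float64(big(x)^n)) / eps(Float64(big(x)^n))
xs = rand(10_000) .+ 0.5
maximum(ulp_err(x * x * x, x, 3) for x in xs)          # x^3 by repeated multiplication
maximum(ulp_err((x * x) * (x * x), x, 4) for x in xs)  # x^4 by two squarings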

3 Likes

If you do this in a loop, LoopVectorization will automatically apply that transformation for you:

using LoopVectorization, BenchmarkTools
function pow7turbo!(y, x)
    @turbo for i in eachindex(y,x)
        y[i] = x[i]^7
    end
end
pow7(x) = begin x3 = x^3; x * x3^2 end;
@btime pow7turbo!(y, x)  setup=(x=rand(1000); y=similar(x));
@btime y .= pow7.(x)  setup=(x=rand(1000); y=similar(x));

I get:

julia> @btime pow7turbo!(y, x)  setup=(x=rand(1000); y=similar(x));
  111.771 ns (0 allocations: 0 bytes)

julia> @btime y .= pow7.(x)  setup=(x=rand(1000); y=similar(x));
  117.757 ns (0 allocations: 0 bytes)

I guess I could apply the same transform earlier, at a point where it’d also apply to broadcast statements.
This happens at macro-expand time given literals; otherwise LoopVectorization currently can’t read the constant values.

1 Like

Ah thanks, that makes sense.

After Oscar’s comment I realised that @fastmath also seems to make this transformation? But it doesn’t seem to digest the broadcast:

julia> @btime y .= Base.FastMath.pow_fast.(x, Val(7))  setup=(x=rand(1000); y=similar(x));
  163.589 ns (0 allocations: 0 bytes)

julia> @macroexpand @fastmath(y .= x .^ 7)
:(y .= x .^ 7)

julia> @macroexpand @fastmath(y = x ^ 7)
:(y = Base.FastMath.pow_fast(x, Val{7}()))
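
One workaround (my own suggestion, not something proposed in the thread) is to wrap the scalar expression in a helper so @fastmath sees the literal ^, and broadcast the helper instead:

# @fastmath rewrites the literal x^7 inside the body to
# Base.FastMath.pow_fast(x, Val{7}()); broadcasting the wrapper then applies
# it elementwise, which `@fastmath y .= x .^ 7` by itself does not.
pow7_fast(x) = @fastmath x^7
x = rand(1000); y = similar(x)
y .= pow7_fast.(x)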

For expressions like this, the first thing I would strive for (before speed) is numerical stability, e.g. by transforming to logs, such as

logP = logb + LogExpFunctions.logsubexp(gamma * (logrho - logrho0), 0)

and if possible staying in log space for the whole calculation. This occasionally improves speed too, but I did not benchmark it for this particular problem.
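
To make that concrete, here is a small sketch; the Tait-like form P = b * ((ρ/ρ0)^γ - 1) and the names logb, gamma, logrho, logrho0 are my assumptions about the expression being discussed:

using LogExpFunctions  # logsubexp(a, b) == log(abs(exp(a) - exp(b)))

# Sketch, assuming P = b * ((ρ/ρ0)^γ - 1) with ρ > ρ0, so that
# log(P) = log(b) + log(exp(γ * (log(ρ) - log(ρ0))) - 1).
function logpressure(logrho, logrho0, logb, gamma)
    return logb + logsubexp(gamma * (logrho - logrho0), 0.0)
end

logP = logpressure(log(1001.0), log(1000.0), log(1.0e5), 7.0)  # stay in log space…
P = exp(logP)                                                  # …exponentiate only at the very end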

3 Likes