Moving from `Float64` to `Float32` not improving performance

Hi, I’m trying to optimize a script. I first did the usual things to avoid memory allocations, but after moving all data structures from Float64 to Float32, which should reduce memory usage, the code has become super slow.

I’m still trying to get my head around this, but I was wondering if anyone could help me spot where my mistakes are.

  • Here’s the Float64 version. Benchmarked time is 0.392s.
    Here’s the Float32 version. The simulation takes forever, hinting at an error during the code conversion.

Here’s an overview of the Float64 script to be optimized:

using Revise, BenchmarkTools
using ProgressMeter, Infiltrator
using Plots

const sec_in_day = 24.0 * 60.0 * 60.0
const sec_in_year = sec_in_day * 365.0
const glen_n = 3.0
const ρ = 900.0
const g = 9.81

@views diff1(A) = (A[2:end] .- A[1:end - 1])
@views diff2(A) = (A[3:end] .- A[1:end - 2])
@views avg(A) = (A[2:end] .+ A[1:end - 1])./2.0

function glacier_evolution_optim(;
    dx=100.0,  # grid resolution in m
    nx=200,  # grid size
    width=600.0,  # glacier width in m
    top_h=3000.0,  # bed top altitude
    bottom_h=1200.0,  # bed bottom altitude
    glen_a=2.4e-24,  # ice stiffness
    ela_h=2600.0,  # mass balance model Equilibrium Line Altitude
    mb_grad=3.0,  # linear mass balance gradient (unit: [mm w.e. yr-1 m-1])
    n_years=700)  # simulation time in years
    function get_mb(heights)
        mb = (heights .- ela_h) .* mb_grad
        return mb ./ sec_in_year ./ ρ
    end

    bed_h = collect(LinRange(top_h, bottom_h, nx))
    surface_h = copy(bed_h)
    thick = bed_h .* 0.0

    t = 0.0
    dt = sec_in_day * 10.0

    years = collect(0:(n_years+1))
    volume = zeros(size(years))
    length = zeros(size(years))

    new_thick = zeros(nx)
    diffusivity = zeros(nx)
    diffusivity_s = zeros(nx)
    surface_gradient = zeros(nx)
    surface_gradient_s = zeros(nx)
    grad_x_diff = zeros(nx)
    flux_div = zeros(nx-1)
    mb = zeros(nx-1)

    for (i, y) in enumerate(years)
        let end_t = y * sec_in_year
        # Time integration
        while t < end_t
            # This is to guarantee a precise arrival on a specific date if asked
            remaining_t = end_t - t
            if remaining_t < dt
                dt = remaining_t
            end

            # Surface gradient
            surface_gradient[2:end-1] .= diff2(surface_h) ./ (2.0*dx)

            # Diffusivity
            diffusivity .= width * ((ρ*g)^3.0) .* (thick.^3.0) .* surface_gradient.^2.0
            diffusivity .*= 2.0/(glen_a+2.0) * glen_a .* thick.^2.0

            # Ice flux in a staggered grid
            diffusivity_s[2:end] .= avg(diffusivity)

            surface_gradient_s[2:end] .= diff1(surface_h) ./ dx

            grad_x_diff .= surface_gradient_s .* diffusivity_s
            flux_div .= diff1(grad_x_diff) ./ dx

            # Mass balance
            mb .= get_mb(surface_h[begin:end-1])

            # Ice thickness update: old + flux div + mb
            new_thick[begin:end-1] .= thick[begin:end-1] .+ (dt/width) .* flux_div .+ dt.*mb

            # We can have negative thickness because of MB - correct here
            thick .= ifelse.(new_thick.<0.0, 0.0, new_thick)
            @assert thick[end] == 0.0 "Glacier exceeding boundaries! at time $(t/sec_in_year)"

            # Prepare for next step 
            surface_h .= bed_h .+ thick
            t += dt
        end # while
        end # let

        volume[i] = sum(thick .* width .* dx)
        length[i] = sum(thick .> 0.0) * dx
    end # for

    # x coordinates
    xc = collect(0:nx-1) .* dx

    return xc, bed_h, surface_h, years, volume, length
end

function wrapper_grad(grad)
    return glacier_evolution_optim(mb_grad=grad)
end

#######  MAIN ########

# Let's test the performance
@btime xc, bed_h, surface_h, years, volume, length = glacier_evolution_optim()

Thanks in advance!

If you do not run this on a GPU, you will not gain much from Float32. See e.g. this discussion about it: Why no Float type alias?

I just took a glance into your code and stumbled over:

  • Why is function get_mb(heights) local in function glacier_evolution_optim? Why not just a global function with additional parameters ela_h and mb_grad?

  • And I see those `collect`s, like

years = collect(0:(n_years+1))

Later you do

for (i, y) in enumerate(years)

but you don’t use i (perhaps I missed it?), so perhaps just a

years = 0:(n_years+1)
for y in years

will do.

  • The purpose of those `let`s is not clear to me. Perhaps you developed those parts in the REPL, where you needed them for scope issues. You don’t need them in the function body.

I don’t know if those issues help performance; I didn’t try to run your code. But cleaning out the unneeded structures will help a lot to get people into the maximize-performance game.


It’s not clear where you are using Float32, but all your literal floats are Float64: everywhere you write 24.0 or 2.0 it’s a Float64, so you may get lots of type promotions or even instabilities.

Also, it’s better not to use floats for integer concepts. Replace

const sec_in_day = 24.0 * 60.0 * 60.0

with

const sec_in_day = 24 * 60 * 60

You can see it here:

julia> x = 3f0  # Float32
3.0f0

julia> typeof(x * 2)  # preserves type
Float32

julia> typeof(x * 2.0)  # promotes to Float64
Float64

Multiplying by 2.0 promotes your variable to Float64; multiply by 2 and the float type is preserved.


diffusivity .= width * ((ρ*g)^3.0) .* (thick.^3.0) .* surface_gradient.^2.0
diffusivity .*= 2.0/(glen_a+2.0) * glen_a .* thick.^2.0

should be replaced with:

diffusivity .= width * ((ρ*g)^3) .* (thick.^3) .* surface_gradient.^2
diffusivity .*= 2/(glen_a+2) * glen_a .* thick.^2

Not just to avoid promotion, but because x^2 is much faster than x^2.0. The former is just x*x, and x^3 is x*x*x, while the latter needs to run code that calculates general floating point exponents.
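A quick way to see this yourself (a rough sketch; timings are machine-dependent, so no numbers are shown):

```julia
using BenchmarkTools

x = rand(1000)

# integer exponent: compiled to x*x*x
@btime sum(xi -> xi^3, $x)

# float exponent: calls the general floating-point pow routine
@btime sum(xi -> xi^3.0, $x)
```

The two sums agree numerically; only the time per element differs.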

Your loop over years seems to allocate a lot of temporary arrays. Can that be avoided? For example:

volume[i] = sum(thick .* width .* dx)
length[i] = sum(thick .> 0.0) .* dx

could be changed to

volume[i] = sum(thick) * width * dx
length[i] = count(>(0), thick) * dx
Thanks for the feedback.

  • Is there really a performance gain by moving a function to the global scope?
  • In the for loop, I do use the i in enumerate(years). So I can’t really remove that.
  • The `let` block is there to be able to access all those variables inside the `for` and `while` loop scopes. These introduce new local scopes, so without it I cannot access them.

To be honest, I’m not terribly worried about pushing the performance much further. I was mostly curious to understand why I wasn’t getting any performance gains by moving to Float32. After reading that discussion, it is still unclear to me why there shouldn’t be any benefit to using Float32. Shouldn’t it halve the memory allocation?

You’re looking at the Float64 script. I gave direct links to the Float32 and Float64 scripts. In the Float32 I’m using Float32 types throughout the script as you mentioned.

I didn’t know about the Int exponents, thanks for that. I’ll also try the trick for volume and length.

What is important is the number of allocations, not the amount of memory used. The slowdown comes from having to create blocks of memory to store data, and that does not change (much?) with the type of variable.
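As a rough sketch of that point (array sizes and names here are made up; exact byte counts will vary), an allocating broadcast performs the same number of allocations for both element types, only the byte count shrinks:

```julia
using BenchmarkTools

# a deliberately allocating function: the slices and the result are new arrays
g(a) = a[2:end] .- a[1:end-1]

a64 = rand(Float64, 1000)
a32 = rand(Float32, 1000)

@btime g($a64)  # same allocation count as the Float32 call below
@btime g($a32)  # roughly half the bytes, but just as many allocations
```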

You can use enumerate without collecting the range.
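For example (a toy range, just to illustrate):

```julia
years = 0:3  # a UnitRange; no collect, no allocation

for (i, y) in enumerate(years)
    # i runs 1, 2, 3, 4 while y runs 0, 1, 2, 3
    @show i, y
end
```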

If you want to accelerate the code, you need to get rid of temporary allocations. For example, your diff2 function returns a new array, and that makes the line where you use it allocate:

julia> @views diff2(A) = (A[3:end] .- A[1:end - 2])
diff2 (generic function with 3 methods)

julia> f(surface_gradient,surface_h,dx) = surface_gradient[2:end-1] .= diff2(surface_h) ./ (2.0*dx)
f (generic function with 2 methods)

julia> @btime f($surface_gradient, $surface_h, 0.1)
  36.695 ns (1 allocation: 128 bytes)

Now if you want that to be faster, you could update the surface_gradient differently, without creating the intermediate difference array:

julia> new_diff2!(surface_gradient, surface_h, dx) = @views @. surface_gradient[2:end-1] = (surface_h[3:end] - surface_h[1:end-2])/(2*dx)
new_diff2! (generic function with 1 method)

julia> @btime new_diff2!($surface_gradient, $surface_h, 0.1)
  18.282 ns (0 allocations: 0 bytes)

(I’m just using random arrays of length 10 here). If you solve these kinds of allocations you may obtain much faster code.

Float32 can make the computations slightly faster in the end, but that will probably be very minor relative to solving these other issues.


Note that this allocates an extra temporary array for the result of diff2, because dot operations inside a function call are not merged with the dot operations outside the function call. Much better to do (with @views somewhere, e.g. on the whole function):

surface_gradient[2:end-1] .= (A[3:end] .- A[1:end-2]) ./ (2dx)

Note that it’s not necessary to write 2.0*dx: 2dx will automatically promote the result of the multiplication to the type of dx. Or even:

surface_gradient[2:end-1] .= (A[3:end] .- A[1:end-2]) .* eltype(A)(1/2dx)

so that the division is done outside the loop and the scalar is converted to the element type of A (e.g. Float64).
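To illustrate with a toy Float32 case (the array length and dx here are made up, not from the script):

```julia
A = rand(Float32, 10)
surface_gradient = zeros(Float32, 10)
dx = 100f0

# the scalar factor is computed once and converted to Float32 up front,
# so the fused broadcast loop does a single multiply per element
surface_gradient[2:end-1] .= @views (A[3:end] .- A[1:end-2]) .* eltype(A)(1/2dx)
```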

As another example, this:

diffusivity .= width * ((ρ*g)^3.0) .* (thick.^3.0) .* surface_gradient.^2.0
diffusivity .*= 2.0/(glen_a+2.0) * glen_a .* thick.^2.0

has multiple problems.

  1. Floating-point exponentials like x^3.0 are much slower to compute than integer exponents like x^3. The compiler doesn’t look at the value of 3.0, it just looks at the type and sees x^(somefloat), and calls a generic exponentiation routine that handles arbitrary non-integer exponents.
  2. The width * is a non-dotted multiplication that allocates a new array for its result (when it scales an array)
  3. By splitting this into two assignment statements, you have two loops instead of one.

Better to do something like:

diffusivity .= (width * (ρ*g)^3) .* thick.^3 .* surface_gradient.^2 .*
               (2glen_a/(glen_a+2)) .* thick.^2

Notice that I’m avoiding dots for the computation of purely scalar coefficients like (2glen_a/(glen_a+2)) — this way, the scalar computation is done outside the vectorized loop and the result is dropped into the vectorized loop over the arrays.

There are many other similar problems. In general, it looks like you really need to understand (a) Julia’s type-promotion rules and (b) what the dot vectorization syntax does and does not do.


Thanks a lot for all this information. I’m aware I still have some important concepts to grasp correctly, but I also feel there are some grey areas which are not too clear to me.

@lmiq Thanks a lot for the suggestion. I updated the implementation of the functions and it helped boost the performance further.

@stevengj Thanks a lot for all the details. I’m a little bit confused with your comments on dot operations. I thought I was using those in the way you mention (i.e. no broadcasting for scalars, just when arrays are involved).

In your re-writing of this:

diffusivity .= (ρ*g)^3 .* thick.^3 .* surface_gradient.^2 .*
               (2glen_a/(glen_a+2)) .* thick^2

There are a few things I don’t understand. Why are you using .^ on thick at the beginning but omitting it in the second line? It is always an array. I understand that the way it was written before, it could be improved in two ways: (1) using Int in the exponents, and (2) writing everything in a single line. But I was under the impression that I was already using broadcasting only when arrays were involved. Am I missing something?

Thanks again for all your feedback.

Typo, sorry. It’s now fixed.

width * (...) in your code multiplies a scalar width times an array. While this is a valid operation, because it is not “dotted” it generates a new temporary array as a result, rather than “fusing” the scalar multiplication with the other dot operations into a single allocation-free loop.

2.0/(glen_a+2.0) * glen_a .* thick.^2.0 is okay because * in Julia is left-associative, so the scalar operations at the left are done first, before the .*. But relying on associativity like this can be a bit fragile, because it is vulnerable to slight code modifications. It’s safer to write (2glen_a / (glen_a+2)) in parentheses to make it clear that it’s supposed to be calculated separately, in my opinion.
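A contrived sketch of that fragility (toy names, nothing from the script):

```julia
a = 3.0
x = rand(5)
y = similar(x)

# left-associativity: the scalar prefix is computed first, then one fused loop
y .= 2.0/(a+2.0) * a .* x.^2

# the same factors slightly rearranged: x.^2 .* 2.0 is now an array, so the
# following non-dotted / and * each allocate a temporary array before the .=
y .= x.^2 .* 2.0 / (a+2.0) * a
```

Both lines compute identical values; only the second one churns through temporaries.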


Hmmm, in this line:

diffusivity .= width * ((ρ*g)^3) .* (thick.^3) .* surface_gradient.^2

I was following this logic: width, ρ and g are scalars, so all of them interact via non-dotted operators. However, the interface of that scalar block (i.e. width * ((ρ*g)^3.0)) with an array uses dotted operators. Shouldn’t that be correct as well, following Julia’s left-associativity?

That line is fine:

julia> f(diffusivity, width, ρ, g, thick, surface_gradient) = diffusivity .= width * ((ρ*g)^3) .* (thick.^3) .* surface_gradient.^2
f (generic function with 1 method)

julia> diffusivity = rand(10); thick=rand(10); surface_gradient=rand(10);

julia> @btime f($diffusivity, 0.1, 0.1, 0.1, $thick, $surface_gradient)
  9.385 ns (0 allocations: 0 bytes)

IMHO, independently of how much one internalizes the rules, having the habit of testing these code snippets, like I did there, is what will make you confident that you are doing the best on each line.


I completely agree here. That’s what I normally do. But I must admit that I often come across counterintuitive behaviours. For example, some of you suggested merging these two lines into a single line in order to avoid having two loops:

diffusivity .= width * ((ρ*g)^3.0) .* (thick.^3.0) .* surface_gradient.^2.0
diffusivity .*= 2.0/(glen_a+2.0) * glen_a .* thick.^2.0

However, the benchmark indicates that it is actually faster to keep the code like this, in two lines. I sometimes struggle to find a reasonable explanation for some practical performance results.

Note that as of Julia 1.8, x^y has a fast path for when y is an integer. It’s still better to use integer exponents directly, but it’s not as bad as it used to be.


If you manage to write the dots correctly, the single line may be slightly faster, because you will be saving a loop. But that dot-fusion syntax can become pretty hard to get right.

I tend to avoid these overly complicated dot combinations because of that, but it is a matter of taste. Sometimes writing the loop is just simpler:

julia> function f0(diffusivity, width, ρ, g, thick, surface_gradient, glen_a)
           diffusivity .= width * ((ρ*g)^3.0) .* (thick.^3.0) .* surface_gradient.^2.0
           diffusivity .*= 2.0/(glen_a+2.0) * glen_a .* thick.^2.0
           return diffusivity
       end
f0 (generic function with 1 method)

julia> diffusivity = rand(10); thick=rand(10); surface_gradient=rand(10);

julia> @btime f0($diffusivity, 1.0, 1.0, 1.0, $thick, $surface_gradient, 1.0);
  75.667 ns (0 allocations: 0 bytes)

julia> function f1(diffusivity, width, ρ, g, thick, surface_gradient, glen_a)
           for i in eachindex(diffusivity, thick, surface_gradient)
               diffusivity[i] = (width * ((ρ*g)^3.0) * (thick[i]^3.0) * surface_gradient[i]^2.0) *
                                (2.0/(glen_a+2.0) * glen_a * thick[i]^2.0)
           end
           return diffusivity
       end
f1 (generic function with 1 method)

julia> @btime f1($diffusivity, 1.0, 1.0, 1.0, $thick, $surface_gradient, 1.0);
  70.115 ns (0 allocations: 0 bytes)

ps: By fixing the exponents to integers, this is what we get:

julia> function f1(diffusivity, width, ρ, g, thick, surface_gradient, glen_a)
           for i in eachindex(diffusivity, thick, surface_gradient)
               diffusivity[i] = (width * ((ρ*g)^3) * (thick[i]^3) * surface_gradient[i]^2) *
                                (2/(glen_a+2) * glen_a * thick[i]^2)
           end
           return diffusivity
       end
f1 (generic function with 1 method)

julia> @btime f1($diffusivity, 1.0, 1.0, 1.0, $thick, $surface_gradient, 1.0);
  9.594 ns (0 allocations: 0 bytes)

I don’t think that’s true (in general). Yes, on GPUs Float32 is going to be a lot faster, and Float16 (bfloat16, etc.) even more so, because you have more functional units for those (and better use of memory [bandwidth]).

I believe that should not happen on CPUs (or GPUs); it means you are doing something wrong, likely introduced a type instability (noticeable by e.g. extra memory allocations). If you do it right, I think it should at worst be as fast (though less precise; that's not always a problem, but if you need to work around it, then things could get slower, and you haven’t tried that yet).

One exception I can think of where Float32 would be slower is if you converge more slowly; that might apply to you, but I would rule out the other issues people have already pointed out first.

At best Float32 will give you a 2x speedup on CPUs (and 4x for Float16, though that is often not achievable), because of memory bandwidth, right?

I see a needless first line in your code:

# -*- coding: utf-8 -*-

That isn’t needed/wanted in Julia, nor even in Python 3. Julia always uses UTF-8 for source code (and this is just a comment in Julia), and Python 3 defaults to it (does anyone ever use anything else by now?). You could use such a special comment in Python for non-UTF-8, or for UTF-8 in outdated Python 2.

The above works ok, but I suggest instead:

diffusivity .= (width * (ρ*g)^3) .* (thick.^3) .* surface_gradient.^2

It’s clearer that (width * (ρ*g)^3) is a group.

Concerning the original question, F32 can accelerate, if using LoopVectorization:

julia> using LoopVectorization

julia> function f1(diffusivity, width, ρ, g, thick, surface_gradient, glen_a)
           @turbo for i in eachindex(diffusivity, thick, surface_gradient)
               diffusivity[i] = (width * ((ρ*g)^3) * (thick[i]^3) * surface_gradient[i]^2) *
                                (2/(glen_a+2) * glen_a * thick[i]^2)
           end
           return diffusivity
       end
f1 (generic function with 1 method)

julia> diffusivity = rand(10); thick=rand(10); surface_gradient=rand(10);

julia> @btime f1($diffusivity, 1.0, 1.0, 1.0, $thick, $surface_gradient, 1.0);
  11.353 ns (0 allocations: 0 bytes)

julia> diffusivity = rand(Float32,10); thick=rand(Float32,10); surface_gradient=rand(Float32,10);

julia> @btime f1($diffusivity, 1.0f0, 1.0f0, 1.0f0, $thick, $surface_gradient, 1.0f0);
  5.362 ns (0 allocations: 0 bytes)

But it is not about memory usage, but how many operations the processors can do simultaneously.

I think it would have made more sense to share the Float32 script in the OP instead.

Also, the link to the Float32 script is broken.

Oh, I was mis-reading the parentheses. Yes, the width * ((ρ*g)^3) multiplication will be done first as a scalar, and then multiplied by the array. Sorry.

(But I still think it’s clearer to explicitly group scalar operations with parentheses rather than relying on associativity. e.g. if you merge the two lines then the associativity may get changed if you are not careful.)


I can’t reproduce:

function f1(diffusivity, width, ρ, g, thick, surface_gradient, glen_a)
    diffusivity .= width * ((ρ*g)^3) .* (thick.^3) .* surface_gradient.^2
    diffusivity .*= 2.0/(glen_a+2.0) * glen_a .* thick.^2
end

function f2(diffusivity, width, ρ, g, thick, surface_gradient, glen_a)
    diffusivity .= (width * (ρ*g)^3) .* thick.^3 .* surface_gradient.^2 .*
                   (2glen_a/(glen_a+2)) .* thick.^2
end

using BenchmarkTools
for n in (10, 100, 1000, 10000)
    @show n
    @btime f1($(rand(n)), 1.0, 1.0, 1.0, $(rand(n)), $(rand(n)), 1.0)
    @btime f2($(rand(n)), 1.0, 1.0, 1.0, $(rand(n)), $(rand(n)), 1.0)
end


n = 10
  12.929 ns (0 allocations: 0 bytes)
  11.053 ns (0 allocations: 0 bytes)
n = 100
  44.276 ns (0 allocations: 0 bytes)
  38.936 ns (0 allocations: 0 bytes)
n = 1000
  375.607 ns (0 allocations: 0 bytes)
  322.017 ns (0 allocations: 0 bytes)
n = 10000
  4.494 μs (0 allocations: 0 bytes)
  3.188 μs (0 allocations: 0 bytes)

i.e. f2 is consistently faster, especially as the array gets larger and cache locality becomes an issue.