How do I make this allocation free?

Ahmed_Salih · January 21, 2023, 11:48am

Hello!

Suppose I have the parameters:

using StaticArrays
using LinearAlgebra

h         = 0.05
eta2 = (0.01)h * (0.01)h
points = rand(SVector{3,Float64},100000)
v         = points * 0.05

And I wish to evaluate this expression:

 maximum(abs.(h* dot.(v,points) ./ (dot.(points,points) .+ eta2)))

Which currently gives me a timing of:

@btime visc  = maximum(abs.($h* dot.($v,$points) ./ (dot.($points,$points) .+  $eta2)))
  442.800 μs (6 allocations: 2.29 MiB)
0.0024999997855364884

The speed is really good, but I just don’t understand why it is allocating 2 mb and how I avoid that?

Kind regards

Raf · January 21, 2023, 12:01pm

your missing a . on the first *.

Ahmed_Salih · January 21, 2023, 12:06pm

Thank you, this reduces it to:

@btime visc  = maximum(abs.($h .* dot.($v,$points) ./ (dot.($points,$points) .+  $eta2)))
  273.000 μs (2 allocations: 781.30 KiB)
0.0024999997855364884

So a third. Anything else I perhaps might be missing?

Ahmed_Salih · January 21, 2023, 12:09pm

I just don’t understand why this would allocate at all since it reduces down to one scalar value?

Is there some intermediate array which is not being cleaned up properly?

jules · January 21, 2023, 12:09pm

Use the form of maximum that takes a function as the first argument, and / or make a generator expression out of the broadcast because that needs to create an array right now

Ahmed_Salih · January 21, 2023, 12:26pm

Sorry, I don’t understand. I see the syntax:

maximum(abs2, A, dims=1)

But I can’t see how to convert my example into that. I tried:

f() = abs.(h .* dot.(v,points) ./ (dot.(points,points) .+  eta2))

And then maximum of that, but that did the same of course

Could you explain more please?

Kind regards

jules · January 21, 2023, 12:54pm

You’re iterating over points and v but for a maximum to be determined, not all of the results need to be stored in a vector at the same time. It is enough to compute one by one. So either you make a function to call on each element of the collection you’re interested in, or you use a generator expression which also only ever computes one element at a time. As you have two collections I use zip here, you’d do the same for the functional version a la maximum(f, zip(points, v)):

julia> @btime maximum(abs($h * dot(_v, p) / (dot(p, p) + $eta2)) for (p, _v) in zip($points, $v))
  2.237 ms (0 allocations: 0 bytes)
0.002499999788221453

The functional version:

julia> @btime maximum(zip($points, $v)) do (p, _v)
           abs($h * dot(_v, p) / (dot(p, p) + $eta2))
       end
  2.173 ms (0 allocations: 0 bytes)
0.002499999788221453

nsajko · January 21, 2023, 1:13pm

If you use mapreduce instead of maximum, then you don’t have to use zip. Something like this:

@btime mapreduce((p, _v) -> abs($h * dot(_v, p) / (dot(p, p) + $eta2)), max, $points, $v)

BTW, some obvious performance suggesions:

Always use functions: the first suggestion in the performance tips: Performance Tips · The Julia Language
Try to provide the init keyword argument to mapreduce and related functions.

Ahmed_Salih · January 21, 2023, 3:44pm

Thank you @jules and @nsajko !

Benchmarking both approaches:

 @btime maximum(zip($points, $v)) do (p, _v)
                  abs($h * dot(_v, p) / (dot(p, p) + $eta2))
                         end
  1.323 ms (0 allocations: 0 bytes)
0.002499999786538688

@btime mapreduce((p, _v) -> abs($h * dot(_v, p) / (dot(p, p) + $eta2)), max, $points, $v)
  1.296 ms (2 allocations: 781.30 KiB)
0.002499999786538688

The zip approach still seems to work better, perhaps something can be improved in the map-reduce?

Neat knowing both ways though!

Both ways are about 6 times slower though:

@btime visc  = maximum(abs.($h .* dot.($v,$points) ./ (dot.($points,$points) .+  $eta2)))
  286.900 μs (2 allocations: 781.30 KiB)
0.002499999786538688

And I am not sure why Could someone elaborate for me?

Kind regards

Ahmed_Salih · January 21, 2023, 3:52pm

By the way, thank you so much for showing me this approach!

It gives me a completely allocation free time stepping function since I am using the principle you just showed me:

And that seems to give faster end result, even if the benchmark test is slower - a bit confusing for me, but really happy for you and @nsajko taking the time to show me how one could approach it

Kind regards

nsajko · January 21, 2023, 4:04pm

Benchmarking like that is kind of pointless. The proper procedure would be organize your code into functions and then benchmark like so:

@benchmark your_function(arg1, arg2, arg3) setup=(arg1 = INIT1; arg2 = INIT2; arg3 = INIT3;)

We can’t organize your code into functions for you, because:

We’d need to have access to your entire code to understand the choices and find best design.
It’d be too much work for helping someone I don’t know

If you have some more specific questions, though, feel free to open new threads here in this forum.

lmiq · January 21, 2023, 5:20pm

You can always just write the straight loop:

julia> function maxvp(v,points,h,eta2)
           maxval = -Inf
           for i in eachindex(v,points)
                maxval = max(maxval, abs(h * dot(v[i],points[i]) / (dot(points[i],points[i]) + eta2)))
           end
           return maxval
       end
maxvp (generic function with 1 method)

julia> @btime maxvp($v,$points,$h,$eta2)
  351.774 μs (0 allocations: 0 bytes)
0.002499999787200059

If you add @fastmath to maxval = ..., time reduces by roughly 2.

Ahmed_Salih · January 21, 2023, 5:22pm

True, I tried that but it became way to slow, since I used an “if branch” instead of what you are doing with max

Thank you for showing it to me!

Kind regards

lmiq · January 21, 2023, 5:26pm

That does not really makes sense. Probably is something not being benchmarked correctly.

ademonts · January 21, 2023, 5:34pm

You can get another bit of speed like:

julia> function maxvp_2(v,points,h,eta2)
          maxval = -Inf
          for i in eachindex(v,points)
               tmp = abs(h * dot(v[i],points[i]) / (dot(points[i],points[i]) + eta2))
               tmp > maxval && (maxval = tmp)
          end
          return maxval
       end
maxvp_2 (generic function with 1 method)

julia> @btime maxvp($v,$points,$h,$eta2)
  244.788 μs (0 allocations: 0 bytes)
0.002499999783527899

julia> @btime maxvp_2($v,$points,$h,$eta2)
  136.531 μs (0 allocations: 0 bytes)
0.002499999783527899

Ahmed_Salih · January 21, 2023, 5:46pm

ademonts:

function maxvp_2(v,points,h,eta2)
          maxval = -Inf
          for i in eachindex(v,points)
               tmp = abs(h * dot(v[i],points[i]) / (dot(points[i],points[i]) + eta2))
               tmp > maxval && (maxval = tmp)
          end
          return maxval
       end

Thank you!

That did give an extra boost yes Of course now it is getting down to extremely small performance benefits, so probably best to stop improving it now hehe

rocco_sprmnt21 · January 21, 2023, 7:59pm

you can also try this

julia>  maxvp_3(v,points,h,eta2)=0.00249999978
maxvp_3 (generic function with 1 method)

julia> @btime maxvp_3($v,$points,$h,$eta2)
  1.100 ns (0 allocations: 0 bytes)
0.00249999978

this isn’t fast, but it doesn’t allocate

julia> @btime maximum(abs, Iterators.map((e,p)-> ($h* dot(e,p) / (dot(p,p) + $eta2)),$v,$points))
  1.399 ms (0 allocations: 0 bytes)
0.0024999997831612647

uniment · January 21, 2023, 11:40pm

Seems related:

TLDR: performance of maximum has a lot of room for improvement, and there is an ongoing effort to fix it.

Ahmed_Salih · January 22, 2023, 6:45am

Thank you for making me aware. Indeed seems to be the case that the most straight forward approach from a function perspective is not yet developed enough for performance

Kind regards