Loop over multi-dimensional array optimization

Hi,

I've hit a performance bottleneck in some array operations. I tried two versions: dot fusion and an explicit loop. To my surprise, the dot-fusion version not only takes less time, it also allocates less memory, which contradicts my previous understanding.

Here is the simplified working code:

x = range(0.0, 1.0, length=10000)
y = 0.5:0.5:1.5
z = 0.5:0.5:1.5
var = zeros(10000,3,3,3)

function foo(hx::StepRangeLen,hy::StepRangeLen,hz::StepRangeLen,
   var::Array{Float64,4})::Array{Float64,3}

   siz = size(var)::NTuple{4,Int64}

   px = zeros(Float64,siz[1:3]...)
   qy = zeros(Float64,siz[1:3]...)
   rz = zeros(Float64,siz[1:3]...)

   n = size(hx,1)

   if n > 2
      @time for k=1:siz[3], j=1:siz[2], i=1:n-2
         px[i+1,j,k] = (var[i+2,j,k,1] - var[i,j,k,1])/(hx[i+2] - hx[i])
      end
      @time @. px[2:n-1,:,:] = (var[3:n,:,:,1] - var[1:n-2,:,:,1])/(hx[3:n] - hx[1:n-2])
   end

   px
end

px = foo(x,y,z,var);

When I run it on my laptop the second time, I get:

px = foo(x,y,z,var);
  0.006217 seconds (175.37 k allocations: 2.676 MiB)
  0.001222 seconds (18 allocations: 1.374 MiB)

In my real code, I have many similar array operations in the kernel functions, so I really need good performance. I tried @code_warntype on the prototype function foo, but the output is still hard for me to follow.

Any advice is appreciated!

Hi,

It would be nice if we could simply copy and paste your code and run it…
I don't know what the foo arguments are.

  • For timing, you should add @inbounds if you want to avoid measuring bounds checks (or run julia with --check-bounds=no)
  • It is also much more accurate to use @btime from BenchmarkTools
  • It is usually faster to use px[2:end-1,:,:] instead of px[2:n-1,:,:]
  • Julia arrays are stored in column-major order, so x[:,2] is faster than x[2,:]
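To illustrate the column-major point, here is a minimal sketch (the array size and function names are made up for illustration):

```julia
A = rand(2000, 2000)

# Column-major: the first index varies fastest in memory, so making i the
# innermost loop gives contiguous, cache-friendly access.
function sum_cols(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)  # i innermost: contiguous
        s += A[i, j]
    end
    s
end

# Same computation, but with j innermost: each step jumps a whole column
# in memory, which is much less cache-friendly.
function sum_rows(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)  # j innermost: strided
        s += A[i, j]
    end
    s
end

@time sum_cols(A)  # typically noticeably faster than sum_rows(A)
@time sum_rows(A)
```

(For reliable numbers, benchmark these with @btime from BenchmarkTools rather than @time.)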

Hope this helps!

OK, I think you measured compilation time.

In the following form, both expressions run at the same speed.

348.870 μs (0 allocations: 0 bytes)
310.066 μs (5 allocations: 1.37 MiB)
0.000384 seconds (6 allocations: 352 bytes)
0.000350 seconds (11 allocations: 1.374 MiB)
0.000376 seconds (3 allocations: 128 bytes)
0.000347 seconds (8 allocations: 1.373 MiB)

using BenchmarkTools

function compute_px1(siz,px,n,hx,var)
   @inbounds for k=1:siz[3], j=1:siz[2], i=1:n-2
         px[i+1,j,k] = (var[i+2,j,k,1] - var[i,j,k,1])/(hx[i+2] - hx[i])
   end
end
compute_px2(siz,px,n,hx,var) = @inbounds @. px[2:n-1,:,:] = (var[3:n,:,:,1] - var[1:n-2,:,:,1])/(hx[3:n] - hx[1:n-2])


function foo(hx::StepRangeLen,hy::StepRangeLen,hz::StepRangeLen,
   var::Array{Float64,4})::Array{Float64,3}

   siz = size(var)::NTuple{4,Int64}

   px = zeros(Float64,siz[1:3]...)
   qy = zeros(Float64,siz[1:3]...)
   rz = zeros(Float64,siz[1:3]...)

   n = size(hx,1)

   if n > 2
      @btime compute_px1($siz,$px,$n,$hx,$var)
      @btime compute_px2($siz,$px,$n,$hx,$var)
      @time compute_px1(siz,px,n,hx,var)
      @time compute_px2(siz,px,n,hx,var)
      @time compute_px1(siz,px,n,hx,var)
      @time compute_px2(siz,px,n,hx,var)

   end

   px
end

function test()
   x = range(0.0, 1.0, length=10000)
   y = 0.5:0.5:1.5
   z = 0.5:0.5:1.5
   var = zeros(10000,3,3,3)
   px = foo(x,y,z,var);
end
test()

Actually, the loop version runs faster as:

function compute_px1(siz,px,n,hx,var)
   @inbounds for i=1:n-2, j=1:siz[2], k=1:siz[3]
         px[i+1,j,k] = (var[i+2,j,k,1] - var[i,j,k,1])/(hx[i+2] - hx[i])
   end
end

105.376 μs (0 allocations: 0 bytes)
308.456 μs (5 allocations: 1.37 MiB)
0.000128 seconds (6 allocations: 352 bytes)
0.000342 seconds (11 allocations: 1.374 MiB)
0.000111 seconds (3 allocations: 128 bytes)
0.000332 seconds (8 allocations: 1.373 MiB)

You have a type instability with

julia> @code_warntype foo(x,y,z,var);
Variables
  ...
  px::Any
  qy::Any
  rz::Any

Those are type unstable because slicing a tuple with a range like that is not inferable: Julia isn't sure how many arguments zeros will get (and thus doesn't know which method of zeros will be called). Instead of slicing the tuple, use Base.front, which does the magic you need:

julia> t = (1,2,3,4);

julia> @code_warntype t[1:3]
…
Body::Tuple{Vararg{Int64,N} where N}
…

julia> Base.front(t)
(1, 2, 3)

julia> @code_warntype Base.front(t)
Variables
  #self#::Core.Compiler.Const(Base.front, false)
  t::NTuple{4,Int64}

Body::Tuple{Int64,Int64,Int64}
…

julia> function foo(hx,hy,hz,var)

          siz = size(var)

          px = zeros(Float64,Base.front(siz))
          qy = zeros(Float64,Base.front(siz))
          rz = zeros(Float64,Base.front(siz))

          n = size(hx,1)

          if n > 2
             @time for k=1:siz[3], j=1:siz[2], i=1:n-2
                px[i+1,j,k] = (var[i+2,j,k,1] - var[i,j,k,1])/(hx[i+2] - hx[i])
             end
             @time @. px[2:n-1,:,:] = (var[3:n,:,:,1] - var[1:n-2,:,:,1])/(hx[3:n] - hx[1:n-2])
          end

          px
       end
foo (generic function with 2 methods)

julia> foo(x,y,z,var);
  0.001153 seconds
  0.001407 seconds (4 allocations: 1.373 MiB)

Now there are no longer any allocations in the for loop, and it is indeed faster! Those allocations were the canary in the coal mine chirping out the type instability.


You probably want to use @views here. (@inbounds doesn't do much here, either.)
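As a minimal sketch of what @views does (the array here is hypothetical):

```julia
x = rand(1000)

# Without @views, x[2:end-1] allocates a copy before summing.
y1 = sum(x[2:end-1])

# With @views, the slice becomes a non-allocating SubArray view
# into the original data.
y2 = @views sum(x[2:end-1])

# Same result, but the second version avoids the slice copy.
```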


This is pretty interesting. I merged the suggested answers in one script:

using BenchmarkTools

function compute_px1(siz,px,n,hx,var)
   @inbounds for k=1:siz[3], j=1:siz[2], i=1:n-2
         px[i+1,j,k] = (var[i+2,j,k,1] - var[i,j,k,1])/(hx[i+2] - hx[i])
   end
end
function compute_px2(siz,px,n,hx,var)
   @inbounds for i=1:n-2, j=1:siz[2], k=1:siz[3]
         px[i+1,j,k] = (var[i+2,j,k,1] - var[i,j,k,1])/(hx[i+2] - hx[i])
   end
end
compute_px3(siz,px,n,hx,var) = @inbounds @. px[2:n-1,:,:] = (var[3:n,:,:,1] - var[1:n-2,:,:,1])/(hx[3:n] - hx[1:n-2])
compute_px4(siz,px,n,hx,var) = @views @. px[2:n-1,:,:] = (var[3:n,:,:,1] - var[1:n-2,:,:,1])/(hx[3:n] - hx[1:n-2])


function foo(hx::StepRangeLen,hy::StepRangeLen,hz::StepRangeLen,
   var::Array{Float64,4})::Array{Float64,3}

   siz = size(var)::NTuple{4,Int64}

   px = zeros(Float64,siz[1:3]...)
   qy = zeros(Float64,siz[1:3]...)
   rz = zeros(Float64,siz[1:3]...)

   n = size(hx,1)

   if n > 2
      @btime compute_px1($siz,$px,$n,$hx,$var)
      @btime compute_px2($siz,$px,$n,$hx,$var)
      @btime compute_px3($siz,$px,$n,$hx,$var)
      @btime compute_px4($siz,$px,$n,$hx,$var)
      @time compute_px1(siz,px,n,hx,var)
      @time compute_px2(siz,px,n,hx,var)
      @time compute_px3(siz,px,n,hx,var)
      @time compute_px4(siz,px,n,hx,var)
   end

   px
end

function foo_stable(hx::StepRangeLen,hy::StepRangeLen,hz::StepRangeLen,
   var::Array{Float64,4})::Array{Float64,3}

   siz = size(var)::NTuple{4,Int64}

   px = zeros(Float64,Base.front(siz))
   qy = zeros(Float64,Base.front(siz))
   rz = zeros(Float64,Base.front(siz))

   n = size(hx,1)

   if n > 2
      @btime compute_px1($siz,$px,$n,$hx,$var)
      @btime compute_px2($siz,$px,$n,$hx,$var)
      @btime compute_px3($siz,$px,$n,$hx,$var)
      @btime compute_px4($siz,$px,$n,$hx,$var)
      @time compute_px1(siz,px,n,hx,var)
      @time compute_px2(siz,px,n,hx,var)
      @time compute_px3(siz,px,n,hx,var)
      @time compute_px4(siz,px,n,hx,var)
   end

   px
end

function test()
   x = range(0.0, 1.0, length=10000)
   y = 0.5:0.5:1.5
   z = 0.5:0.5:1.5
   var = zeros(10000,3,3,3)
   px = foo(x,y,z,var);
   px = foo_stable(x,y,z,var);
end
test()

which gives me

  563.601 μs (0 allocations: 0 bytes)
  213.323 μs (0 allocations: 0 bytes)
  428.222 μs (5 allocations: 1.37 MiB)
  537.553 μs (3 allocations: 224 bytes)
  0.000674 seconds (4 allocations: 208 bytes)
  0.000289 seconds (4 allocations: 208 bytes)
  0.000536 seconds (9 allocations: 1.373 MiB)
  0.000616 seconds (7 allocations: 432 bytes)
  564.518 μs (0 allocations: 0 bytes)
  213.356 μs (0 allocations: 0 bytes)
  428.732 μs (5 allocations: 1.37 MiB)
  537.668 μs (3 allocations: 224 bytes)
  0.000609 seconds
  0.000248 seconds
  0.000484 seconds (5 allocations: 1.373 MiB)
  0.000605 seconds (3 allocations: 224 bytes)

So altogether, the explicit loop version with (i,j,k) loop ordering seems to be optimal.

BTW, there’s no documentation for the Base.front function…

function compute_px1(siz,px,n,hx,var)
   @inbounds for i=1:n-2, j=1:siz[2], k=1:siz[3]
         px[i+1,j,k] = (var[i+2,j,k,1] - var[i,j,k,1])/(hx[i+2] - hx[i])
   end
end

Do you know why (i,j,k) is faster than (k,j,i) in this case?

I find it strange how buried and undocumented Base.front and Base.tail are, given how central they are to performance when working with tuples of unknown length or mixed element types. We need to have a “Taking recursion seriously” discussion.
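For reference, a small sketch of the recursion style that Base.tail enables (mysum is a made-up name):

```julia
# Type-stable recursive sum over a (possibly heterogeneous) tuple:
# each call peels off the first element with Base.tail, so the compiler
# knows the exact tuple type at every step of the recursion.
mysum(t::Tuple{}) = 0
mysum(t::Tuple) = first(t) + mysum(Base.tail(t))

mysum((1, 2.0, 3))  # 6.0, across mixed element types
```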

I did try using @views. It used less memory, but actually ran slower.

I think this is an advanced topic, and debugging type inference when it fails (Base.tail is one of the solutions, but it does not always help) is somewhat tricky. Probably the lecture notes in

are the best introduction.

A tutorial-like writeup would be nice, but I think this is a topic for a blog post, not the manual (yet).