How to match performance of sum(A, dims=1)?

Well, this is a bit embarrassing… I’m writing up a set of slides to present to colleagues about Julia, and I wanted to show an example of how explicit looping and so on is fine and dandy and as fast as any other language construct.

However! I actually can’t get my mysum() function to match Julia’s sum(A, dims=1) (summing a 2D array along the first axis):

using BenchmarkTools

arr = rand(1000, 100_000)

@btime sum($arr, dims=1);
# ~ 56 ms

function mysum(arr)
    result = zeros(eltype(arr), size(arr, 2))

    for j in axes(arr, 2)
        for i in axes(arr, 1)
            result[j] += arr[i, j]
        end
    end
    return result
end

@btime mysum($arr);
# ~ 80 ms

I’ve checked the obvious: it’s column-major iteration, there are no allocations (except the initial result array), and it’s type stable. I’ve also tried initialising an undef result array, and summing into a temporary variable for each column (in case it was the memory accesses on the heap that were slowing me down), also with no effect.
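In case it helps, the checks looked roughly like this:

@code_warntype mysum(arr)  # return type inferred as Vector{Float64}, no Any in sight
@allocated mysum(arr)      # ~800 kB, i.e. just the result array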

How do I speed this up?


Did you try @inbounds?

Yes, indeed. No effect.

Just upgraded from 1.8.0 to 1.8.1, and my version is now twice as slow. That’s a really curious performance degradation.

Have you tried @turbo from LoopVectorization.jl?

@Oscar_Smith I have, yes, but I wouldn’t expect it to have an effect, since this is a reduction and SIMD doesn’t easily apply.

I think SIMD does apply. I recall finding this out while preparing a lecture for students last year; plain @simd was sufficient.

Here is the link to the source of our lecture
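The pattern is basically this (a minimal 1D sketch, not the lecture’s exact code):

function simdsum(v)
    s = zero(eltype(v))
    @simd for i in eachindex(v)
        @inbounds s += v[i]
    end
    return s
end

With @simd the compiler is allowed to reassociate the additions and vectorize the reduction.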

@turbo works for me.

julia> using LoopVectorization  # provides @turbo

julia> function turbosum(arr)
           result = zeros(eltype(arr), size(arr, 2))
       
           @turbo for j in axes(arr, 2)
               for i in axes(arr, 1)
                   result[j] += arr[i, j]
               end
           end
           return result
       end
turbosum (generic function with 1 method)

julia> @btime turbosum($arr);
  31.189 ms (2 allocations: 781.30 KiB)

julia> @btime sum($arr; dims=1);
  42.052 ms (2 allocations: 781.30 KiB)

That sounds pretty bad. I can confirm that the results are quite bad for the manual version on v1.8.1, but I get equally bad timings on v1.7.3.

Right, so with 1.8.1, @turbo indeed gives a big improvement. But this is no longer the teaching moment I was hoping for, since ‘sprinkle with @turbo’ doesn’t really help people understand what’s going on here.

Also, I wonder how Base is doing this, without access to LoopVectorization magic…


You could just look at the Base code with @edit?
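For example:

julia> @edit sum(arr, dims=1)  # opens the corresponding method in your editor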

The Base code is generated by macros and isn’t easy to read.

The difference seems entirely due to @simd:

julia> function mysum(arr)
           result = zeros(eltype(arr), size(arr, 2))
           @inbounds for j in axes(arr, 2)
               t = zero(eltype(arr))
               for i in axes(arr, 1)
                   t += arr[i, j]
               end
               result[j] = t
           end
           return result
       end
mysum (generic function with 1 method)

julia> @btime mysum($arr);
  165.281 ms (2 allocations: 781.30 KiB)

julia> function mysum(arr)
           result = zeros(eltype(arr), size(arr, 2))
           @inbounds for j in axes(arr, 2)
               t = zero(eltype(arr))
               @simd for i in axes(arr, 1)
                   t += arr[i, j]
               end
               result[j] = t
           end
           return result
       end
mysum (generic function with 1 method)

julia> @btime mysum($arr);
  53.112 ms (2 allocations: 781.30 KiB)

julia> @btime sum($arr, dims=1);
  52.268 ms (2 allocations: 781.30 KiB)

It’s turtles all the way down, but I eventually got to the all-important Base.mapreduce_impl(), which does a little magic with block sizes and @simd.
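The idea, as far as I can tell, is roughly this (a simplified 1D sketch, not Base’s actual code):

function blocksum(a, lo, hi, blksize=1024)
    if hi - lo < blksize
        # block is small enough: finish it off with a plain @simd loop
        s = zero(eltype(a))
        @simd for i in lo:hi
            @inbounds s += a[i]
        end
        return s
    else
        # otherwise split in half and recurse; this pairwise scheme also
        # keeps the accumulated floating-point error small
        mid = (lo + hi) >>> 1
        return blocksum(a, lo, mid, blksize) + blocksum(a, mid + 1, hi, blksize)
    end
end

Called as blocksum(v, 1, length(v)) on a vector v, this gives the same flavour of blocked, SIMD-friendly reduction.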

You’re right. I need both @simd and a temporary loop variable for it to work. (Presumably the temporary matters because accumulating directly into result[j] forces a memory store on every iteration, which gets in the way of vectorization.)

I’m a little confused here though, since I didn’t think @simd worked with reduction variables, and now it seems to. Am I guaranteed that the sum won’t suffer from race conditions?

1 Like

I don’t think race conditions are an issue, since IIUC, we’re not accessing the same memory location from multiple tasks. What @simd does is reorder associative operations, so it may evaluate a + (b+c) instead of (a+b) + c. This means that if we use @simd in reductions, it will likely change the result due to floating-point rounding errors. We may check that mysum(arr) differs from vec(sum(arr, dims=1)), although they’re approximately equal.
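For example:

mysum(arr) == vec(sum(arr, dims=1))  # typically false, the rounding differs
mysum(arr) ≈ vec(sum(arr, dims=1))   # true, equal up to floating-point error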
