How to match performance of sum(A, dims=1)?

Well, this is a bit embarrassing… I’m writing up a set of slides to present to colleagues about Julia, and I wanted to show an example of how explicit looping and so on is fine and dandy and as fast as any other language construct.

However! I actually can’t get my mysum() function to match Julia’s sum(A, dims=1) (summing a 2D array along the first axis):

using BenchmarkTools

arr = rand(1000, 100_000)

@btime sum($arr, dims=1);
# ~ 56 ms

function mysum(arr)
    result = zeros(eltype(arr), size(arr, 2))

    for j in axes(arr, 2)
        for i in axes(arr, 1)
            result[j] += arr[i, j]
        end
    end
    return result
end

@btime mysum($arr);
# ~ 80 ms

I’ve checked the obvious: it’s column-major iteration, there are no allocations (except the initial result array), and it’s type stable. I’ve also tried initialising an undef result array, and summing into a temporary variable for each column (in case it was the memory accesses on the heap that were slowing me down), also with no effect.
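In case it helps, the checks looked roughly like this:

@code_warntype mysum(arr)  # return type inferred as Vector{Float64}, no Any in sight
@allocated mysum(arr)      # ~800 kB, i.e. just the result array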

How do I speed this up?


Did you try @inbounds?

Yes, indeed. No effect.

Just upgraded from 1.8.0 to 1.8.1, and my version is now twice as slow. That’s a really curious performance degradation.

Have you tried @turbo from LoopVectorization.jl?

@Oscar_Smith I have, yes, but I wouldn’t expect it to have an effect, since this is a reduction and SIMD doesn’t easily apply.

I think SIMD does apply. I recall finding this out while preparing a lecture for students last year; plain @simd was sufficient.

Here is the link to the source of our lecture
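The pattern is basically this (a minimal 1D sketch, not the lecture’s exact code):

function simdsum(v)
    s = zero(eltype(v))
    @simd for i in eachindex(v)
        @inbounds s += v[i]
    end
    return s
end

With @simd the compiler is allowed to reassociate the additions and vectorize the reduction.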

@turbo works for me.

julia> using LoopVectorization  # provides @turbo

julia> function turbosum(arr)
           result = zeros(eltype(arr), size(arr, 2))
       
           @turbo for j in axes(arr, 2)
               for i in axes(arr, 1)
                   result[j] += arr[i, j]
               end
           end
           return result
       end
turbosum (generic function with 1 method)

julia> @btime turbosum($arr);
  31.189 ms (2 allocations: 781.30 KiB)

julia> @btime sum($arr; dims=1);
  42.052 ms (2 allocations: 781.30 KiB)

That sounds pretty bad. I can confirm that the results are quite bad for the manual version on v1.8.1, but I get equally bad timings on v1.7.3.

Right, so with 1.8.1, @turbo indeed gives a big improvement. But this is no longer the teaching moment I was hoping for, since ‘sprinkle with @turbo’ doesn’t really help people understand what’s going on here.

Also, I wonder how Base is doing this, without access to LoopVectorization magic…


You could just look at the Base code with @edit?
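For example:

julia> @edit sum(arr, dims=1)  # opens the corresponding method in your editor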

The Base code is generated by macros and isn’t easy to read.

The difference seems entirely due to @simd:

julia> function mysum(arr)
           result = zeros(eltype(arr), size(arr, 2))
           @inbounds for j in axes(arr, 2)
               t = zero(eltype(arr))
               for i in axes(arr, 1)
                   t += arr[i, j]
               end
               result[j] = t
           end
           return result
       end
mysum (generic function with 1 method)

julia> @btime mysum($arr);
  165.281 ms (2 allocations: 781.30 KiB)

julia> function mysum(arr)
           result = zeros(eltype(arr), size(arr, 2))
           @inbounds for j in axes(arr, 2)
               t = zero(eltype(arr))
               @simd for i in axes(arr, 1)
                   t += arr[i, j]
               end
               result[j] = t
           end
           return result
       end
mysum (generic function with 1 method)

julia> @btime mysum($arr);
  53.112 ms (2 allocations: 781.30 KiB)

julia> @btime sum($arr, dims=1);
  52.268 ms (2 allocations: 781.30 KiB)

It’s turtles all the way down, but I eventually got to the all-important Base.mapreduce_impl(), which does a little magic with block sizes and @simd.
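The idea, as far as I can tell, is roughly this (a simplified 1D sketch, not Base’s actual code):

function blocksum(a, lo, hi, blksize=1024)
    if hi - lo < blksize
        # block is small enough: finish it off with a plain @simd loop
        s = zero(eltype(a))
        @simd for i in lo:hi
            @inbounds s += a[i]
        end
        return s
    else
        # otherwise split in half and recurse; this pairwise scheme also
        # keeps the accumulated floating-point error small
        mid = (lo + hi) >>> 1
        return blocksum(a, lo, mid, blksize) + blocksum(a, mid + 1, hi, blksize)
    end
end

Called as blocksum(v, 1, length(v)) on a vector v, this gives the same flavour of blocked, SIMD-friendly reduction.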

You’re right. I need both @simd and a temporary loop variable for it to work. (Presumably the temporary matters because accumulating directly into result[j] forces a memory store on every iteration, which gets in the way of vectorization.)

I’m a little confused here though, since I didn’t think @simd worked with reduction variables, and now it seems to. Am I guaranteed that the sum won’t suffer from race conditions?

1 Like

I don’t think race conditions are an issue, since IIUC, we’re not accessing the same memory location from multiple tasks. What @simd does is reorder associative operations, so it may evaluate a + (b+c) instead of (a+b) + c. This means that if we use @simd in reductions, it will likely change the result due to floating-point rounding errors. We may check that mysum(arr) differs from vec(sum(arr, dims=1)), although they’re approximately equal.
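For example:

mysum(arr) == vec(sum(arr, dims=1))  # typically false, the rounding differs
mysum(arr) ≈ vec(sum(arr, dims=1))   # true, equal up to floating-point error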
