Sum isn't faster than a loop?

tautomer · June 24, 2020, 7:40pm

Hi guys,

I’m computing a derivative from a summation. I thought vectorizing the function would be faster than a loop, but it isn’t the case.

using BenchmarkTools, Random

ωc = 0.05
chi = 0.01
const k = ωc^2
const couple = sqrt(2/ωc^3) * chi
const coeff = reshape([1.359e+00, 8.143e-01, 3.142e+00, 7.304e-02, 3.391e+00,
    3.142e+00, 3.099e-01, 2.009e+00, -3.142e+00, 1.714e-02, 4.847e+00,
    3.142e+00, 4.495e-03, 6.291e+00, 3.142e+00], 3, 5)

function fitted(x::T) where T<:Real
    u = 0.0
    for i in 1:5
        u += coeff[1, i] * sin(coeff[2, i] * x + coeff[3, i])
    end
    return u
end

function ∂U∂q1(x::AbstractVector{T}) where T<:Real
    ∑μ = 0.0
    for i in 1:length(x)-1
        ∑μ += fitted(x[i])
    end
    return k * (couple * ∑μ + x[end])
end

function ∂U∂q2(x::AbstractVector{T}) where T<:Real
    ∑μ = sum(fitted.(view(x, 1:length(x)-1)))
    return k * (couple * ∑μ + x[end])
end


n = 100
x = Random.rand(n)
@btime ∂U∂q1(x)
@btime ∂U∂q2(x)

Ouput for n = 10

  254.728 ns (1 allocation: 16 bytes)
  302.381 ns (3 allocations: 224 bytes)

Ouput for n = 100

  2.711 μs (1 allocation: 16 bytes)
  2.767 μs (3 allocations: 960 bytes)

Ouput for n = 100

  5.567 μs (1 allocation: 16 bytes)
  5.667 μs (3 allocations: 1.83 KiB)

So the first version with a hand-written loop is both faster (very slightly maybe) and more memory efficient.

Is there a way to make sum faster and also avoid additional allocations? Or if there is any even better way of implementing the summation?

bernhard · June 24, 2020, 7:43pm

I am no expert on this, but I think we/people (me included) assume that vectorization makes things faster, because that is the case in certain languages such as R.

To quote the julia docs:
In R, performance requires vectorization. In Julia, almost the opposite is true: the best performing code is often achieved by using devectorized loops.

see for instance here
https://docs.julialang.org/en/v1/manual/noteworthy-differences/

tbeason · June 24, 2020, 7:50pm

One thing that inevitably comes up in these threads is that your naive summing algorithm is different than what Julia’s sum uses, which is a more accurate floating point summation algorithm. If you are certain that your problem won’t encounter any of the errors that plague the simple straight summation, then that is fine. If you might encounter catastrophic cancellation or anything like that, go with Base sum.

tautomer · June 24, 2020, 7:54pm

Hmm… I think this is exactly what I have seen during some tests last week I did.

tautomer · June 24, 2020, 7:56pm

I think it’s unlikely to happen. It might happen for this trivial example code.
But in my reall code, x[2:end-1] are all around a certain value, so they have pretty much the same output from the fitted function.

rdeits · June 24, 2020, 7:57pm

Yup, “vectorizing”, in the sense that Matlab and numpy use it, is generally neither necessary nor useful in Julia. If you need a loop, just use a loop!

I discussed this example with sum in particular in this talk: https://github.com/rdeits/DetroitTechWatch2020.jl/blob/master/Intro%20to%20Julia.ipynb (search for “Julia is fast”)

mcabbott · June 24, 2020, 8:06pm

Note that you can also write this:

    v = view(x, 1:length(x)-1)
    ∑μ = sum(fitted, v)

or this

using LazyArrays: @~ # just to construct Base.Broadcast.Broadcasted
...
    v = view(x, 1:length(x)-1)
    ∑μ = sum(@~ fitted.(v))

Both should avoid the allocations, and the lazy broadcast sum is fast on Julia 1.5. They won’t beat the simple loop, but can be tidier.

tautomer · June 24, 2020, 8:25pm

Thanks a lot for this document! I have to bookmark it, so I can spend some time to digest it.

tautomer · June 24, 2020, 8:28pm

Both indeed get rid of the large allocation. And yes, they are slightly slower, but surely more readable than what I wrote.
I think view’s allocation problem is solved on 1.5, but on 1.4.1 there are still 1 or 2 allocations.

Seif_Shebl · June 24, 2020, 9:08pm

As others said, sum is more accurate because it takes care of rounding. Also, you don’t need to use broadcasting and views to avoid allocations completely. Note that sum accepts a function as first argument, so you can simply do:

function ∂U∂q2(x::AbstractVector{T}) where T<:Real
    # ∑μ = sum(fitted.(view(x, 1:length(x)-1)))
    ∑μ = sum(fitted, x[i] for i = 1:length(x)-1)
    return k * (couple * ∑μ + x[end])
end

which when timed in a script, can yield same speed as loops:

n = 100
x = Random.rand(n)
@btime ∂U∂q1($x)
@btime ∂U∂q2($x)

  3.112 μs (0 allocations: 0 bytes)
  3.125 μs (0 allocations: 0 bytes)

In REPL, however, I observe a tiny (within 3%) edge in favor of manual loops. These results are obtained from Julia 1.5.0-beta1.0.

DNF · June 24, 2020, 9:38pm

Remember to interpolate when benchmarking with BenchmarkTools.jl, at least when using non-const globals

@btime ∂U∂q1($x)  # note the $ sign
@btime ∂U∂q2($x)

Seif_Shebl · June 24, 2020, 9:53pm

Yes, thank you. I copy-pasted OP’s code and didn’t notice it. Now I see a total 0 allocations. Edited.

Topic		Replies	Views
Faster sum loop when looping through vector multiple times than once Performance	5	285	September 29, 2023
How to improve performance of sum() Performance question , fast-math , tullio , loopvectorization , sum	19	4798	July 19, 2021
Loop vs vectorization Performance	4	1399	October 10, 2023
Why is sum slower than multiplying by vector of ones? Performance	4	463	September 28, 2023
Summing matrix elements is >1000X slower than summing vector elements General Usage performance	8	1321	April 17, 2017

Sum isn't faster than a loop?

Related topics