Sum isn't faster than a loop?

Hi guys,

I’m computing a derivative from a summation. I thought vectorizing the function would be faster than a loop, but that doesn’t seem to be the case.

using BenchmarkTools, Random

ωc = 0.05
chi = 0.01
const k = ωc^2
const couple = sqrt(2/ωc^3) * chi
const coeff = reshape([1.359e+00, 8.143e-01, 3.142e+00, 7.304e-02, 3.391e+00,
    3.142e+00, 3.099e-01, 2.009e+00, -3.142e+00, 1.714e-02, 4.847e+00,
    3.142e+00, 4.495e-03, 6.291e+00, 3.142e+00], 3, 5)

function fitted(x::T) where T<:Real
    u = 0.0
    for i in 1:5
        u += coeff[1, i] * sin(coeff[2, i] * x + coeff[3, i])
    end
    return u
end

function ∂U∂q1(x::AbstractVector{T}) where T<:Real
    ∑μ = 0.0
    for i in 1:length(x)-1
        ∑μ += fitted(x[i])
    end
    return k * (couple * ∑μ + x[end])
end

function ∂U∂q2(x::AbstractVector{T}) where T<:Real
    ∑μ = sum(fitted.(view(x, 1:length(x)-1)))
    return k * (couple * ∑μ + x[end])
end

n = 100
x = Random.rand(n)
@btime ∂U∂q1(x)
@btime ∂U∂q2(x)

Output for n = 10

  254.728 ns (1 allocation: 16 bytes)
  302.381 ns (3 allocations: 224 bytes)

Output for n = 100

  2.711 μs (1 allocation: 16 bytes)
  2.767 μs (3 allocations: 960 bytes)

Output for n = 200

  5.567 μs (1 allocation: 16 bytes)
  5.667 μs (3 allocations: 1.83 KiB)

So the first version with a hand-written loop is both slightly faster and more memory efficient.

Is there a way to make sum faster and also avoid the additional allocations? Or is there an even better way of implementing the summation?

I am no expert on this, but I think people (me included) assume that vectorization makes things faster because that is the case in certain languages such as R.

To quote the Julia docs:
In R, performance requires vectorization. In Julia, almost the opposite is true: the best performing code is often achieved by using devectorized loops.

see for instance here
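To make the quoted point concrete, here is a small self-contained sketch (not from the thread): the “vectorized” broadcast materializes a temporary array before summing, while sum(f, xs) and an explicit loop accumulate on the fly.

```julia
x = rand(1000)

sum_broadcast(x) = sum(abs2.(x))   # abs2.(x) allocates a 1000-element temporary
sum_functional(x) = sum(abs2, x)   # maps inside the reduction: no temporary

function sum_loop(x)               # explicit devectorized loop: no temporary
    s = 0.0
    for xi in x
        s += xi * xi
    end
    return s
end

# All three agree up to floating-point rounding:
sum_broadcast(x) ≈ sum_functional(x) ≈ sum_loop(x)
```

Benchmarking these with @btime shows the broadcast version allocating while the other two do not, which is exactly the pattern in the timings below.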


One thing that inevitably comes up in these threads is that your naive summing algorithm is different from what Julia’s sum uses: sum does pairwise summation, which is a more accurate floating-point summation algorithm. If you are certain that your problem won’t encounter any of the errors that plague simple straight summation, then that is fine. If you might encounter catastrophic cancellation or anything like that, go with Base sum.
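A small self-contained illustration of that accuracy difference (not from the original post): accumulating many Float32 values strictly left-to-right loses precision once the running sum grows large, while Base’s pairwise sum stays much closer to the true value.

```julia
xs = fill(0.1f0, 10^7)    # true sum is 1.0f6

naive = foldl(+, xs)      # straight left-to-right accumulation
pairwise = sum(xs)        # Base sum uses pairwise summation

# The naive error is far larger than the pairwise error:
abs(naive - 1.0f6), abs(pairwise - 1.0f6)
```

Once the running total is around 10^6, the Float32 spacing between representable numbers is larger than 0.1, so each naive addition rounds badly; pairwise summation keeps the partial sums small for much longer.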


Hmm… I think this is exactly what I saw during some tests I did last week.

I think it’s unlikely to happen. It might happen for this trivial example code.
But in my real code, x[2:end-1] are all around a certain value, so they give pretty much the same output from the fitted function.

Yup, “vectorizing”, in the sense that Matlab and numpy use it, is generally neither necessary nor useful in Julia. If you need a loop, just use a loop!

I discussed this example with sum in particular in this talk: (search for “Julia is fast”)


Note that you can also write this:

    v = view(x, 1:length(x)-1)
    ∑μ = sum(fitted, v)

or this

    using LazyArrays: @~  # just to construct a Base.Broadcast.Broadcasted
    v = view(x, 1:length(x)-1)
    ∑μ = sum(@~ fitted.(v))

Both should avoid the allocations, and the lazy broadcast sum is fast on Julia 1.5. They won’t beat the simple loop, but can be tidier.
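For completeness, here is the sum(f, itr) pattern as a self-contained sketch, with a stand-in integrand g (a hypothetical function, not the thread’s fitted), showing that the loop and the non-allocating functional form agree:

```julia
g(x) = sin(3x + 1)   # stand-in for the thread's fitted function

function tail_sum_loop(x)            # hand-written loop over all but the last element
    s = 0.0
    for i in 1:length(x)-1
        s += g(x[i])
    end
    return s
end

# sum(f, itr) over a view: no temporary array is materialized
tail_sum_nonalloc(x) = sum(g, view(x, 1:length(x)-1))

x = rand(100)
tail_sum_loop(x) ≈ tail_sum_nonalloc(x)   # same result, up to rounding
```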


Thanks a lot for this document! I have to bookmark it, so I can spend some time to digest it. :grinning:

Both indeed get rid of the large allocation. And yes, they are slightly slower, but surely more readable than what I wrote.
I think view's allocation problem is solved on 1.5, but on 1.4.1 there are still 1 or 2 allocations.

As others said, sum is more accurate because it uses pairwise summation to control rounding error. Also, you don’t need to use broadcasting and views to avoid allocations completely. Note that sum accepts a function as its first argument, so you can simply do:

function ∂U∂q2(x::AbstractVector{T}) where T<:Real
    # ∑μ = sum(fitted.(view(x, 1:length(x)-1)))
    ∑μ = sum(fitted, x[i] for i = 1:length(x)-1)
    return k * (couple * ∑μ + x[end])
end

which, when timed in a script, can yield the same speed as the loop:

n = 100
x = Random.rand(n)
@btime ∂U∂q1($x)
@btime ∂U∂q2($x)

  3.112 μs (0 allocations: 0 bytes)
  3.125 μs (0 allocations: 0 bytes)

In the REPL, however, I observe a tiny (within 3%) edge in favor of the manual loop. These results were obtained on Julia 1.5.0-beta1.0.


Remember to interpolate when benchmarking with BenchmarkTools.jl, at least when using non-const globals:

@btime ∂U∂q1($x)  # note the $ sign
@btime ∂U∂q2($x)

Yes, thank you. I copy-pasted the OP’s code and didn’t notice it. Now I see 0 allocations in total. Edited.