Generators speed


#1

Hi,

julia> N = 10000000; A = randn(N); @time sum(A); @time sum(A[i] for i in 1:N); @time sum(a for a in A)
  0.007686 seconds (5 allocations: 176 bytes)
  0.640131 seconds (30.02 M allocations: 458.566 MB, 13.64% gc time)
  0.039558 seconds (16.76 k allocations: 713.857 KB)

in both 0.5 and 0.6rc. Am I wrong in thinking the second and third versions should also be fast, based on the paragraph “Generators” of https://julialang.org/blog/2016/10/julia-0.5-highlights?


#2

You should not benchmark in global scope.

using BenchmarkTools

N = 10000000
A = randn(N)

sum1(A) = sum(A)
sum2(A) = sum(A[i] for i in 1:N)
sum3(A) = sum(a for a in A)

then

julia> @benchmark sum1($A)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.576 ms (0.00% GC)
  median time:      5.744 ms (0.00% GC)
  mean time:        5.672 ms (0.00% GC)
  maximum time:     6.901 ms (0.00% GC)
  --------------
  samples:          881
  evals/sample:     1

julia> @benchmark sum2($A)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  5
  --------------
  minimum time:     12.764 ms (0.00% GC)
  median time:      12.885 ms (0.00% GC)
  mean time:        12.972 ms (0.00% GC)
  maximum time:     16.242 ms (0.00% GC)
  --------------
  samples:          386
  evals/sample:     1

julia> @benchmark sum3($A)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  3
  --------------
  minimum time:     12.761 ms (0.00% GC)
  median time:      12.874 ms (0.00% GC)
  mean time:        12.941 ms (0.00% GC)
  maximum time:     16.360 ms (0.00% GC)
  --------------
  samples:          387
  evals/sample:     1

so the difference is 2-3x. Above using v0.6-rc1.


#3

FWIW sum2 is still using global variables…


#4

That’s right, sorry about that! I was extracting the code from a function that had weird performance and forgot about the (very annoying) globals problem. So here’s a more interesting benchmark, closer to what I’m actually doing:

using BenchmarkTools

const N = 10000000
A = randn(N)

sum1(A) = sum(A.*A.+A)
sum2(A) = sum(A.*A+A)
sum3(A) = sum(A[i]*A[i]+A[i] for i in 1:N)
sum4(A) = sum(a*a+a for a in A)
function sum5(A)
    res = 0.0
    for i = 1:N
        res += A[i]*A[i]+A[i]
    end
end

BenchmarkTools.DEFAULT_PARAMETERS.samples = 10
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2

@btime sum1($A)
@btime sum2($A)
@btime sum3($A)
@btime sum4($A)
@btime sum5($A)

I get

  36.951 ms (2 allocations: 76.29 MiB)
  80.213 ms (4 allocations: 152.59 MiB)
  12.607 ms (4 allocations: 80 bytes)
  12.628 ms (3 allocations: 48 bytes)
  4.011 ms (0 allocations: 0 bytes)

I understand the first two timings (they imply a vector allocation), but not the third and fourth. Is any of these timings considered a bug? Does it mean I’m condemned to boring loops? :frowning:


#5

How fast is your sum5 if you actually return the result?

This is maybe too specialized to generalize to your real problem but you can also compare

sum6(A) = sum(x -> x * x + x, A)

#6

That’s a good point… then the speed difference vanishes. I also included an @inbounds @simd version of the loop. The final benchmark is

using BenchmarkTools

const N = 10000000
A = randn(N)

sum1(A) = sum(A.*A.+A)
sum2(A) = sum(A.*A+A)
sum3(A) = sum(A[i]*A[i]+A[i] for i in 1:N)
sum4(A) = sum(a*a+a for a in A)
function sum5(A)
    res = 0.0
    for i = 1:N
        res += A[i]*A[i]+A[i]
    end
    res
end
function sum6(A)
    res = 0.0
    @inbounds @simd for i = 1:N
        res += A[i]*A[i]+A[i]
    end
    res
end
sum7(A) = sum(x -> x * x + x, A)


BenchmarkTools.DEFAULT_PARAMETERS.samples = 10
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2

@btime sum1($A)
@btime sum2($A)
@btime sum3($A)
@btime sum4($A)
@btime sum5($A)
@btime sum6($A)
@btime sum7($A)

with result (0.6 rc)

  38.080 ms (2 allocations: 76.29 MiB)
  84.583 ms (4 allocations: 152.59 MiB)
  12.619 ms (4 allocations: 80 bytes)
  12.606 ms (3 allocations: 48 bytes)
  12.629 ms (0 allocations: 0 bytes)
  7.230 ms (0 allocations: 0 bytes)
  7.297 ms (0 allocations: 0 bytes)

(neither @inbounds nor @simd by itself is able to get the 7ms, both are needed)

Amazingly, sum(x -> x * x + x, A) is able to SIMD automatically! I don’t understand how that can happen: looking at the implementation, it falls back to mapfoldl_impl, which is a simple while loop…