# Speed up this code game

#1

Hi,

Brag points are up for grabs again. How can this code be made faster?

``````using BenchmarkTools
using Base.Test
struct Point{X,Y}
x::X
y::Y
end
function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
area = zero(T)
p1 = v[1]
for i in 2:length(v)
p2 = v[i]
dx = p2.x - p1.x
m = (p2.y - p1.y)/dx
b = p1.y
area += m/3*m*(dx*dx*dx) + b*m*(dx*dx) + b*b*(dx)
p1 = p2
end
return area
end
# Test code
N = 1_000_000
pwl = [Point(x, sin(x)) for x in linspace(0, pi, N)]
@test integral_squared(pwl) ā pi/2  atol=0.00001

println("Benchmarking:")
@btime integral_squared(pwl)
``````

Brag points (genuine bullion to wear around your neck):
10: 10% faster
100: 25% faster
1000: 50% faster

Being it is not an easy challenge, the points are high. Most the time is spent on the `area +=` line. The instructions per cycle is about 0.65 (according to `tiptop`) and it could go as high as 4 on most systems so there is room for improvement. Hopefully weāll learn something new.

Thanks,

Glen

#2

That was easy. You need a Haswell+ (or any other processors with FMA support (AVX2))

The following small tweak changes the execution time on my laptop from 2.65ms to 1.61ms.

``````function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
area = zero(T)
p1 = v[1]
@inbounds for i in 2:length(v)
p2 = v[i]
dx = p2.x - p1.x
m = (p2.y - p1.y) / dx
b = p1.y
area += @evalpoly(dx * m, b * b, b, 1 / 3) * dx
p1 = p2
end
return area
end
``````

#3

FWIW, I get an IPC of 2.24 and 2.47 for the old and new version on my laptop. On an old system that doesnāt support FMA, the old version got an IPC of 0.68 and execution time of ~22ms, the new version got an IPC of 0.97 and execution timem of ~12.4ms.

#4

This is about 4-5 times faster on my computer.

``````function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
area = zero(T)
p1 = v[1]
third = 1/3
for i in 2:length(v)
p2 = v[i]
dx = p2.x - p1.x
m = (p2.y - p1.y)
b = p1.y
area += dx * (m * m * third + b * (m + b))
p1 = p2
end
return area
end
``````

#5

Avoid divisions! Especially, avoid dividing by something which you later multiply by again.

``````const third = 1/3
function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
area = zero(oneunit(X)*oneunit(Y)*oneunit(Y))
p1 = v[1]
@inbounds for i in 2:length(v)
p2 = v[i]
dx = p2.x - p1.x
m  = p2.y - p1.y
b  = p1.y
area += (third*m*m + (m+b)*b)*dx
p1 = p2
end
return area
end
``````

Note, `code_warntype` told me that `T` is `Any` though the type of `area` was inferred correctly. So I rewrote the initialisation of area, but this makes no performance difference.

Timings:

• original: 3.411 ms (1 allocation: 16 bytes)
• @yuyichaoās: 2.468 ms (1 allocation: 16 bytes)
• mine: 1.748 ms (1 allocation: 16 bytes)

Version Info:

Julia Version 0.6.0-rc1.0
Commit 6bdb395 (2017-05-07 00:00 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: IntelĀ® Coreā¢ i5-7200U CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
LAPACK: libopenblas64_
LIBM: libopenlibm

#6

As a curiosity, hereās the impact of a few tweaks on @traktofonās code (0.6 beta, Haswell processor)

• `@fastmath` shaves another 10% (at the expense of a slightly different result)
• precomputing 1/3 as āthirdā has no effect
• removing p1 and p2 to adress v directly has no effect
• using `muladd` has no effect

#7

Thanks everybody. Lots of points taken! I found that the last term in the `area` equation can be further simplified. Hereās what I have so far:

``````function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
area = zero(T)
p1 = v[1]
x1 = p1.x
y1 = p1.y
third = 1/3
for i in 2:length(v)
p2 = v[i]
x2 = p2.x
y2 = p2.y
dx = x2 - x1
dy = y2 - y1
area += dx*(dy*dy*third + y1*y2)
x1 = x2
y1 = y2
end
return area
end
``````

My IPC is now about 1.8 and instead of 14.5 ms it is down to 3 ms . Now that the math has no divisions and more symmetric, could SIMD.jl be used to vectorize this further? Looking at the docs Iām not sure how to map this into SIMD.

Glen

#8

Here is something to get you started with SIMD.jl

``````using SIMD
function integral_simd(x::Vector{T}, y::Vector{T}, ::Type{Vec{N, T}}) where {N, T}
@assert length(x) == length(y)
@assert length(x) % N == 1
area = Vec{N, T}(0)
third = Vec{N, T}(1 / 3)
for i = 1:N:(length(x)-1)
x1 = vload(Vec{N, T}, x, i)
y1 = vload(Vec{N, T}, y, i)
x2 = vload(Vec{N, T}, x, i+1)
y2 = vload(Vec{N, T}, y, i+1)
dx = x2 - x1
dy = y2 - y1
area += dx * (dy * dy * third + y1 * y2)
x1 = x2
y1 = y2
end
return sum(area)
end

N = 1_000_001
x = map(Float32, collect(linspace(0, pi, N)))
y = sin.(x)
@test integral_simd(x, y, Vec{8, Float32}) ā pi/2  atol=0.00001
@btime integral_simd(x, y, Vec{8, Float32})
``````

Edit: Remove the `x1 = x2` and `y1 = y2` lines. They mistakenly survived from an earlier iteration.

#9

That is a very helpful example, thank-you. I didnāt think SIMD would be that easy. I guess having `Point{X,Y}` would not work with SIMD?

Here are my timings from this thread:

Original: 14.5 ms
Post 7 + @inbounds: 3.1 ms
Post 8 64-bit: 2.0 ms (SIMD)

That is over a 7x speed-up ! Also if 32-bit SIMD is acceptable then the time shrinks to 0.73 ms (about 20x faster).

Iāve learned a lot. Thanks everybody.

Glen

#10

A multi-threading version, just for completeness.

``````using Base.Threads

function integral_mt{T}(x::Vector{T},y::Vector{T})
area = zero(T)
third = T(1/3)
nsub = 512
nouter = ceil(Int,length(x)/nsub)
accs = [zeros(T,nsub) for i in 1:nt]
@inbounds begin
acc = accs[id]
ioff = nsub * i
ninner = ifelse(i==nouter-1,length(x)-ioff-1,nsub)
@simd for j=1:ninner
x1 = x[ioff+j]
y1 = y[ioff+j]
x2 = x[ioff+j+1]
y2 = y[ioff+j+1]
dx = x2 - x1
dy = y2 - y1
acc[j] += dx*(dy*dy*third + y1*y2)
end
end
end
area = sum(map(sum,accs))
return area
end
``````

Benchmark result (8 threads on 8 cores, `Float64`):

``````  120.479 Ī¼s (12 allocations: 33.38 KiB)
``````

• benchmark shows that problem fits in L3 cache on my system, which might be cheating
• above result required a quiet system (e.g., no Firefox running)
• it takes some work to avoid cache contention - the overhead makes this slower for 1 thread than post 8 above
• LLVM does the floating-point SIMD here even without the SIMD.jl package
• it would require gymnastics to use SIMD wide loads with the `Point` structures; without them there is a severe penalty vs. vectors of floats.

#11

Whatās up with that struct definition ?
i was looking in the Julia docs for it and canāt seem to find it.

#12

Itās the new syntax for composite types in julia v0.6.

#13

Hi,

Can anyone provide a GPU or Parallel Accelerator example, etc?

Ralph, your code gives me about a 3.5x speed-up on 4 threads (JULIA_NUM_THREADS=4), very impressive! Can you explain why you chose `nsub=512` and looping structure? I donāt think your code is using SIMD at all (from looking at `@code_llvm` and `@code_native`) yet with one thread the performance is the same as post 8 with explicit SIMD (2ms). Iām not sure why that is.

As for SIMD with the `Vector{Point{T,T}}` I was able to produce this:

``````function integral_simd(v::Vector{Point{T,T}}, ::Type{Vec{8, T}}) where {T}
N = 8
@assert length(v) % N == 1
area = Vec{N, T}(0)
third = Vec{N, T}(1 / 3)
@inbounds for i = 1:N:(length(v)-1)
x1 = Vec{N, T}((v[i+0].x, v[i+1].x, v[i+2].x, v[i+3].x, v[i+4].x, v[i+5].x, v[i+6].x, v[i+7].x))
y1 = Vec{N, T}((v[i+0].y, v[i+1].y, v[i+2].y, v[i+3].y, v[i+4].y, v[i+5].y, v[i+6].y, v[i+7].y))
x2 = Vec{N, T}((v[i+1].x, v[i+2].x, v[i+3].x, v[i+4].x, v[i+5].x, v[i+6].x, v[i+7].x, v[i+8].x))
y2 = Vec{N, T}((v[i+1].y, v[i+2].y, v[i+3].y, v[i+4].y, v[i+5].y, v[i+6].y, v[i+7].y, v[i+8].y))
dx = x2 - x1
dy = y2 - y1
area += dx * (dy * dy * third + y1 * y2)
end
return sum(area)
end
``````

The speed is 2.2 ms which is about 10% slower than separate `x` and `y` vectors (post 8) but still a decent speed-up vs without SIMD (post 7). I couldnāt figure out a way to make `N` a variable without degrading the performance (eg using `Vec{N,T}(ntuple(i->v[i].x, N))` is about 500x slower).

Glen

#14

The nested loops (ātilingā in the parallel programming lingo, I think) get working variables into the lower level caches. 512 was a guess based on my experience with the machines I use, and worked well enough here.

You donāt see the SIMD instructions because the real work is done in a closure written by the threading macro, which is not shown by the code tools. The inner loop should be almost the same code as is generated by the serial version (dropping the `@threads` macro), where the SIMD instructions should be visible.