Speed up this code game

Glen · May 12, 2017, 2:46am

Hi,

Brag points are up for grabs again. How can this code be made faster?

using BenchmarkTools
using Base.Test
struct Point{X,Y}
    x::X
    y::Y
end
function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
    T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
    area = zero(T)
    p1 = v[1]
    for i in 2:length(v)
        p2 = v[i]
        dx = p2.x - p1.x
        m = (p2.y - p1.y)/dx
        b = p1.y
        area += m/3*m*(dx*dx*dx) + b*m*(dx*dx) + b*b*(dx)
        p1 = p2
    end
    return area
end
# Test code
N = 1_000_000
pwl = [Point(x, sin(x)) for x in linspace(0, pi, N)]
@test integral_squared(pwl) ≈ pi/2  atol=0.00001

println("Benchmarking:")
@btime integral_squared(pwl)

Brag points (genuine bullion to wear around your neck):
10: 10% faster
100: 25% faster
1000: 50% faster

Being it is not an easy challenge, the points are high. Most the time is spent on the area += line. The instructions per cycle is about 0.65 (according to tiptop) and it could go as high as 4 on most systems so there is room for improvement. Hopefully we’ll learn something new.

Thanks,

Glen

yuyichao · May 12, 2017, 4:33am

That was easy. You need a Haswell+ (or any other processors with FMA support (AVX2))

The following small tweak changes the execution time on my laptop from 2.65ms to 1.61ms.

function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
    T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
    area = zero(T)
    p1 = v[1]
    @inbounds for i in 2:length(v)
        p2 = v[i]
        dx = p2.x - p1.x
        m = (p2.y - p1.y) / dx
        b = p1.y
        area += @evalpoly(dx * m, b * b, b, 1 / 3) * dx
        p1 = p2
    end
    return area
end

yuyichao · May 12, 2017, 4:52am

FWIW, I get an IPC of 2.24 and 2.47 for the old and new version on my laptop. On an old system that doesn’t support FMA, the old version got an IPC of 0.68 and execution time of ~22ms, the new version got an IPC of 0.97 and execution timem of ~12.4ms.

GunnarFarneback · May 12, 2017, 6:14am

This is about 4-5 times faster on my computer.

function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
    T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
    area = zero(T)
    p1 = v[1]
    third = 1/3
    for i in 2:length(v)
        p2 = v[i]
        dx = p2.x - p1.x
        m = (p2.y - p1.y)
        b = p1.y
        area += dx * (m * m * third + b * (m + b))
        p1 = p2
    end
    return area
end

traktofon · May 12, 2017, 6:18am

Avoid divisions! Especially, avoid dividing by something which you later multiply by again.

const third = 1/3
function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
    area = zero(oneunit(X)*oneunit(Y)*oneunit(Y))
    p1 = v[1]
    @inbounds for i in 2:length(v)
        p2 = v[i]
        dx = p2.x - p1.x
        m  = p2.y - p1.y
        b  = p1.y
        area += (third*m*m + (m+b)*b)*dx
        p1 = p2
    end
    return area
end

Note, code_warntype told me that T is Any though the type of area was inferred correctly. So I rewrote the initialisation of area, but this makes no performance difference.

Timings:

original: 3.411 ms (1 allocation: 16 bytes)
@yuyichao’s: 2.468 ms (1 allocation: 16 bytes)
mine: 1.748 ms (1 allocation: 16 bytes)

Version Info:

Julia Version 0.6.0-rc1.0
Commit 6bdb395 (2017-05-07 00:00 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core™ i5-7200U CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

antoine-levitt · May 12, 2017, 6:42am

As a curiosity, here’s the impact of a few tweaks on @traktofon’s code (0.6 beta, Haswell processor)

@fastmath shaves another 10% (at the expense of a slightly different result)
precomputing 1/3 as “third” has no effect
removing p1 and p2 to adress v directly has no effect
using muladd has no effect

Glen · May 12, 2017, 2:28pm

Thanks everybody. Lots of points taken! I found that the last term in the area equation can be further simplified. Here’s what I have so far:

function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
    T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
    area = zero(T)
    p1 = v[1]
    x1 = p1.x
    y1 = p1.y
    third = 1/3
    for i in 2:length(v)
        p2 = v[i]
        x2 = p2.x
        y2 = p2.y
        dx = x2 - x1
        dy = y2 - y1
        area += dx*(dy*dy*third + y1*y2)
        x1 = x2
        y1 = y2
    end
    return area
end

My IPC is now about 1.8 and instead of 14.5 ms it is down to 3 ms . Now that the math has no divisions and more symmetric, could SIMD.jl be used to vectorize this further? Looking at the docs I’m not sure how to map this into SIMD.

Glen

GunnarFarneback · May 12, 2017, 3:31pm

Here is something to get you started with SIMD.jl

using SIMD
function integral_simd(x::Vector{T}, y::Vector{T}, ::Type{Vec{N, T}}) where {N, T}
    @assert length(x) == length(y)
    @assert length(x) % N == 1
    area = Vec{N, T}(0)
    third = Vec{N, T}(1 / 3)
    for i = 1:N:(length(x)-1)
        x1 = vload(Vec{N, T}, x, i)
        y1 = vload(Vec{N, T}, y, i)
        x2 = vload(Vec{N, T}, x, i+1)
        y2 = vload(Vec{N, T}, y, i+1)
        dx = x2 - x1
        dy = y2 - y1
        area += dx * (dy * dy * third + y1 * y2)
        x1 = x2
        y1 = y2
    end
    return sum(area)
end

N = 1_000_001
x = map(Float32, collect(linspace(0, pi, N)))
y = sin.(x)
@test integral_simd(x, y, Vec{8, Float32}) ≈ pi/2  atol=0.00001
@btime integral_simd(x, y, Vec{8, Float32})

Edit: Remove the x1 = x2 and y1 = y2 lines. They mistakenly survived from an earlier iteration.

Glen · May 13, 2017, 3:43pm

That is a very helpful example, thank-you. I didn’t think SIMD would be that easy. I guess having Point{X,Y} would not work with SIMD?

Here are my timings from this thread:

Original: 14.5 ms
Post 7 + @inbounds: 3.1 ms
Post 8 64-bit: 2.0 ms (SIMD)

That is over a 7x speed-up ! Also if 32-bit SIMD is acceptable then the time shrinks to 0.73 ms (about 20x faster).

I’ve learned a lot. Thanks everybody.

Glen

Ralph_Smith · May 13, 2017, 6:48pm

A multi-threading version, just for completeness.

using Base.Threads

function integral_mt{T}(x::Vector{T},y::Vector{T})
    area = zero(T)
    third = T(1/3)
    nsub = 512
    nouter = ceil(Int,length(x)/nsub)
    nt = nthreads()
    accs = [zeros(T,nsub) for i in 1:nt]
    @threads for i in 0:nouter-1
        @inbounds begin
        id = threadid()
        acc = accs[id]
        ioff = nsub * i
        ninner = ifelse(i==nouter-1,length(x)-ioff-1,nsub)
        @simd for j=1:ninner
            x1 = x[ioff+j]
            y1 = y[ioff+j]
            x2 = x[ioff+j+1]
            y2 = y[ioff+j+1]
            dx = x2 - x1
            dy = y2 - y1
            acc[j] += dx*(dy*dy*third + y1*y2)
        end
        end
    end
    area = sum(map(sum,accs))
    return area
end

Benchmark result (8 threads on 8 cores, Float64):

  120.479 μs (12 allocations: 33.38 KiB)

Comments:

benchmark shows that problem fits in L3 cache on my system, which might be cheating
above result required a quiet system (e.g., no Firefox running)
it takes some work to avoid cache contention - the overhead makes this slower for 1 thread than post 8 above
LLVM does the floating-point SIMD here even without the SIMD.jl package
it would require gymnastics to use SIMD wide loads with the Point structures; without them there is a severe penalty vs. vectors of floats.

purplishrock · May 14, 2017, 2:14pm

What’s up with that struct definition ?
i was looking in the Julia docs for it and can’t seem to find it.

crbinz · May 14, 2017, 3:05pm

It’s the new syntax for composite types in julia v0.6.

Glen · May 17, 2017, 2:17am

Hi,

Can anyone provide a GPU or Parallel Accelerator example, etc?

Ralph, your code gives me about a 3.5x speed-up on 4 threads (JULIA_NUM_THREADS=4), very impressive! Can you explain why you chose nsub=512 and looping structure? I don’t think your code is using SIMD at all (from looking at @code_llvm and @code_native) yet with one thread the performance is the same as post 8 with explicit SIMD (2ms). I’m not sure why that is.

As for SIMD with the Vector{Point{T,T}} I was able to produce this:

function integral_simd(v::Vector{Point{T,T}}, ::Type{Vec{8, T}}) where {T}
    N = 8
    @assert length(v) % N == 1
    area = Vec{N, T}(0)
    third = Vec{N, T}(1 / 3)
    @inbounds for i = 1:N:(length(v)-1)
        x1 = Vec{N, T}((v[i+0].x, v[i+1].x, v[i+2].x, v[i+3].x, v[i+4].x, v[i+5].x, v[i+6].x, v[i+7].x))
        y1 = Vec{N, T}((v[i+0].y, v[i+1].y, v[i+2].y, v[i+3].y, v[i+4].y, v[i+5].y, v[i+6].y, v[i+7].y))
        x2 = Vec{N, T}((v[i+1].x, v[i+2].x, v[i+3].x, v[i+4].x, v[i+5].x, v[i+6].x, v[i+7].x, v[i+8].x))
        y2 = Vec{N, T}((v[i+1].y, v[i+2].y, v[i+3].y, v[i+4].y, v[i+5].y, v[i+6].y, v[i+7].y, v[i+8].y))
        dx = x2 - x1
        dy = y2 - y1
        area += dx * (dy * dy * third + y1 * y2)
    end
    return sum(area)
end

The speed is 2.2 ms which is about 10% slower than separate x and y vectors (post 8) but still a decent speed-up vs without SIMD (post 7). I couldn’t figure out a way to make N a variable without degrading the performance (eg using Vec{N,T}(ntuple(i->v[i].x, N)) is about 500x slower).

Glen

Ralph_Smith · May 17, 2017, 3:37am

The nested loops (“tiling” in the parallel programming lingo, I think) get working variables into the lower level caches. 512 was a guess based on my experience with the machines I use, and worked well enough here.

You don’t see the SIMD instructions because the real work is done in a closure written by the threading macro, which is not shown by the code tools. The inner loop should be almost the same code as is generated by the serial version (dropping the @threads macro), where the SIMD instructions should be visible.

Topic		Replies	Views
Reducing allocations to optimize (to match Java speed) Performance	22	1341	December 10, 2018
Usage of lazy vector instead of function Performance	22	946	November 7, 2022
How to speed up this simple code? Multithreading, simd, inbounds Performance	39	7024	January 29, 2019
Simpsons Rule New to Julia question , integral , algorithm	23	3167	September 5, 2022
Optimise calculation with NxNxMxM matrix Performance	17	655	June 22, 2023

Speed up this code game

Related topics