Speed up this code game


#1

Hi,

Brag points are up for grabs again. How can this code be made faster?

using BenchmarkTools
using Base.Test
struct Point{X,Y}
    x::X
    y::Y
end
function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
    T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
    area = zero(T)
    p1 = v[1]
    for i in 2:length(v)
        p2 = v[i]
        dx = p2.x - p1.x
        m = (p2.y - p1.y)/dx
        b = p1.y
        area += m/3*m*(dx*dx*dx) + b*m*(dx*dx) + b*b*(dx)
        p1 = p2
    end
    return area
end
# Test code
N = 1_000_000
pwl = [Point(x, sin(x)) for x in linspace(0, pi, N)]
@test integral_squared(pwl) ≈ pi/2  atol=0.00001

println("Benchmarking:")
@btime integral_squared(pwl)

Brag points (genuine bullion to wear around your neck):
10: 10% faster
100: 25% faster
1000: 50% faster

Being it is not an easy challenge, the points are high. Most the time is spent on the area += line. The instructions per cycle is about 0.65 (according to tiptop) and it could go as high as 4 on most systems so there is room for improvement. Hopefully we’ll learn something new.

Thanks,

Glen


#2

That was easy. You need a Haswell+ (or any other processors with FMA support (AVX2))

The following small tweak changes the execution time on my laptop from 2.65ms to 1.61ms.

function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
    T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
    area = zero(T)
    p1 = v[1]
    @inbounds for i in 2:length(v)
        p2 = v[i]
        dx = p2.x - p1.x
        m = (p2.y - p1.y) / dx
        b = p1.y
        area += @evalpoly(dx * m, b * b, b, 1 / 3) * dx
        p1 = p2
    end
    return area
end

#3

FWIW, I get an IPC of 2.24 and 2.47 for the old and new version on my laptop. On an old system that doesn’t support FMA, the old version got an IPC of 0.68 and execution time of ~22ms, the new version got an IPC of 0.97 and execution timem of ~12.4ms.


#4

This is about 4-5 times faster on my computer.

function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
    T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
    area = zero(T)
    p1 = v[1]
    third = 1/3
    for i in 2:length(v)
        p2 = v[i]
        dx = p2.x - p1.x
        m = (p2.y - p1.y)
        b = p1.y
        area += dx * (m * m * third + b * (m + b))
        p1 = p2
    end
    return area
end

#5

Avoid divisions! Especially, avoid dividing by something which you later multiply by again. :wink:

const third = 1/3
function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
    area = zero(oneunit(X)*oneunit(Y)*oneunit(Y))
    p1 = v[1]
    @inbounds for i in 2:length(v)
        p2 = v[i]
        dx = p2.x - p1.x
        m  = p2.y - p1.y
        b  = p1.y
        area += (third*m*m + (m+b)*b)*dx
        p1 = p2
    end
    return area
end

Note, code_warntype told me that T is Any though the type of area was inferred correctly. So I rewrote the initialisation of area, but this makes no performance difference.

Timings:

  • original: 3.411 ms (1 allocation: 16 bytes)
  • @yuyichao’s: 2.468 ms (1 allocation: 16 bytes)
  • mine: 1.748 ms (1 allocation: 16 bytes)

Version Info:

Julia Version 0.6.0-rc1.0
Commit 6bdb395 (2017-05-07 00:00 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel® Core™ i5-7200U CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)


#6

As a curiosity, here’s the impact of a few tweaks on @traktofon’s code (0.6 beta, Haswell processor)

  • @fastmath shaves another 10% (at the expense of a slightly different result)
  • precomputing 1/3 as “third” has no effect
  • removing p1 and p2 to adress v directly has no effect
  • using muladd has no effect

#7

Thanks everybody. Lots of points taken! I found that the last term in the area equation can be further simplified. Here’s what I have so far:

function integral_squared(v::Vector{Point{X,Y}}) where {X,Y}
    T = eltype(oneunit(X)*oneunit(Y)*oneunit(Y))
    area = zero(T)
    p1 = v[1]
    x1 = p1.x
    y1 = p1.y
    third = 1/3
    for i in 2:length(v)
        p2 = v[i]
        x2 = p2.x
        y2 = p2.y
        dx = x2 - x1
        dy = y2 - y1
        area += dx*(dy*dy*third + y1*y2)
        x1 = x2
        y1 = y2
    end
    return area
end

My IPC is now about 1.8 and instead of 14.5 ms it is down to 3 ms :tada:. Now that the math has no divisions and more symmetric, could SIMD.jl be used to vectorize this further? Looking at the docs I’m not sure how to map this into SIMD.

Glen


#8

Here is something to get you started with SIMD.jl

using SIMD
function integral_simd(x::Vector{T}, y::Vector{T}, ::Type{Vec{N, T}}) where {N, T}
    @assert length(x) == length(y)
    @assert length(x) % N == 1
    area = Vec{N, T}(0)
    third = Vec{N, T}(1 / 3)
    for i = 1:N:(length(x)-1)
        x1 = vload(Vec{N, T}, x, i)
        y1 = vload(Vec{N, T}, y, i)
        x2 = vload(Vec{N, T}, x, i+1)
        y2 = vload(Vec{N, T}, y, i+1)
        dx = x2 - x1
        dy = y2 - y1
        area += dx * (dy * dy * third + y1 * y2)
        x1 = x2
        y1 = y2
    end
    return sum(area)
end

N = 1_000_001
x = map(Float32, collect(linspace(0, pi, N)))
y = sin.(x)
@test integral_simd(x, y, Vec{8, Float32}) ≈ pi/2  atol=0.00001
@btime integral_simd(x, y, Vec{8, Float32})

Edit: Remove the x1 = x2 and y1 = y2 lines. They mistakenly survived from an earlier iteration.


#9

That is a very helpful example, thank-you. I didn’t think SIMD would be that easy. I guess having Point{X,Y} would not work with SIMD?

Here are my timings from this thread:

Original: 14.5 ms
Post 7 + @inbounds: 3.1 ms
Post 8 64-bit: 2.0 ms (SIMD)

That is over a 7x speed-up :tada:! Also if 32-bit SIMD is acceptable then the time shrinks to 0.73 ms (about 20x faster).

I’ve learned a lot. Thanks everybody.

Glen


#10

A multi-threading version, just for completeness.

using Base.Threads

function integral_mt{T}(x::Vector{T},y::Vector{T})
    area = zero(T)
    third = T(1/3)
    nsub = 512
    nouter = ceil(Int,length(x)/nsub)
    nt = nthreads()
    accs = [zeros(T,nsub) for i in 1:nt]
    @threads for i in 0:nouter-1
        @inbounds begin
        id = threadid()
        acc = accs[id]
        ioff = nsub * i
        ninner = ifelse(i==nouter-1,length(x)-ioff-1,nsub)
        @simd for j=1:ninner
            x1 = x[ioff+j]
            y1 = y[ioff+j]
            x2 = x[ioff+j+1]
            y2 = y[ioff+j+1]
            dx = x2 - x1
            dy = y2 - y1
            acc[j] += dx*(dy*dy*third + y1*y2)
        end
        end
    end
    area = sum(map(sum,accs))
    return area
end

Benchmark result (8 threads on 8 cores, Float64):

  120.479 μs (12 allocations: 33.38 KiB)

Comments:

  • benchmark shows that problem fits in L3 cache on my system, which might be cheating
  • above result required a quiet system (e.g., no Firefox running)
  • it takes some work to avoid cache contention - the overhead makes this slower for 1 thread than post 8 above
  • LLVM does the floating-point SIMD here even without the SIMD.jl package
  • it would require gymnastics to use SIMD wide loads with the Point structures; without them there is a severe penalty vs. vectors of floats.

#11

What’s up with that struct definition ?
i was looking in the Julia docs for it and can’t seem to find it.


#12

It’s the new syntax for composite types in julia v0.6.


#13

Hi,

Can anyone provide a GPU or Parallel Accelerator example, etc?

Ralph, your code gives me about a 3.5x speed-up on 4 threads (JULIA_NUM_THREADS=4), very impressive! Can you explain why you chose nsub=512 and looping structure? I don’t think your code is using SIMD at all (from looking at @code_llvm and @code_native) yet with one thread the performance is the same as post 8 with explicit SIMD (2ms). I’m not sure why that is.

As for SIMD with the Vector{Point{T,T}} I was able to produce this:

function integral_simd(v::Vector{Point{T,T}}, ::Type{Vec{8, T}}) where {T}
    N = 8
    @assert length(v) % N == 1
    area = Vec{N, T}(0)
    third = Vec{N, T}(1 / 3)
    @inbounds for i = 1:N:(length(v)-1)
        x1 = Vec{N, T}((v[i+0].x, v[i+1].x, v[i+2].x, v[i+3].x, v[i+4].x, v[i+5].x, v[i+6].x, v[i+7].x))
        y1 = Vec{N, T}((v[i+0].y, v[i+1].y, v[i+2].y, v[i+3].y, v[i+4].y, v[i+5].y, v[i+6].y, v[i+7].y))
        x2 = Vec{N, T}((v[i+1].x, v[i+2].x, v[i+3].x, v[i+4].x, v[i+5].x, v[i+6].x, v[i+7].x, v[i+8].x))
        y2 = Vec{N, T}((v[i+1].y, v[i+2].y, v[i+3].y, v[i+4].y, v[i+5].y, v[i+6].y, v[i+7].y, v[i+8].y))
        dx = x2 - x1
        dy = y2 - y1
        area += dx * (dy * dy * third + y1 * y2)
    end
    return sum(area)
end

The speed is 2.2 ms which is about 10% slower than separate x and y vectors (post 8) but still a decent speed-up vs without SIMD (post 7). I couldn’t figure out a way to make N a variable without degrading the performance (eg using Vec{N,T}(ntuple(i->v[i].x, N)) is about 500x slower).

Glen


#14

The nested loops (“tiling” in the parallel programming lingo, I think) get working variables into the lower level caches. 512 was a guess based on my experience with the machines I use, and worked well enough here.

You don’t see the SIMD instructions because the real work is done in a closure written by the threading macro, which is not shown by the code tools. The inner loop should be almost the same code as is generated by the serial version (dropping the @threads macro), where the SIMD instructions should be visible.