Performance of custom `Vec` type versus `SVector{3, Float64}`

I’m having trouble replicating the performance of `SVector{3, Float64}` with my own custom `Vec` type. Here’s the code:

using LinearAlgebra
using StaticArrays
using BenchmarkTools

struct Vec
    x::Float64
    y::Float64
    z::Float64
end

Vec(u) = Vec(u[1], u[2], u[3])

function Base.iterate(v::Vec, state=1)
    if state == 1
        v.x, 2
    elseif state == 2
        v.y, 3
    elseif state == 3
        v.z, 4
    else
        nothing
    end
end

Base.length(v::Vec) = 3
Base.eltype(::Type{Vec}) = Float64
Base.IteratorSize(::Type{Vec}) = Base.HasLength()
Base.IteratorEltype(::Type{Vec}) = Base.HasEltype()

vs = [SVector{3}(rand(3)) for _ in 1:10000]
ps = [Vec(svec) for svec in vs]

x = SVector(2.0, 3.0, 4.0)
y = Vec(2, 3, 4)

function foo(vs, x)
    # Ref(x) shields x from broadcasting, so dot pairs each element
    # of vs with the same x.
    dot.(vs, Ref(x))
end

@btime foo($vs, $x);
@btime foo($ps, $y);

And here are the benchmark results:

julia> @btime foo($vs, $x);
  2.674 μs (2 allocations: 78.17 KiB)

julia> @btime foo($ps, $y);
  16.559 μs (2 allocations: 78.17 KiB)

Am I doing something wrong? Is there anything easy I can do to improve the performance of my Vec type? One of the guiding principles of the Julia language is that users should be able to get the same performance with custom types that they would get from Base types. However, that doesn’t seem to be the case with StaticArrays (well, I know that StaticArrays is not in Base, but it’s the same idea). They seem to have some extra magic to squeeze out the maximum performance from SVector. Or maybe it’s just that the compiler is better at optimizing NTuples?
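One quick way to test the NTuple hypothesis (here `TupleVec` is my own illustrative name, not anything from StaticArrays) is to back the struct with a tuple and see whether the compiler treats it differently:

```julia
using LinearAlgebra

# Hypothetical tuple-backed variant, to check whether the compiler
# handles an NTuple field better than three scalar fields.
struct TupleVec
    data::NTuple{3, Float64}
end

TupleVec(x, y, z) = TupleVec((x, y, z))

# Tuple broadcasting and sum are fully unrolled for a length-3 tuple,
# so this compiles down to straight-line code with no loop.
LinearAlgebra.dot(a::TupleVec, b::TupleVec) = sum(a.data .* b.data)
```

If the tuple-backed version benchmarks like SVector while the three-field version does not, that would point at the NTuple layout rather than StaticArrays-specific magic.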

Why not make it an AbstractVector?


struct Vec <: AbstractVector{Float64}
    x::Float64
    y::Float64
    z::Float64
end

Vec(u) = Vec(u[1], u[2], u[3])

function Base.getindex(v::Vec, i)
    if i == 1
        v.x
    elseif i == 2
        v.y
    elseif i == 3
        v.z
    else
        throw(BoundsError(v, i))
    end
end

Base.length(::Vec) = 3
Base.size(::Vec) = (3,)
Base.eltype(::Type{Vec}) = Float64
Base.IteratorSize(::Type{Vec}) = Base.HasLength()
Base.IteratorEltype(::Type{Vec}) = Base.HasEltype()
Base.IndexStyle(::Type{Vec}) = Base.IndexLinear()

function mydot(x, y)
    # Accumulate in the promoted element type; @inbounds skips bounds
    # checks and @simd lets LLVM vectorize the loop.
    s = zero(promote_type(eltype(x), eltype(y)))
    @inbounds @simd for i ∈ eachindex(x, y)
        s += x[i]*y[i]
    end
    s
end

LinearAlgebra.dot(x::Vec,y::Vec) = mydot(x,y)

Compare the output of @code_typed dot(y, y) with @code_typed dot(x, x).


Nice, that makes a big difference. The native code appears to be exactly the same now:

julia> @code_native dot(y, y)
	.text
; ┌ @ REPL[13]:1 within `dot`
	vmovsd	(%rdi), %xmm0                   # xmm0 = mem[0],zero
	vmovupd	8(%rdi), %xmm1
; │┌ @ REPL[10]:3 within `mydot`
; ││┌ @ simdloop.jl:77 within `macro expansion` @ REPL[10]:4
; │││┌ @ float.jl:405 within `*`
	vmulpd	8(%rsi), %xmm1, %xmm1
; │││└
; │││┌ @ float.jl:399 within `+`
	vfmadd132sd	(%rsi), %xmm1, %xmm0    # xmm0 = (xmm0 * mem) + xmm1
	vpermilpd	$1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
	vaddsd	%xmm1, %xmm0, %xmm0
; │└└└
	retq
	nop
; └

julia> @code_native dot(x, x)
	.text
; ┌ @ linalg.jl:205 within `dot`
; │┌ @ linalg.jl:218 within `_vecdot`
; ││┌ @ simdloop.jl:77 within `macro expansion` @ linalg.jl:219
; │││┌ @ generic.jl:905 within `dot`
; ││││┌ @ float.jl:405 within `*`
	vmovsd	(%rdi), %xmm0                   # xmm0 = mem[0],zero
	vmovupd	8(%rdi), %xmm1
	vmulpd	8(%rsi), %xmm1, %xmm1
; │││└└
; │││┌ @ float.jl:399 within `+`
	vfmadd132sd	(%rsi), %xmm1, %xmm0    # xmm0 = (xmm0 * mem) + xmm1
	vpermilpd	$1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
	vaddsd	%xmm1, %xmm0, %xmm0
; │└└└
	retq
	nop
; └

And the performance is better, although oddly enough Vec is still slightly slower on my machine, even though the native code appears identical:

julia> @benchmark foo($vs, $x)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  3.014 μs … 55.250 μs  ┊ GC (min … max): 0.00% … 72.93%
 Time  (median):     5.828 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.344 μs ±  3.844 μs  ┊ GC (mean ± σ):  6.16% ±  9.12%

      ▃█▁                                                     
  ▂▃▄▅███▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂ ▂
  3.01 μs        Histogram: frequency by time        33.8 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

julia> @benchmark foo($ps, $y)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  5.768 μs … 112.114 μs  ┊ GC (min … max): 0.00% … 82.16%
 Time  (median):     6.993 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.903 μs ±   6.630 μs  ┊ GC (mean ± σ):  6.83% ±  7.68%

   ▃▅▇██▇▅▃                                           ▁▁▁     ▂
  ██████████▇▅▄▃▁▁▁▄▃▄▄▁▁▁▁▁▃▃▁▁▁▅▄▅▃▃▃▁▁▁▁▁▁▁▁▁▃▄▃▆▇▇█████▇▇ █
  5.77 μs      Histogram: log(frequency) by time      19.6 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

Version info:

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_EDITOR = vim

Odd. FWIW, this fixes it for me:

julia> @btime foo($ps, $y);
  8.176 μs (2 allocations: 78.17 KiB)

julia> LinearAlgebra.dot(x::Vec,y::Vec) = muladd(x.z,y.z,muladd(x.y,y.y,x.x*y.x))

julia> @btime foo($ps, $y);
  3.958 μs (2 allocations: 78.17 KiB)

julia> @btime foo($vs, $x);
  3.932 μs (2 allocations: 78.17 KiB)

Julia/LLVM is reluctant to vectorize an outer loop whenever an inner loop is present, so manually unrolling the inner loop can make a big difference. Here, foo loops over vs; that outer loop gets SIMD’d if we manually unroll the dot, but not otherwise. =/
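To make the contrast concrete, here is a self-contained sketch (V3, dot_loop, and dot_unrolled are illustrative names, not code from this thread): both methods compute the same value, but only the unrolled one leaves no inner loop for a broadcast over an array to contend with.

```julia
struct V3
    x::Float64
    y::Float64
    z::Float64
end

# Loop-based dot: leaves an inner loop that LLVM must see through
# before it can vectorize across elements of an outer broadcast.
function dot_loop(a::V3, b::V3)
    s = 0.0
    for (p, q) in zip((a.x, a.y, a.z), (b.x, b.y, b.z))
        s += p * q
    end
    s
end

# Manually unrolled dot: a flat muladd chain with no inner loop,
# so a broadcast over a Vector{V3} can SIMD across elements.
dot_unrolled(a::V3, b::V3) = muladd(a.z, b.z, muladd(a.y, b.y, a.x * b.x))
```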


Thanks! That works for me too:

julia> @btime foo($vs, $x);
  2.899 μs (2 allocations: 78.17 KiB)

julia> @btime foo($ps, $y);
  2.664 μs (2 allocations: 78.17 KiB)

Although I wonder how StaticArrays avoids that SIMD problem, because its implementation looks pretty similar to your first implementation.
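My guess is that because an SVector’s length is part of its type, the @simd loop in StaticArrays’ dot has a constant trip count, so LLVM can fully unroll it and no inner loop survives. In any case, you can force that unrolling for a custom type yourself with a generated function; a toy sketch (MyVec and this method are mine, not StaticArrays code):

```julia
using LinearAlgebra

# Hypothetical fixed-size vector backed by a tuple.
struct MyVec{N}
    data::NTuple{N, Float64}
end

# Force complete unrolling at compile time: for each N, the generated
# function emits a flat muladd chain with no runtime loop at all.
@generated function LinearAlgebra.dot(a::MyVec{N}, b::MyVec{N}) where {N}
    ex = :(a.data[1] * b.data[1])
    for i in 2:N
        ex = :(muladd(a.data[$i], b.data[$i], $ex))
    end
    ex
end
```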
