Slower indexing on custom type for `Complex{Float64}`?

Hi,

I have a custom type, wrapping a matrix, for which I have defined the indexing behaviour.

Interestingly, indexing into the custom type is about 50% slower than indexing into the underlying matrix directly, but only when the eltype of the underlying array is Complex{Float64}. See the benchmark below.

using BenchmarkTools
# On Julia 0.7 and later, @printf also needs `using Printf`

struct Foo{T<:Number, A<:AbstractMatrix{T}}
    data::A
end

Foo(data::AbstractMatrix) = Foo{eltype(data), typeof(data)}(data)

@inline function Base.getindex(U::Foo, i::Int)
    @inbounds ret = U.data[i]
    ret
end

@inline function Base.setindex!(U::Foo, val::Number, i::Int)
    @inbounds U.data[i] = val
    val
end

Base.eachindex(U::Foo) = eachindex(U.data)

function test(a, b, c, d)
    @simd for i in eachindex(a)
        @inbounds a[i] = b[i] + c[i]*d[i]
    end
    a
end

# size
n = 128

for T in [Float16,          Float32,          Float64, 
          Complex{Float16}, Complex{Float32}, Complex{Float64}]
    a = Foo(zeros(T, n, n))
    b = Foo(zeros(T, n, n))
    c = Foo(zeros(T, n, n))
    d = Foo(zeros(T, n, n))

    t_foo = @belapsed test($a, $b, $c, $d)
    t_arr = @belapsed test($a.data, $b.data, $c.data, $d.data)

    @printf "%s %8.3f μs  %8.3f μs\n" lpad(string(T), 16) 10^6*t_foo 10^6*t_arr
end

The output of this is

         Float16  296.075 μs   287.413 μs
         Float32    4.462 μs     4.455 μs
         Float64   12.278 μs    12.284 μs
Complex{Float16} 1205.071 μs  1192.119 μs
Complex{Float32}   26.366 μs    26.055 μs
Complex{Float64}   46.057 μs    29.107 μs

Has anyone encountered this behaviour for Complex{Float64} before, or is it a known issue?

Thanks.

Davide

P.S. This also shows that Float16 is much slower than the other floating-point types. But this is not my problem!

As another data point, it seems that Complex{Float32} has the same issue on 0.7-dev.

I do not have access to v0.7, but here are better benchmarks for v0.5 and v0.6. Each figure is the best of 5 runs.

# v0.5
          Float16  272.847 μs   250.319 μs
          Float32    3.970 μs     3.909 μs
          Float64   12.078 μs    12.315 μs
 Complex{Float16} 1084.366 μs  1063.182 μs
 Complex{Float32}   34.780 μs    29.384 μs
 Complex{Float64}   28.221 μs    28.357 μs
# v0.6
          Float16  294.426 μs   287.436 μs
          Float32    3.676 μs     3.865 μs
          Float64   12.290 μs    12.071 μs
 Complex{Float16} 1205.071 μs  1191.981 μs
 Complex{Float32}   26.366 μs    26.014 μs
 Complex{Float64}   46.057 μs    26.855 μs

Smaller example that gets rid of having to define all the indexing on the type:

function test2(a, b, c, d)
    for i in 1:length(a.data)
        @inbounds a.data[i] = b.data[i] + c.data[i]*d.data[i]
    end
    a
end

This is slower for Foo than test is for Arrays. So it seems the loads of the data field are not hoisted out of the loop?
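
One way to check this, as a minimal sketch: dump the LLVM IR for test2 and look at whether the loads of the data fields show up inside the loop body rather than before it. The `using InteractiveUtils` line assumes Julia 0.7 or later (on v0.6 `@code_llvm` is available without it), and the small 8×8 inputs are only placeholders.

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL (Julia 0.7 and later)

struct Foo{T<:Number, A<:AbstractMatrix{T}}
    data::A
end

Foo(data::AbstractMatrix) = Foo{eltype(data), typeof(data)}(data)

# Same loop as test2 above: every access goes through the .data field.
function test2(a, b, c, d)
    for i in 1:length(a.data)
        @inbounds a.data[i] = b.data[i] + c.data[i]*d.data[i]
    end
    a
end

# Placeholder inputs; only the argument types matter for the generated code.
T = Complex{Float64}
a, b, c, d = (Foo(zeros(T, 8, 8)) for _ in 1:4)

# Print the LLVM IR for this method instance. Loads that are repeated
# inside the loop block have not been hoisted.
@code_llvm test2(a, b, c, d)
```

Running the same macro on the plain-array version of the loop gives something to compare against.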

Yeah, I noticed that as well.

Indexing on the type is needed for using this type in generic code that does not know it can access the .data field.
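
To illustrate what I mean, here is a minimal sketch. The function `total` is hypothetical and just stands in for any generic code that only uses eachindex and getindex, without knowing about the .data field:

```julia
struct Foo{T<:Number, A<:AbstractMatrix{T}}
    data::A
end

Foo(data::AbstractMatrix) = Foo{eltype(data), typeof(data)}(data)

# Forward linear indexing and index iteration to the wrapped matrix.
Base.getindex(U::Foo, i::Int) = U.data[i]
Base.eachindex(U::Foo) = eachindex(U.data)

# Hypothetical generic function: it never touches .data, so it works for
# anything that supports eachindex and getindex.
function total(x)
    s = zero(x[first(eachindex(x))])
    for i in eachindex(x)
        s += x[i]
    end
    s
end

m = [1.0 2.0; 3.0 4.0]
total(m)       # plain matrix
total(Foo(m))  # wrapper, same code path through getindex
```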

I know, but I was trying to reduce the code to something smaller that still exhibits the performance problem…

I created an issue for this: https://github.com/JuliaLang/julia/issues/23042

Thanks. Could you please elaborate here on the difference in the code you reported in the GitHub issue? It seems to me from your example that the culprit is the load of the .data field not being hoisted out of the loop.

Yeah, it’s the same as manually inlining your getindex and setindex! for Foo.

Did the same tests with a slightly modified get/setindex:

@inline function Base.getindex(U::Foo, i::Int)
    Udata = U.data
    @inbounds ret = Udata[i]
    ret
end

@inline function Base.setindex!(U::Foo, val::Number, i::Int)
    Udata = U.data
    @inbounds Udata[i] = val
    val
end

and it overcomes the 50% slowdown on Complex{Float64} (a slowdown of roughly 15%, perhaps less, remained). It does not solve the LLVM issue, but it might be worth using until that is fixed. The slowdown on Complex{Float32} remained substantial.

A similar change to the code in the issue eliminated the slowdown for all types (but essentially the two benchmarks then tested the same operation).
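
For reference, a minimal sketch of that kind of change applied to the test2 loop from earlier in the thread. The name `test2_hoisted` is made up here, and this is not necessarily the exact code from the issue:

```julia
struct Foo{T<:Number, A<:AbstractMatrix{T}}
    data::A
end

Foo(data::AbstractMatrix) = Foo{eltype(data), typeof(data)}(data)

# Hypothetical variant of test2: read each .data field once, before the
# loop, so the loop body only indexes local array bindings.
function test2_hoisted(a, b, c, d)
    adata, bdata, cdata, ddata = a.data, b.data, c.data, d.data
    # As in test2, all four arrays are assumed to have the same length.
    for i in 1:length(adata)
        @inbounds adata[i] = bdata[i] + cdata[i]*ddata[i]
    end
    a
end

T = Complex{Float64}
a = Foo(zeros(T, 4, 4))
b = Foo(ones(T, 4, 4))
c = Foo(fill(T(2), 4, 4))
d = Foo(fill(T(3), 4, 4))
test2_hoisted(a, b, c, d)
```

With the fields bound to locals up front, the loop is the same as the one run on plain arrays, which is why the two benchmarks end up measuring the same thing.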

The test was done on 0.7 (2017-07-22).

Unfortunately, no big difference on my setup.

Fixed on 0.7:

Float32            4.52129 μs   4.52586 μs
Float64           14.70700 μs  14.74500 μs
Complex{Float32}  18.98400 μs  19.00700 μs
Complex{Float64}  32.60400 μs  32.55700 μs

Thanks for resurrecting this post with an update. I did similar tests a few weeks back, and I vaguely remember that there was a regression on Float16 relative to v0.6. Have you seen anything like that?