Slower indexing on custom type for `Complex{Float64}`?


#1

Hi,

I have a custom type, wrapping a matrix, for which I have defined the indexing behaviour.

The interesting fact is that indexing on the custom type is 50% slower than indexing on the underlying matrix directly, but only when the eltype of the underlying array is Complex{Float64}. See the benchmark below.

using BenchmarkTools

struct Foo{T<:Number, A<:AbstractMatrix{T}}
    data::A
end

Foo(data::AbstractMatrix) = Foo{eltype(data), typeof(data)}(data)

@inline function Base.getindex(U::Foo, i::Int)
    @inbounds ret = U.data[i]
    ret
end

@inline function Base.setindex!(U::Foo, val::Number, i::Int)
    @inbounds U.data[i] = val
    val
end

Base.eachindex(U::Foo) = eachindex(U.data)

function test(a, b, c, d)
    @simd for i in eachindex(a)
        @inbounds a[i] = b[i] + c[i]*d[i]
    end
    a
end

# size
n = 128

for T in [Float16,          Float32,          Float64, 
          Complex{Float16}, Complex{Float32}, Complex{Float64}]
    a = Foo(zeros(T, n, n))
    b = Foo(zeros(T, n, n))
    c = Foo(zeros(T, n, n))
    d = Foo(zeros(T, n, n))

    t_foo = @belapsed test($a, $b, $c, $d)
    t_arr = @belapsed test($a.data, $b.data, $c.data, $d.data)

    @printf "%s %8.3f μs  %8.3f μs\n" lpad(string(T), 16) 10^6*t_foo 10^6*t_arr
end

The output of this is

         Float16  296.075 μs   287.413 μs
         Float32    4.462 μs     4.455 μs
         Float64   12.278 μs    12.284 μs
Complex{Float16} 1205.071 μs  1192.119 μs
Complex{Float32}   26.366 μs    26.055 μs
Complex{Float64}   46.057 μs    29.107 μs

Have any of you encountered this behaviour for Complex{Float64} before, or is it a known issue?

Thanks.

Davide

P.S. This also shows that Float16 is much slower than the other floating point types. But that is not my problem here!


#2

As another data point, it seems that Complex{Float32} has the same issue on 0.7-dev.


#3

I do not have access to v0.7, but here are better benchmarks for v0.5 and v0.6, each taking the best of 5 runs.

# v0.5
          Float16  272.847 μs   250.319 μs
          Float32    3.970 μs     3.909 μs
          Float64   12.078 μs    12.315 μs
 Complex{Float16} 1084.366 μs  1063.182 μs
 Complex{Float32}   34.780 μs    29.384 μs
 Complex{Float64}   28.221 μs    28.357 μs
# v0.6
          Float16  294.426 μs   287.436 μs
          Float32    3.676 μs     3.865 μs
          Float64   12.290 μs    12.071 μs
 Complex{Float16} 1205.071 μs  1191.981 μs
 Complex{Float32}   26.366 μs    26.014 μs
 Complex{Float64}   46.057 μs    26.855 μs

#4

Here is a smaller example that avoids having to define all the indexing on the type:

function test2(a, b, c, d)
    for i in 1:length(a.data)
        @inbounds a.data[i] = b.data[i] + c.data[i]*d.data[i]
    end
    a
end

This is slower for Foo than test is for plain Arrays. So it seems the loads of the data field are not hoisted out of the loop?
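If the field load really is the problem, a variant that reads the wrapped arrays once, before the loop, should recover the speed. A minimal sketch (the name `test2_hoisted` is mine, not from the thread):

```julia
function test2_hoisted(a, b, c, d)
    # load the wrapped arrays once, outside the loop,
    # so the loop body only indexes plain Arrays
    ad, bd, cd, dd = a.data, b.data, c.data, d.data
    for i in 1:length(ad)
        @inbounds ad[i] = bd[i] + cd[i]*dd[i]
    end
    a
end
```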


#5

Yeah, I noticed that as well.

Indexing on the type is needed for using this type in generic code that does not know it can access the .data field.
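For example, with getindex, setindex! and eachindex defined, a generic kernel like the following works on Foo without ever touching the .data field (a hypothetical sketch; `scaleinto!` is not from the thread):

```julia
# generic kernel: uses only indexing and eachindex,
# so it works on Foo, Array, or any type defining that interface
function scaleinto!(y, α, x)
    @inbounds for i in eachindex(y)
        y[i] = α * x[i]
    end
    y
end
```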


#6

I know, but I was trying to reduce the code to something smaller that still exhibits the performance problem…


#7

I created an issue for this: https://github.com/JuliaLang/julia/issues/23042


#8

Thanks. Could you please elaborate here on the difference in the code you reported on the GitHub issue? It seems to me from your example that the culprit is that the load of the .data field is not hoisted out of the loop.


#9

Yeah, it’s the same as manually inlining your getindex and setindex! for Foo.


#10

Did the same tests with a slightly modified get/setindex:

@inline function Base.getindex(U::Foo, i::Int)
    Udata = U.data
    @inbounds ret = Udata[i]
    ret
end

@inline function Base.setindex!(U::Foo, val::Number, i::Int)
    Udata = U.data
    @inbounds Udata[i] = val
    val
end

and it overcomes the 50% slowdown on Complex{Float64} (a slowdown of ~15%, or perhaps less, remained). It does not solve the LLVM issue, but might be worth using until that is fixed. The slowdown on Complex{Float32} remained substantial.

A similar change to the code in the issue eliminated the slowdown for all types (but then the two benchmarks were essentially testing the same operation).

The test was done on 0.7 (2017-07-22).


#11

Unfortunately, no big difference on my setup.


#12

Fixed on 0.7:

Float32          4.52129 μs   4.52586 μs
Float64          14.70700 μs  14.74500 μs
Complex{Float32} 18.98400 μs  19.00700 μs
Complex{Float64} 32.60400 μs  32.55700 μs

#13

Thanks for resurrecting this post with an update. I did similar tests a few weeks back, and I seem to remember that there was a regression for Float16 relative to v0.6. Have you seen anything like that?