# Slower indexing on custom type for `Complex{Float64}`?

Hi,

I have a custom type, wrapping a matrix, for which I have defined the indexing behaviour.

Interestingly, indexing the custom type is about 50% slower than indexing the underlying matrix directly, but only when the `eltype` of the underlying array is `Complex{Float64}`. See the benchmark below.

``````using BenchmarkTools
using Printf  # needed on Julia ≥ 0.7; on ≤ 0.6 @printf is in Base

struct Foo{T<:Number, A<:AbstractMatrix{T}}
    data::A
end

Foo(data::AbstractMatrix) = Foo{eltype(data), typeof(data)}(data)

@inline function Base.getindex(U::Foo, i::Int)
    @inbounds ret = U.data[i]
    ret
end

@inline function Base.setindex!(U::Foo, val::Number, i::Int)
    @inbounds U.data[i] = val
    val
end

Base.eachindex(U::Foo) = eachindex(U.data)

function test(a, b, c, d)
    @simd for i in eachindex(a)
        @inbounds a[i] = b[i] + c[i]*d[i]
    end
    a
end

# matrix size
n = 128

for T in [Float16,          Float32,          Float64,
          Complex{Float16}, Complex{Float32}, Complex{Float64}]
    a = Foo(zeros(T, n, n))
    b = Foo(zeros(T, n, n))
    c = Foo(zeros(T, n, n))
    d = Foo(zeros(T, n, n))

    t_foo = @belapsed test($a, $b, $c, $d)
    t_arr = @belapsed test($a.data, $b.data, $c.data, $d.data)

    @printf "%s %8.5f μs  %8.5f μs\n" lpad(string(T), 16) 10^6*t_foo 10^6*t_arr
end
``````

The output of this is

``````
         Float16  296.075 μs   287.413 μs
         Float32    4.462 μs     4.455 μs
         Float64   12.278 μs    12.284 μs
Complex{Float16} 1205.071 μs  1192.119 μs
Complex{Float32}   26.366 μs    26.055 μs
Complex{Float64}   46.057 μs    29.107 μs
``````

Have any of you encountered such behaviour with `Complex{Float64}` before, or is this a known issue?

Thanks.

Davide

P.S. This also shows that `Float16` is much slower than the other floating-point types. But that is not my problem here!

As another data point, it seems that `Complex{Float32}` has the same issue on 0.7-dev.

I do not have access to v0.7, but here are better benchmarks for v0.5 and v0.6, taking the best of 5 runs.

``````
# v0.5
         Float16  272.847 μs   250.319 μs
         Float32    3.970 μs     3.909 μs
         Float64   12.078 μs    12.315 μs
Complex{Float16} 1084.366 μs  1063.182 μs
Complex{Float32}   34.780 μs    29.384 μs
Complex{Float64}   28.221 μs    28.357 μs
``````
``````
# v0.6
         Float16  294.426 μs   287.436 μs
         Float32    3.676 μs     3.865 μs
         Float64   12.290 μs    12.071 μs
Complex{Float16} 1205.071 μs  1191.981 μs
Complex{Float32}   26.366 μs    26.014 μs
Complex{Float64}   46.057 μs    26.855 μs
``````

Here is a smaller example that avoids having to define all the indexing methods on the type:

``````function test2(a, b, c, d)
    for i in 1:length(a.data)
        @inbounds a.data[i] = b.data[i] + c.data[i]*d.data[i]
    end
    a
end
``````

This is slower for `Foo` than `test` is for plain `Array`s. So it seems the loads of the `data` field are not hoisted out of the loop?
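For comparison, here is a variant that hoists the `data` field loads into locals before the loop (a sketch; the `test3` name is mine, not from the thread):

```julia
# Same kernel as `test2`, but the `data` fields are loaded into
# locals once, before the loop, instead of on every iteration.
function test3(a, b, c, d)
    ad = a.data
    bd = b.data
    cd = c.data
    dd = d.data
    for i in 1:length(ad)
        @inbounds ad[i] = bd[i] + cd[i]*dd[i]
    end
    a
end
```

If the missing hoisting is indeed the culprit, this version should match the plain `Array` timings.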


Yeah, I noticed that as well.

Indexing on the type is needed for using this type in generic code that does not know it can access the `.data` field.
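For instance, a generic kernel written purely against the indexing interface (a hypothetical sketch; `scale!` is not from the thread) works on `Foo` and on plain arrays alike:

```julia
# A generic in-place scaling kernel: it only uses eachindex,
# getindex and setindex!, so it never touches the `.data` field
# and works for any type implementing that interface.
function scale!(a, α)
    for i in eachindex(a)
        @inbounds a[i] = α * a[i]
    end
    a
end
```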

I know, but I was trying to reduce the code to something smaller that still exhibits the performance problem…


I created an issue for this: https://github.com/JuliaLang/julia/issues/23042

Thanks. Could you please elaborate here on the difference in the code you reported on the GitHub issue? It seems to me from your example that the culprit is the `.data` field access not being hoisted out of the loop.

Yeah, it’s the same as manually inlining your `getindex` and `setindex!` for `Foo`.

Did the same tests with a slightly modified `get/setindex`:

``````@inline function Base.getindex(U::Foo, i::Int)
    Udata = U.data
    @inbounds ret = Udata[i]
    ret
end

@inline function Base.setindex!(U::Foo, val::Number, i::Int)
    Udata = U.data
    @inbounds Udata[i] = val
    val
end
``````

and it overcomes the 50% slowdown on `Complex{Float64}` (a residual slowdown of ~15% or less remained). It does not solve the underlying LLVM issue, but it might be a worthwhile workaround until that is fixed. The slowdown on `Complex{Float32}` remained substantial.

A similar change to the code in the issue eliminated the slowdown for all types (but then the two benchmarks essentially tested the same operation).

The test was done on 0.7 (2017-07-22).

Unfortunately, no big difference on my setup.

Fixed on 0.7:

``````Float32          4.52129 μs   4.52586 μs
Float64          14.70700 μs  14.74500 μs
Complex{Float32} 18.98400 μs  19.00700 μs
Complex{Float64} 32.60400 μs  32.55700 μs
``````

Thanks for resurrecting this post with an update. I ran similar tests a few weeks back, and I seem to remember there was a regression for `Float16` relative to v0.6. Have you seen anything like that?