Choice between abstract type and union of vectors of types to feed functions that operate on vectors of more than one type

I have two data structures, and some functions that have to operate on vectors of those data structures. One option is to define an abstract type as a supertype of both data structures. The other option is to create a new type which is the union of both types. A minimal example is:

abstract type C end
struct A <: C
    a :: Int64
end
struct B <: C
  a :: Int64
end

f( x :: Vector{<:C} ) = x[1].a

# or

AorB = Union{Vector{A},Vector{B}}
g(x :: AorB) = x[1].a

These things are NOT the same, because f accepts vectors with mixed A and B types, but g does not.

In my case, I will never have a vector of mixed types, thus it seems that the Union approach would allow more specialization than the abstract type approach. At the same time, I have tested and, when feeding the functions with vectors of only one of the types, the @code_typed is identical.

Is any of the approaches any better for any reason I cannot foresee?

Additionally, I noted that if I feed the function f with a mixed vector, the code generated does not specialize at all for the type of the a fields of A and B:

julia> y = [ A(1), B(2) ]
2-element Array{C,1}:
 A(1)
 B(2)

julia> @code_typed f(y)
CodeInfo(
1 ─ %1 = Base.arrayref(true, x, 1)::C
β”‚   %2 = Base.getfield(%1, :a)::Any
└──      return %2
) => Any

Seems to me that there is probably a way to inform the compiler that the type of field a of any subtype of supertype C is an Int64. Is there one?

Thank you.

If you want to take advantage of union splitting make it a Union.

3 Likes

I would say that the Union approach has the advantage (that you already perceived) of checking and throwing an error if you ever mix the types in a vector.

If you call isconcretetype over AorB or Vector{<:C} you will get that neither of them is a concrete type. This means their code may be specialized for A or B (note however, that I do not remember if there is a guarantee of specialization). This is different from Vector{C} (which is a concrete type) and will only accept Vector{C} values (not Vector{A} or Vector{B}), which you should explicitly define (with C[] for example) otherwise you will be relying in the compiler inference working the way it works now ([A(1), B(2)] also gives a Vector{C} but I am not sure this is guaranteed to keep this way between releases). In this last case the method has always only one specialization (for the Vector{C} type) but the object has always some overhead (because it already stores boxed instances of A or B).

2 Likes

Does this specializes more than the other options?

f(x) = x[1].a
f(x :: Vector{A}) = f(x)
f(x :: Vector{B}) = f(x)
x = [ A(1), A(1) ]
f(x)

In other words, defining an upper layer of functions that specialize the methods by hand can guarantee specialization any further or the compiler will always figure this out, if that is possible?

Or, in other words: is it a waste of time writing these other methods?

Edit: thinking further, this must be the same that happens with the Union choice.

Good question. I am not 100% sure. I would expect that f(x :: Vector{A}) = ... guaranteed the specialization. This because Vector{A} is more specific than Any (the implicit type of x in f(x) = ...) and therefore the multiple dispatch guarantee the body f(x :: Vector{A}) = ... will be used.
The manual section on function barriers also seem to imply the specialization. @nospecialize (and, consequently, @specialize) documentation use the term β€˜hints’ instead of β€˜guarantee’. I think a Julia commiter would be more sure of it.

I think, in an extreme and hypothetical case, the compiler could inline the method body and find out that an even better optimization is replacing a sequence of specialized chunks of code by a loop calling non-specialized code. But, in such case, I would trust the compiler is doing its best and not try to outsmart it.

1 Like

AFAIK yes, it’s a waste of time.

If you want the actual union-splitting behavior to work, then you need your array to actually have Union{A, B} as its element type. For example:

julia> struct A
         a::Int64
       end

julia> struct B
         a::Int64
       end

julia> f(x::Vector{<:Union{A, B}}) = x[1].a  # this matches Vector{A}, Vector{B}, and Vector{Union{A, B}}
f (generic function with 1 method)

julia> x = Union{A, B}[A(1), B(2)]
2-element Array{Union{A, B},1}:
 A(1)
 B(2)

julia> @code_warntype f(x)
Variables
  #self#::Core.Compiler.Const(f, false)
  x::Array{Union{A, B},1}

Body::Int64
1 ─ %1 = Base.getindex(x, 1)::Union{A, B}
β”‚   %2 = Base.getproperty(%1, :a)::Int64
└──      return %2

You can add an abstract supertype C if you want, but it has no effect on the result.

2 Likes

I may have missed something, but why is everyone mentioning union splitting (OP have never mentioned it)? Also, he said he will never have mixed types inside an array.

From what I could understand the union splitting will essentially do what I did by splitting by hand. Effectively:

julia> struct A 
           a :: Int64
       end

julia> struct B
           a :: Int64
       end

julia> f(x :: Union{Vector{A},Vector{B}}) = x[1].a
f (generic function with 1 method)

julia> x = [ A(1), A(1) ]
2-element Array{A,1}:
 A(1)
 A(1)

julia> @code_typed f(x)
CodeInfo(
1 ─ %1 = Base.arrayref(true, x, 1)::A
β”‚   %2 = Base.getfield(%1, :a)::Int64
└──      return %2
) => Int64

julia> g(x::Vector{A}) = x[1].a
g (generic function with 1 method)

julia> @code_typed g(x)
CodeInfo(
1 ─ %1 = Base.arrayref(true, x, 1)::A
β”‚   %2 = Base.getfield(%1, :a)::Int64
└──      return %2
) => Int64

Not exactly, because in your case the β€œunknown type” is a parameter so it is in fact known at the time of method compilation/specialization. In my understanding, union splitting is often necessary when a binding in a local scope (or a boxed element of a vector) has a value of unknown type (I mean, the type of the value was not known at the time of the method compilation/specialization), it does not seem to be your case.

There I am lost. I will need to study. For others who might be interested, this is probably a nice place to start: Union-splitting: what it is, and why you should care

From my understanding of this text, Union splitting will interpret

f(x :: Union{Vector{A},Vector{B}}) = x[1].a

and expand any call to that f at compile time to something like

if typeof(x) == Vector{A}
   f(x :: Vector{A}) = x[1].a
else
   f(x :: Vector{B}) = x[1].a
end

with a minimal overhead relative to defining the independent functions, except in pathological cases probably.

1 Like

Yes, you are correct, but there are some subtle and important details (maybe you are completely aware of them, but I can not tell by what you said).

For example, if you always call f inside other functions that β€œknow” the type they are passing to f then this does not make a difference.

In the spirit of the example given by Tim, you call f just after calling another function (let us call it g) that could return either Vector{A} or Vector{B} (and the compiler know there is only these two possibilities), so the outer function that is enclosing the calls to both g and f does not really know the type of the value returned by g when it goes to call f. When Julia compile the outer function, it has two options: do union splitting which is the same as creating an if and replicating chuncks of specialized code; do not do union splitting and, consequently, call a complex procedure that will get the correct method and call it through a function pointer (if I am not wrong about the internals).

Also, Tim does not say at any point that his equivalent of our function f needs a signature like f(x :: Union{SomeType,AnotherType}) = .... Theoretically, it could be f(x) = ... what is the same as f(x :: Any) = ..., the question is not how loose f signature is, or how many different specializations the function have, but if in the context where the function is called (and not where it is defined) the compiler know there is only two, three, or four possibilities, and replicating code inside if statements is a better idea than calling the multiple dispatch handler.

With union splitting, compiler can prove stronger guarantees on some operations. Say:

julia> sumlength(x) = sum(length, x)
sumlength (generic function with 1 method)

julia> @code_warntype sumlength([])
Variables
  #self#::Core.Compiler.Const(sumlength, false)
  x::Array{Any,1}

Body::Any
1 ─ %1 = Main.sum(Main.length, x)::Any
└──      return %1

julia> @code_warntype sumlength(Union{Vector, Dict}[])
Variables
  #self#::Core.Compiler.Const(sumlength, false)
  x::Array{Union{Dict, Array{T,1} where T},1}

Body::Int64
1 ─ %1 = Main.sum(Main.length, x)::Int64
└──      return %1

In the first case (call on Vector{Any}), length of an arbitrary object may be overloaded to return anything, so the compiler falls back to boxing the return value.
In the second case, there are two branches for length, both returning Int, so it is safe to assume that the resulting function also returns an Int.

6 Likes

But you said that you never have mixed vectors, so why does that matter?

Unless I am missing something, it looks like you just need something like

f(x::AbstractVector{T}) where {T <: Union{A,B}} = first(x).a

and the only thing that requires restructing T is dispatch. If you never use f on anything else (that would require a different approach for handling x), you don’t even need that.

Union splitting will not need to kick in if you use concretely typed arguments.

1 Like

For this particular case nothing, I was only trying to learn how those things work.

Yes, that is almost what I had done first (although with a different syntax), something like:

MyVectors = Union{Vector{A},Vector{B}}
f(x::MyVectors) = x[1].a

My question was if defining an abstract type had any advantage or disadvantage relative to that option. Specifically I was in doubt if any of these options would lead to any kind of type instability. But, in my particular case, it does not seem so.