Nullable vs Union

question

#1

julia has a dedicated type Nullable. however, the language naturally supports types like Union{Void, Int}, where Int could be any type. what is the benefit of having Nullable?


#2

The function that always returns Nullable{Int} is type stable where as a function that returns Union{Int,Void} is not.


#3

In addition to the desirable performance optimizations that inspired its design, having Nullable be its own concrete type means that there are values of that type and that you can dispatch on the distinct type of those values.

In contrast, if you were to work with Union{S, T}, then you could not have the following set of functions get dispatched on all in the same program:

  • f(x::S) = 1
  • f(x::T) = 2
  • f(x::Union{S, T}) = 3

The last definition is less specific than the first two because Union isn’t a concrete type, so it would never be called.

This is, in my opinion, why Julia’s Union is not really a sum type in the sense found in some languages. The type of a binding can be a Union, but the type of a value never is. Because dispatch is determined by the types of values, not having any values of type Union is a potential semantic limitation relative to Nullable.


#4

type stability: is this a real difference though? if i understand correctly, Nullable{Int} is a pointer to a Bool and an Int. while Union{Void, Int} is a pointer to a tag and an Int (boxed Int)


#5

about being a concrete type, i see the difference. however, why would we want to give a dedicated Union parameter if we already have a Void case and a T case? does that make practical sense? isn’t it a benefit rather than a drawback? the branch can be pushed back to compile time, and the resulting code might be better optimized. just thinking out loud here.


#6

I had been thinking about this a little of late also. The problem with Union is that the lack of type stability automatically leads to the full multiple dispatch algorithm done dynamically, and as @johnmyleswhite points out there is no way at the moment to “ask” the compiler to optimize that (however my instinct is that automatic compiler optimizations for Union would be a more user-friendly route than hand written f(::Union{A,B}) specialized methods).

Nullable{T} is kind-of like a container for Union{T, Null{T}} (where Null{T} is something I just made up) but it is probably better to think of it as container with 0 or 1 values of type T inside. However, I think sometimes you do want what I’d call “Null{T}()”, and sometimes you really do want Union{A,B} to represent your data. So I have started to make a package called Switches.jl which defines Switch{A,B} - a container which contains either A or a B and so a bit like a Union{A,B}, but with very fast dispatch under e.g. map and broadcast (dot-call syntax).

So you could have Switch{Int, Void} to organize fast dispatch for @pint’s opening question. I did play around and went further and defined the singleton-type Unkown{T} (representing some unknown value of type T) and Optional{T} as a typealias for Switch{T, Unknown{T}} which is quite a lot like Nullable{T} but hopefully a bit more data friendly (e.g. IHMO it’s easier to organize 3-valued Boolean logic without impacting the semantics of Nullable).


#7

ah, now i see a big diff.

the Void part of the Union does not know what the other part could have been. in particular

f(x::Nullable{Int})
f(x::Nullable{Float64})

is a good definition, but

f(x::Union{Void, Int})
f(x::Union{Void, Float64})

is not


#8

however based on Switches, this can be fixed by

immutable Null{T} end

Null(T::Type) = Null{T}() # allows nice Null(Int)

typealias AlternativeNullable{T} Union{T, Null{T}}

but i understand that it will still result in dynamic dispatch, even if the signature has the same union type.


#9

Cool work. If you’re interested in reading more of the CS literature on this topic, this approach is exactly what a sum type is meant to do: https://en.wikipedia.org/wiki/Tagged_union


#10

Yes @pint, AlternativeNullable is what I was calling Optional, and Null{T} was Unknown{T}. The “type” shortcut seems sensible.

If you treat it as a container that uses map and broadcast or dot-call (e.g. abs.(optional)), then in this case “dynamic dispatch” is a Boolean-based branch operation, which is almost free compared to Julia’s full multiple dispatch algorithm (and is similar to what it would cost in C/C++ using explicit branching or vtables). You can’t get around some dynamic dispatch here since you only know the type (and therefore the specialized method) at run time.


#11

Thanks @johnmyleswhite for the link! I had no idea what to call it to be honest. Variant seems to be in the spirit of Switch, or TaggedUnion{A,B} might be reasonably clear to Julians (would many people be familiar with “sum types”?)


#12

random thought: the compiler could optimize it to some degree. for batch operations on unions, it could find the matching methods for all possible member types, and then branch on the actual type. however, with large unions ( Union{Number, AbstractRange} ) and multiple function parameters, it gets out of hand quickly.


#13

Right, at Juliacon this idea was floated by/with the core devs. The idea proposed was that for Unions of two concrete types, one could handle dispatch differently.

As a simple implementation, inference.jl could modify lines of code such that

f(a::Union{A,B})

becomes

if isa(a, A)
    f(a::A)
else
    f(a::B)
end

(this example comes directly from @timholy, who noted the second code runs significantly faster than the first - i.e. more like Python speed rather than Octave speed).

As a more complex thing, you could implement Union{A,B} internally as a variant type with special dispatch rules, a different kind of (“immutable”/stack allocatable) box, and so on, to get even better performance (on par or better than my Switch), but I think that would be more work. The advantage of all of this being that this becomes invisible to the user, meaning they don’t need to think about special wrapper types and so-on.


#14

As a simple implementation, inference.jl could modify lines of code such that

Yes, it does this now (as of v0.5). There’s still many more optimizations though to get the compiler to emit the same code for Union as your Switch type. But I had a look through your code for Switch, and agree that’s what it should do. I think that package may be for Union what FastAnonymous was for closures :slight_smile: .

If you’re interested in working on that part of the codebase, I think a good place to start would be to get codegen to handle stack allocation of Unions as well as Switch. I think that roughly corresponds to updating isbits_spec at https://github.com/JuliaLang/julia/blob/0ba4ea72428b49f30352e4d1fb4b077d0a76b716/src/codegen.cpp#L723 and then flowing the change out from there so that boxed, emit_unboxed, and emit_assignment know how to handle it (and probably also add a bit to jl_cgval_t to track which representation it is using).

Alternatively / similarly, you could start from type layout and make the equivalent changes there (https://github.com/JuliaLang/julia/blob/0ba4ea72428b49f30352e4d1fb4b077d0a76b716/src/alloc.c#L1028). However, there might be more places that this touches in the codebase, since you would need to update both the runtime code (builtins.c) and the code generator optimizations (codegen.cpp / cgutils.cpp).


#15

Thanks Jameson! :slight_smile:

I appreciate the tips - not sure if I will have sufficient time to complete such work in the next couple of weeks but it certainly looks interesting. I’m still getting around the fact that you are saying there are two disjoint ways of implementing this. Is the difference that the first way only affects fully compiled code whereas the second affects the layout and so-on even in the “interpreted” code? Are there any other pros or cons to be aware of with pursuing one or the other?


#16

They are independent optimizations. The latter affects the storage behavior, the former affects the computational behavior. The former is mostly a hidden internal optimization. The latter also affects the C layout and need to be documented. Both are needed for good performance characteristics.


#17

OK, thanks again. After going away and thinking about it a while this seems clear to me now.

I have a question but I’ll move it over to the Development section.