Representing Nullable Values

They are actually very different. The return type is both T and Null (not either). In a statically dispatched language, the argument type is also both (either with some form of widening or a pre-computed dispatch match table). But in Julia, runtime dispatch means that the argument type must be one or the other, and cannot be both (except insofar as users can try to emulate it with Core.Inference.return_type, but that’s very difficult to get right and fairly unpredictable, so we try to use it only when it is unavoidable, such as the empty map case).
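
To illustrate that asymmetry (the helper `lookup` here is hypothetical):

```julia
# A return type may be a Union of possibilities...
lookup(d, k) = haskey(d, k) ? d[k] : nothing  # Union{Int, Nothing} for a Dict with Int values

# ...but each runtime dispatch sees exactly one concrete argument type:
describe(x::Int) = "an Int"
describe(x::Nothing) = "nothing"

describe(lookup(Dict(:a => 1), :a))  # dispatches to describe(::Int)
describe(lookup(Dict(:a => 1), :b))  # dispatches to describe(::Nothing)
```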

it’s not possible to define other types that act like a named tuple

To repeat myself, “no, this is mostly not a technical concern, because we can certainly define a mapping of corresponding APIs”. However, I don’t want to add another API to make existing Associative types act like named tuples. I want named tuples to (as much as is logical) act like existing Associative types. Why is this so unreasonable a request?

However the main purpose of named tuples is to represent names at the type level, in order to remove them from the data path as much as possible, so you can efficiently have a huge number of tuples with the same names.

Again, to repeat myself, “the proposed usage examples appear to be biased towards exploiting implementation details”. There’s nothing fundamental about Dict which prevents the names from being represented at the type level and removed from the data path as much as possible. Indeed, there’s even a short gist on one of your PRs showing how to do it in a handful of lines of code. Efficiently having huge numbers of tuples with the same names also doesn’t logically follow from this statement. As a counter-example, just for the sake of demonstration, it could be structured as a Tuple{svec(:names...), values...}. This incurs the cost of knowing about one extra pointer, but that’s a fairly inconsequential cost overall (which could likely even be elided in various circumstances). There are lots of other reasons this representation would not be ideal, but “inefficient use of memory” should not be one of them, since the optimal memory layout is usually something with better memory packing anyways – such as IndexedTables or a Tuple of Arrays (DataFrame of Columns).
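
For illustration, a minimal version of that idea (a hypothetical `FixedKeyDict`, not the gist in question):

```julia
# The keys live in the type parameter; the data path holds only the values.
struct FixedKeyDict{names, T<:Tuple}
    values::T
end
FixedKeyDict(ps::Pair{Symbol}...) =
    FixedKeyDict{first.(ps), typeof(last.(ps))}(last.(ps))

function Base.getindex(d::FixedKeyDict{names}, k::Symbol) where {names}
    i = findfirst(==(k), names)
    i === nothing && throw(KeyError(k))
    return d.values[i]
end

d = FixedKeyDict(:a => 1, :b => 2.0)
d[:b]  # 2.0; the names (:a, :b) live in typeof(d), not in its memory
```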


I want named tuples to (as much as is logical) act like existing Associative types.

Yes, they can and will implement the majority of the Associative interface. See the NamedTuples.jl package, which has haskey, keys, getindex, etc. The only difference is key=>value pair iteration, because it’s very annoying to have the interface throwing the names at you when you almost never want to consume them that way.

There’s nothing fundamental about Dict which prevents the names from being represented at the type level

That’s true, you can have a FixedKeyDict. I’ll try to exhaustively list the differences between named tuples and dicts as I see it:

  1. Iterating k=>v pairs. I think if we could do it over again, I wouldn’t make dicts work this way. Pairs are awkward because they’re not the data you care about, they simply reflect the fact that you chose to put your data in a dict.
  2. Special syntax for symbol keys: a.b, (a=1, b=2), {x, y}. Restricting to symbol keys makes that syntax correspond to the type’s supported operations; otherwise you’d have to switch syntax for different types of keys. As is often the case, adding the right restriction can improve usability.
  3. No single value type. Like tuples, named tuples track the type of each element separately, so there is no good V to put in Associative{K,V} (see the snippet just after this list).
  4. Covariance. We need named tuples to be covariant, and supporting that is much easier than adding general covariance.
  5. Simpler reflection. Named tuples have fields the same way that tuples and struct types do, so nfields, fieldnames, etc. work as expected, and the layout is the same. This doesn’t usually matter, but it’s just nice to have this kind of parity at all levels.
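
For example, point 3 in concrete terms, using the proposed syntax:

```julia
nt = (a = 1, b = "x")
typeof(nt)   # NamedTuple{(:a, :b), Tuple{Int64, String}}: per-field element types

d = Dict(:a => 1, :b => "x")
typeof(d)    # Dict{Symbol, Any}: a single V must cover all values
```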

This seems like sufficient justification for a separate data structure to me.


I can think of many languages where iteration works like Julia (Java, Python, C++, Rust). The only language that comes to mind that does something else is JavaScript.

Yes, but the alternative is that you have to switch syntax (for example, from getfield to getindex) for different types of Associative-like types. This seems like a particularly weak argument to use when proposing adding a new object with special syntax that can only be used by that type. That said, I’m not arguing against adding some new sort of type to represent (a=1, b=2) / keyword arguments in a more useful way, just questioning how that new type should behave.

I don’t know that this is actually a benefit. If we want users to be able to have other types that wrap or act like NamedTuples, then encouraging the use of builtin functions for their primary API would be a distinct disadvantage compared to using the existing generic names for these properties (length, keys, etc.), since it conflates representation with behavior (violating the “separation of mechanism and policy” principle).


special syntax that can only be used by that type

As I said, I’m fine with dot overloading. I’m not hoarding dots here.

I’ve laid out pretty clearly what I want from named tuples, and it includes the fact that a.b will be used to access them 90% of the time. That (and other features listed above) separates them from other associative collections.

So I don’t fully buy the argument that this is a dict-like type, and therefore must be exactly like other dict-like types. Maybe it’s not dict-like; maybe it’s NamedTuple-like! It can still implement many dict functions, since we’re flexible like that, but that doesn’t mean it needs to be easy to use named tuples and Dicts as drop-in replacements for each other. I’m all for having a FixedKeyDict type, like StaticArray, that can be used as an optimization of Dict as appropriate. But Dicts and named tuples seem to be used in fairly distinct contexts. E.g. I used NamedTuples as a replacement for tuples, not for dicts.

I would be 80% fine with implementing dot overloading, and then implementing named tuples on top of that. It was just surprisingly easy to reuse the existing struct machinery for this purpose, and, I would argue, elegant in a way.


You seem to be operating in some shadow world where Julia is a dynamic language for arguments but a static one for return types. The core confusion here may stem from the fact that “return type” doesn’t actually make sense for a method – it only makes sense for a particular invocation of a method, which has only a single concrete return type: the type of the actual value returned by that invocation of the method. The best you can do for a method in general is union over all the possible invocations and call that “the return type” – in which case, yes, it would be a union, of course. Which brings me back to asking: so what?
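
To make the two senses of “return type” concrete (using the non-exported Base.return_types for the inferred, union-over-invocations view):

```julia
f(x) = x > 0 ? x : nothing

Base.return_types(f, (Int,))  # Any[Union{Nothing, Int64}]: the union over all invocations
typeof(f(1))                  # Int64: the single concrete type of this invocation
typeof(f(-1))                 # Nothing
```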

The actual return type is, of course, the most important consideration for the algorithm being implemented. However, the static return type is the source of the performance. This statement has a dual for arguments, since it is generally possible to make statements about both the static and dynamic types of the arguments, and the performance considerations are semi-related.

I still don’t know what you’re getting at… Why is any of this a problem?

It’s a performance asymmetry: the trait that makes returning a union fast (and thus generally preferable to the more cumbersome Nullable struct) does not generalize to arguments (where being able to describe something as Nullable is required for performance).

Isn’t it the job of inference to decide if e.g. it’s better to widen arguments and have a less specialised function? It’s the compiler that has this asymmetry, not the language.

The history of Nullable is littered with PRs and ideas on how to make them work better in a data science context that were never merged/implemented. I think many of them were not taken up because folks felt those ideas were too magical for a software engineering Nullable type (and I strongly agree with that). I don’t have that constraint with DataValue, i.e. I’m just picking up many of these ideas. A second difference is that my array type DataValueArray does not use the higher-order approach to lifting in functions like map and broadcast that NullableArray pioneered. Instead, things like map and broadcast work exactly as for a normal array (and the lifting happens at the scalar level, i.e. via white-list-like methods defined on DataValue). Plus, I have a couple of other things that I believe will help a lot in making things smooth. This is kind of a high-level overview; probably best to just wait until I have something ready to show before we start discussing the details 🙂
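
Roughly, what scalar-level lifting means, with a stand-in type rather than the actual DataValues.jl API:

```julia
# `MaybeValue` is illustrative only; DataValues.jl defines its own DataValue type.
struct MaybeValue{T}
    hasvalue::Bool
    value::T
    MaybeValue(x::T) where {T} = new{T}(true, x)
    MaybeValue{T}() where {T} = new{T}(false)
end

# Lift `+` once, at the scalar level (a "white-list" method):
Base.:+(a::MaybeValue{T}, b::Number) where {T} =
    a.hasvalue ? MaybeValue(a.value + b) : MaybeValue{typeof(zero(T) + b)}()

# map and broadcast then need no lifting machinery of their own:
map(x -> x + 1, [MaybeValue(1), MaybeValue{Int}()])
```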

That was the recommended approach in John Myles White’s julep on missing data and analyzed extensively there.

And just to be entirely clear: you are proposing that the data stack uses Union{T,Null}, not Union{Value{T},Null}, right? It was not clear to me from this thread whether the folks that would like to see the extra safeguards one gets from Value would want that for the data stack usage or not.

Well, it exists, but is it really required? There is also an “untyped” missing value in DataValues.jl, namely const NA = DataValue{Union{}}(). If the optimizations for small union types in base would also work for inferred return types like Union{T,DataValue{Union{}}} and Union{DataValue{T},DataValue{Union{}}} it would be great because folks could circumvent the requirement for a typed missing value in many cases.

There is one other issue I raised here somewhat, but that I also want to bring up in this thread: it seems genuinely unclear to me at this point what the “best” memory layout for columns with missing data in a DataFrame-like structure is. For example, the missingness mask could be expressed either as a tight bitmask, or as a byte-sized type tag per element (which seems to be the current plan in Base). I talked with @Jameson about this at JuliaCon and he pointed out that it really depends on the type of algorithm one is running which of those will be more efficient. Other considerations in this area are emerging standards for in-memory layout for tables, like Apache Arrow. It might be really desirable to have a julia table type be super compatible with that initiative. Or not. My point is that at least to me it is not clear at this point which of those considerations is more important/relevant. I also don’t think those things will be sorted out in the julia 1.0 time frame; some of them depend on larger industry trends etc. I think having things like DataValueArray and NullableArray (in packages) is beneficial in such a situation: we don’t have to make decisions about memory layout now for the data stack that will be very difficult to change later on. Note that this does not imply that I think we shouldn’t potentially use something like Array{T?} internally in the implementations of DataValueArray, but I would much prefer that we be able to use that first, and then experiment with something that uses BitArrays or even another option.

Good question. I think it comes down to the line of reasoning that, if it’s not required, then you can’t rely on it being present, but then you can’t do anything with it, so why is it there? Indeed, the small-union optimizations will work for this case also, but the runtime cost is a bit higher, since there’s an extra level of unwrapping that the runtime must do (it’s effectively now Union{Value{T}, Null{T}, Null{Union{}}}).

The naive memory layout of Array{DataValue{T}} would also suffer greatly compared to Array{T?} (since it is interspersing the data with the nullable bit), but it sounds like you know that already, from your mention of DataValueArray and NullableArray as mechanisms for separating the two fields (data and null bit).
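
Roughly, the two layouts, with sizes for a column of Int64 on a 64-bit machine (type names illustrative):

```julia
# (1) Interspersed: the flag sits next to each value, and alignment
#     pads every element to 16 bytes.
struct WithFlag
    hasvalue::Bool
    value::Int64
end
sizeof(WithFlag)  # 16

# (2) Separated (the DataValueArray/NullableArray approach):
#     8 bytes per value plus 1 bit of mask, in two dense buffers.
struct NullableColumn
    values::Vector{Int64}
    isnull::BitVector
end
```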


CardinalDicts

I just wrote a small package that couples a BitVector with a preallocated Vector{T} for some T, to allow faster get/set for sequentially available information (or naturally 1:n keyed values) where any value may be, or become, unavailable, opaque, or null (the semantics are flexible).
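
The underlying idea, sketched (hypothetical names; see the package for its real API):

```julia
# Couple a presence mask with preallocated storage for keys 1:n.
struct CardinalSlots{T}
    present::BitVector
    slots::Vector{T}
    CardinalSlots{T}(n::Integer) where {T} =
        new{T}(falses(n), Vector{T}(undef, n))
end

Base.haskey(d::CardinalSlots, i::Integer) = checkbounds(Bool, d.present, i) && d.present[i]
Base.getindex(d::CardinalSlots, i::Integer) =
    haskey(d, i) ? d.slots[i] : throw(KeyError(i))
function Base.setindex!(d::CardinalSlots, v, i::Integer)
    d.slots[i] = v
    d.present[i] = true
    return d
end

d = CardinalSlots{Float64}(100)
d[3] = 1.5
haskey(d, 3), haskey(d, 4)  # (true, false)
```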

The little benchmarking I have done is encouraging.


@JeffreySarnoff this package is very cool and deserves an ANN post of its own, I think.

Sorry for a naive question, but can someone provide a link to “John Myles White’s Julep”? The closest I could find was an abandoned Nullable Redesign PR.

That’s it.

As far as I can tell, Union{T, Nothing} is the current way to represent a nullable value in 0.7. I’m curious what happened to the T? syntax. Is it still coming or did it get axed for some reason?


T? won’t be supported in 1.0, as there’s some discussion about whether it should mean Union{T, Missing} or Union{T, Nothing}. The group of users who will complain the most will get the syntax. 😉


Will both be equally performant in v1.0?


Yes, their representation is the same (a standard singleton type), and they will use the same special cases so that the Union type is preserved instead of using Any as for other Unions.
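
For example, both of the following get the compact inline layout with a one-byte type tag, rather than boxed elements behind Any:

```julia
x = Union{Int, Nothing}[1, nothing, 3]  # values stored inline + tag byte per element
y = Union{Int, Missing}[1, missing, 3]  # same representation
```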


[Sorry if this option was already discussed to death and I missed it]

Regarding the possibility of #undef in Array{T} for pointer arrays:

What about having two array types (for lack of better terms): UnDefArray{T}, with possible #undef entries (same behaviour as now), and DefArray{T}, where the latter assumes that all entries are valid?

That way we would get rid of implicit isassigned checks, allow APIs to specify whether they can handle the possibility of #undef, and, as a bonus, mark accesses as !dereferenceable, allowing the compiler to reorder array accesses (no need to ensure that the order of segfaults is correct, since they cannot happen; as far as I understand, we currently don’t do this). People could allocate an UnDefArray, populate it, and then transform it into a DefArray (a single write to the type tag, with optional verification that all values are actually assigned). push! into a DefArray would need to be done carefully, since resize! can introduce #undef.
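
A rough sketch of what the DefArray half could look like at the user level (types hypothetical; the real win would come from the compiler exploiting the invariant):

```julia
struct DefArray{T, N} <: AbstractArray{T, N}
    data::Array{T, N}  # invariant: no #undef entries
    function DefArray(a::Array{T, N}) where {T, N}
        # One-time verification; afterwards every read is known to be valid.
        for i in eachindex(a)
            isassigned(a, i) || throw(UndefRefError())
        end
        return new{T, N}(a)
    end
end
Base.size(d::DefArray) = size(d.data)
Base.IndexStyle(::Type{<:DefArray}) = IndexLinear()
Base.getindex(d::DefArray, i::Int) = d.data[i]

a = Vector{String}(undef, 2)
a .= ["x", "y"]      # populate first...
safe = DefArray(a)   # ...then freeze into the checked wrapper
```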

Also, could pointer(array) become 64-byte dereferenceable? For this, one would need to let pointer(array) == pointer_from_objref(array) for empty, newly initialized arrays, and one would need to be slightly careful about shrinking arrays from the beginning.