Findfirst for Dicts with `nothing` keys

Liso · March 1, 2018, 8:53pm

Is there plan to fix this function?

julia> d=Dict(nothing=>1)

julia> z = findfirst(x->x==1, d)  # z === nothing !

julia> d[z]  # but nothing is correct key
1

julia> z = findfirst(x->x==0, Dict(nothing=>1))  # or could be incorrect key!

One possible solution could be return vector like findall… (maybe I am wrong but think that compiler could make it without performance penalty)

(BTW it could also make from isempty universal test for success)

mbauman · March 1, 2018, 9:13pm

I’m gonna channel Steve Jobs here: Don’t use nothing to represent something?

We’ve decided to go with a sentinel implementation here. You’ll never be able to find that sentinel, no matter what it is. Yes, it’s a tiny limitation in its generality, but I don’t find it horrible.

Liso · March 1, 2018, 10:10pm

It is horrible. It is not about limitation it is about consistency.

If you insist on special status of nothing then proper solution would be make something like Dict(nothing=>1) strictly forbidden. (means throwing exception). It would be consistent.

Current situation is bad. Suppose to have R big amount of results from measurement. You would like to collect counts for result values. Dict is natural representation of this collection. And than you could ask if in R is any value more than x times:

julia> R = [5,5,5,nothing,nothing,nothing];  

julia> d = Dict([[value, count(i->i==value, R)] for value in R])
Dict{Union{Nothing, Int64},Int64} with 2 entries:
  nothing => 3
  5       => 3

julia> findfirst(i->i>2, d)  # return nothing

You could say that this is coders mistake and they have to know and remember that nothing is special…

I don’t buy it! It would be (if we don’t fix this) language property to support these type of bugs.

For some inspiration what we could cause we could look at List of software bugs - Wikipedia

By returning Vector we could avoid this problems.

[]  # we could return nothing
[nothing]  # we could return `nothing` too
[[]]  # we have no problem with empty vector

jeff.bezanson · March 1, 2018, 10:48pm

How is nothing the result of a measurement? In that particular example I would use missing.

ExpandingMan · March 1, 2018, 11:04pm

The current behaviors of nothing and missing are the result of extensive (by which I mean, truly ad nauseum) discussion and development experience both within the Julia data science community and core Julia. As far as I know none of the widely used packages including DataFrames, CategoricalArrays, IndexedTables, all of the Anthoff-verse, nor any of the other JuliaData, JuliaStats or machine learning stuff would create an array like the R in your snippet above, so I think it’s safe to say that this is a standard that has been gelling for some time now. The fact that nothing and missing have “special” behavior is extremely deliberate. The current behavior of findfirst allows you to dispatch on code which returns Nothing, and I expect this sort of behavior to standardize as well, if it hasn’t to some extent already.

Just as you woulndn’t consider it reasonable to use NULL as a stand-in for just-any-old-thing in C or C++, you should not use nothing this way either.

ScottPJones · March 1, 2018, 11:38pm

We actually did use nothing as a dictionary key, when processing JSON files (i.e. for JSON null values).

Another way of handling this, that would allow using anything (including nothing) as a key, would be to have different types of collection use different values for different types of collections.

For collections that can use an “in-band” sentinel (such as 0 for AbstractString and Vector [not necessarily AbstractVector though, since it could be an OffsetVector or a 0-based vector]), use that, otherwise, return a simple tuple, (true, value) or (false, emptyvalue) (where emptyvalue just needs to be something that is one of the types that the elements of the collection could be, it’s never looked at). That could be hidden by two functions, found(collectionortype, result) and find_result(collectionortype, result).

I believe the code generated for is as fast or faster on current master than using nothing as a sentinel, and is more general.

Liso · March 2, 2018, 4:51am

Weeell! It help me to internally accept that nothing as key will throw error.

Please do it!

Liso · March 2, 2018, 5:11am

I think I understand why ad nauseum! This really complicate language (at least from user perspective) . (I say nothing about necessity or usefulness now! I am open to accept this complication but need more time to thinking)

You probably don’t know what NULL in C++ represents? 0 is probably most used key in C++!
Look for example at: Using NULL in C++? - Stack Overflow

Liso · March 2, 2018, 6:31am

You mean during processing? Because names in object are limited to strings ( https://json.org/ )

And BTW wouldn’t be better to use missing here (if nothing has special language meta meaning) ?

julia> JSON.parse("""[null]""")
1-element Array{Any,1}:
 nothing

ScottPJones · March 2, 2018, 10:13am

In this case, we were keeping track of occurrences of values, null, false, true, etc., so the value itself was used as the key in a Dict. (although we ran into a different problem there, because Dict considers the keys false and 0 to be the same (and true and 1), something we hadn’t at all expected, because if 1 gives an error.

This was something from almost 2.5 years ago (before v0.4 was even released), missing is very recent.
Also, missing has a lot of special meaning, with 3VL. (missing | true => true, but missing & true => missing).

stevengj · March 2, 2018, 1:10pm

Since any value could be a dictionary key (including missing or an array), there is nothing (no pun intended) that findfirst could return on failure that couldn’t collide with a key. You’d have to throw an exception instead, which is conceptually fine but would be much slower.

In the rare (but not impossible) case of dictionaries that may have nothing as a key and you need linear search, the best alternative is probably just to write your own loop, which is not any slower than findfirst and takes 4 lines instead of 1.

for v in values(d) # or (k,v) in pairs(d)
      v == 1 && return something
end
return somethingelse

Tamas_Papp · March 2, 2018, 1:56pm

In the spirit of #23642, IMO the cleanest approach for a custom function would be to do it without sentinel values at all, ie returning a Union{Some{T}, Nothing}.

foobar_lv2 · March 2, 2018, 2:22pm

I’m not so sure. Returning non-bitstype tuples is currently pretty hit-and-miss with respect to avoiding the allocation for the tuple.

Once this https://discourse.julialang.org/t/immutables-with-reference-fields-why-boxed/7706 is fixed (if ever) I totally agree that sentinel values are stupid-- as far as I understand, we should have several free (already invalidated) registers where we can place our bool, and the actual return value can go via sret / other register / whatever the current ABI wants.

Maybe an easier clean fix would be to special case that Tuple{atype, btype} with inferred bitstype never allocates. Of course, unless this is packed into a tuple / array / etc, or this rule would need to be applied recursively. In other words: I am asking for a special ABI for returning pairs. I would even be happy if I was forced to annotate that I want @pair_return_ABI or there was a magic built-in __unpacked_pair that lazily materializes the tuple.

Liso · March 2, 2018, 2:59pm

Really unbelievable! I think it is bug.

Or maybe we could at least change findfirst to something like findkey(value, dict, sentinel=nothing)? (to be universal and to help avoid bug with collision nothing)

ExpandingMan · March 2, 2018, 3:10pm

Sorry, I should have said null_ptr or whatever is the modern, type-correct version of NULL (it’s been a while for me). Even so, if you are reading C++ code you would not expect to see NULL as a key, you would expect to see 0.

I still don’t see how Nothing having special behavior complicates the language. Different types behave differently, it’s a core concept of programming. Should you expect sin(nothing) to attempt to return something? If you think you need another type that behaves like NaN, we have missing.

ScottPJones · March 2, 2018, 3:25pm

I think the problem is that it has been overloaded with a number of different meanings.

for tryparse, it means that it was unable to parse the string (but unfortunately, it can’t also return any useful information about why it was unable to parse the string, for that, you have to raise an error)
to indicate that nothing was returned from a method, which is used to indicate that there is no output (and people have rightly complained that when nothing is being used to indicate something meaningful, it is a pain to have it not printed out)
to indicate that something was not found in some search functions. Note: it’s still very inconsistent, because if you are searching for a regex or string, you get a range, which you need to test to see if it’s empty, but if it is a predicate, (equalto or occursin) it is either an index or nothing. The searchsorted* functions generally return a range, even if not found, because you want to get back the insertion location, if not found.

I think that cases 1 and 3 should use different types, for tryparse, a type that can give information about why the parse failed (maybe simple return the Exception type, instead of throwing it), and for find* that don’t return a range, a notfound instance of a NotFound type (if a singleton must be used).

Tamas_Papp · March 2, 2018, 3:34pm

The hard part about this is designing the API. There is no reason why it can’t / shouldn’t be done in a library (using a different function, of course).

stevengj · March 2, 2018, 3:37pm

It’s intentional behavior, not a bug. true == 1 and isequal(true, 1) and Bool <: Integer are all true in Julia.

There have been lots of discussions about whether it is desirable to treat Bool values as integers. See e.g. Should true == 1? · Issue #15983 · JuliaLang/julia · GitHub, make Bool <: Unsigned (i.e. UInt1) · Issue #19168 · JuliaLang/julia · GitHub, Introduce `UInt1` type to replace misuse of `Bool` · Issue #18367 · JuliaLang/julia · GitHub

Liso · March 2, 2018, 3:44pm

I have no problem with special behavior. I am just saying that if it has to have special behavior it need to be consistently applied.

So in case of findfirst I expect that we:
a) fix function and return vector like findall do
   or
b) forbid nothing as Dict key
   or
c) throw exception in case that nothing key is proper return value
   or
something other which I don’t think is useful to search without consensus that current status is wrong.

ExpandingMan · March 2, 2018, 4:09pm

I’m certainly on-board with making whatever (consistent) changes are needed to make this transparent including forbidding setting nothing as a Dict key. I think this should be as simple as

setindex!(a::AbstractDict, ::Nothing, idx) = throw(ArgumentError("dude, don't do that"))

which should have no performance consequences.

I’m not quite sure what you mean by “throw exception in case that nothing key is proper return value”. I don’t see why findfirst should return an AbstractVector, that seems totally unnecessary.