How to replace all values in an array with missing values?

There is no Union{Float64,NaNType}, so any float can be a NaN and programmers often but not always have to check for isnan. Is there an argument that NaN is problematic in a similar way as null is?

The difference is that within IEEE arithmetic, NaN has well defined semantics on all operations rather than throwing random errors.
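
As a small illustration of those semantics, here is a sketch using only Base Julia. The variable names are just for this example.

```julia
x = 0.0 / 0.0               # one way to produce a NaN; no exception is thrown

sum_result = x + 1.0        # arithmetic propagates the NaN
prod_result = x * 0.0       # still NaN, even when multiplying by zero

self_equal = (x == x)       # false: NaN compares unequal to everything, itself included
less_than = (x < 1.0)       # false: ordered comparisons with NaN are false
same_value = isequal(x, x)  # true: isequal treats NaN as equal to NaN
```

Every operation returns a value, so the caller decides when (or whether) to test with isnan.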

Maybe you could have a look at https://github.com/JuliaData/SentinelArrays.jl

This; but also people who write math primitives do have to check for NaN all the time and handle it specially.

Due to the way NaN propagation works, they often fall out correctly without special handling (but not always).

Depending on what you are doing, NaNs can propagate so that all your data become NaN and you are stuck backtracking to see where the offending NaN was generated that destroyed everything in its wake. In the past, at least, arithmetic with NaN was far slower than with non-NaNs. In an earlier project I worked on, we had code to replace NaN sentinels before any calculation, even if the calculations didn’t result in NaN propagation corrupting any other values.

Well, maybe corrupting is not the best description. What is better?

using Statistics  # mean lives in the Statistics standard library

mean([rand(1, 100) NaN])
NaN

mean([rand(1, 100) -999999])
-9900.487109109194

A few random comments.

In a way, missing is an “integer”. With integer arrays, one would use missing to represent . . . wait for it . . . missing values. For float arrays, one uses NaN. For strings, one uses the empty string.

That said, Julia makes it very easy to use missing wherever one wants to via type unions. Somewhere in the documentation it says something like, “don’t be afraid of union types, they are fast”, i.e., don’t be afraid they’ll slow down your code if you use them.
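
A minimal sketch of what that looks like for a float array, assuming the goal is to turn NaN entries into missing. The names x and y are only for illustration.

```julia
using Statistics

x = [1.0, NaN, 3.0]

# Build a copy whose element type also admits `missing`,
# swapping each NaN for `missing` on the way.
y = Union{Missing, Float64}[isnan(v) ? missing : v for v in x]

total = sum(y)                      # missing: it propagates through the reduction
clean_mean = mean(skipmissing(y))   # reductions over skipmissing ignore the missing entry
```

The typed comprehension keeps the element type as Union{Missing, Float64} even when the input happens to contain no NaN at all.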

Microprocessors are designed to process data. They don’t do as well processing no data. The NaN of the FP unit has been pressed into service as a missing value (the standard should be updated to have an explicit missing value in addition to NaNs – following the principle of don’t use one thing for two different purposes). There is no support on the integer side for missing values. I may still know an Intel Design Fellow; I was thinking of sending him this idea: use two bits behind the sign bit as flags for missing/null/unknown (00=present; 10=missing; 01=null, 11=unknown). Then, for every arithmetic and logic computation, apply tri-value logic rules to these, then set two bits in a flag register with the result. If you could then have a conditional jump based on those register bits, you could have high performance support for missing/null/unknown values. Of course, software using this feature has its integer size reduced to 61 bits (likely not a hardship for most applications). Since we’re unlikely to get 66 bit memory chips anytime soon, it’s one way to retrofit missing value support into the architecture.

Yeah, maybe corrupting is not the best term. Neither result is good, of course. I was thinking of NaN more in terms of something like the following. It can make finding the original NaN difficult, but at least you know something went wrong. Blindly processing a sentinel like -999999 may result in relatively undetectable bad data.

using Statistics  # for mean
a = rand(10)
a[3] = NaN  # From somewhere
@show a ./ mean(a)
a ./ mean(a) = [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN]
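
One way to track down the offending entries before they spread is to ask for their positions first. This is only a sketch with made-up data; nan_positions and valid are illustrative names.

```julia
using Statistics

a = [0.2, 0.7, NaN, 0.4]

# Indices of the NaN entries, found before any reduction touches them.
nan_positions = findall(isnan, a)

# Normalize using only the valid entries so one NaN does not poison the result.
valid = filter(!isnan, a)
normalized = valid ./ mean(valid)
```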

Also, I have preferred using NaN as a missing value because choosing a sentinel non-NaN value is not obvious in general purpose code. This is all independent of Julia, and I think we agree, so I’ll stop now.

I don’t think of missing as an Int, but rather a placeholder value with support for tri-valued logic on the language (the special support is the short-circuit logic operators).
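
For reference, a sketch of the three-valued rules in Base. It uses the element-wise operators & and |, which is where these rules are defined; && and || expect a Bool on their left-hand side.

```julia
# A definite operand can decide the result on its own...
or_true = true | missing     # true: the unknown operand cannot change the outcome
and_false = false & missing  # false, for the same reason

# ...otherwise the unknown propagates.
or_unknown = false | missing
and_unknown = true & missing

# == propagates missing as well, while isequal always answers with a Bool.
eq_unknown = (missing == missing)
eq_definite = isequal(missing, missing)
```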

It has neither the semantics of Int nor its memory layout. Also, NaN or an empty String is not necessarily a missing value; any other value could be used as a placeholder. It is just that NaN and "" are values that many applications never need as real data (a rather big generalization, especially for empty Strings), so they are useful in those cases.

In floating-point arithmetic, sometimes you prefer NaN poisoning over missing poisoning. Or, rather than preferring one over the other, you want both and need to be able to tell them apart.
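
A sketch of keeping both in one array and telling them apart, with illustrative data:

```julia
x = Union{Missing, Float64}[1.0, NaN, missing, 4.0]

# Count each kind separately; isnan(missing) is itself missing,
# so check ismissing first.
n_missing = count(ismissing, x)
n_nan = count(v -> !ismissing(v) && isnan(v), x)

# Skipping the missing entries still leaves the NaN to poison the sum.
partial_sum = sum(skipmissing(x))
```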

Yes, I believe so too. I happen to be looking at a sea ice thickness nc file from some institution that uses -9999 without even advertising it with the FillValue attribute. Shame.
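
For a case like that, a sketch of swapping the sentinel for missing once the data are in memory. The array raw stands in for values read from such a file, and thickness is just an illustrative name.

```julia
using Statistics

raw = [1.2, -9999.0, 0.8, -9999.0]   # stand-in data with an undocumented sentinel

# replace returns a copy; the element type widens to admit `missing`.
thickness = replace(raw, -9999.0 => missing)

n_gaps = count(ismissing, thickness)
mean_thickness = mean(skipmissing(thickness))
```

Since replace compares with isequal, the same call with NaN => missing also works for NaN sentinels.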

Oracle using “” instead of NULL is a proper pain. Even worse when you read their docs

https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements005.htm

Oracle Database currently treats a character value with a length of zero as null. However, this may not continue to be true in future releases, and Oracle recommends that you do not treat empty strings the same as nulls.

Julia doesn’t quite avoid it though, it does have null references:

mutable struct A
    x
    A() = new()
end

julia> a = A()
A(#undef)

julia> a.x
ERROR: UndefRefError: access to undefined reference

julia> a.x = 3
3

But it’s fair to say that null references are only minimally supported by the language. I don’t think there’s even a way to set a.x back to #undef in the above example?

Presumably Julia allows these null references to help with otherwise difficult/impossible initialization of certain data structures?

That’s quite different because you can’t do anything with it—any access of an undef storage location is an immediate error. So while you do have to worry about a location potentially being undef, you can never have an undef value, which is an important distinction. And once a location is defined it can never become undefined again. Getting rid of undef locations would be of interest for Julia 2.0, however. That can probably be handled with a mix of nothing unions and some clever way to specify the initialization of an array such that initialization is guaranteed.

For performance reasons, I would like this to happen only if the performance was equivalent. In the case of dynamic programming (which is performance sensitive) you may have algorithms that need X memory but execute C * N steps, where C is a constant and N < X. In such cases, having to initialize all memory may significantly impact the time of the algorithm. Maybe remove undef from Arrays but at the same time build Array over a new Buffer/MemoryBlock type which is more primitive.
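
To make the cost argument concrete, a sketch of the pattern in question, with an arbitrary size and an arbitrary number of touched cells:

```julia
n = 1_000

# Allocate without initializing: for a bits type like Float64 the contents
# are arbitrary, but every slot is a valid Float64 (no undefined references).
buf = Vector{Float64}(undef, n)
all_assigned = isassigned(buf, 1)

# The algorithm writes only the cells it actually visits.
for i in 1:10
    buf[i] = i^2
end
```

With zeros(n) all n cells would be written up front, including those the algorithm never reads.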

I agree the restrictions in Julia are a significant improvement over first-class null references, but still, as I understand the “billion dollar mistake”, the idea is to have

  • nicer code: checking for “null” is completely unnecessary,
  • better performance: the compiler also doesn’t need to include a check (not that C or C++ bother anyway, but in Go for example, modulo static analysis, the compiler will add checks for nil everywhere, which would not be necessary if there was no nil).

Now consider the following code:

mutable struct A
    x
    A() = new()
end

function f(a::A)
    # To be sure we don't throw an exception:
    if !isdefined(a, :x)
        return 0
    end
    ...
end

Does the user need to check if a.x is defined? If they want to be really sure, yes. The type system doesn’t guarantee that it’s defined.

Say the user doesn’t check. Will the compiler itself add a check? Yes.

So in this sense Julia also suffers from this “mistake”.

To be clear, the modern conception of “avoiding the billion dollar mistake” is not to avoid null/nil/#undef completely, but to let the programmer distinguish nullable vs non-nullable values using different types (this is supported by Rust, Swift and Kotlin for example). Where Julia fails is that I cannot declare f to accept only A values with defined fields. I can explicitly accept “undefined” values by writing f(a::Union{Nothing,A}) or declaring the field as Union{Nothing,Any} or whatever, but there’s no type that says that a reference cannot be #undef.
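
For comparison, a sketch of the explicit opt-in with a Nothing union. Wrapper and describe are hypothetical names. This makes absence visible in the type, but it does not by itself rule out #undef fields.

```julia
# Hypothetical type whose field is explicitly allowed to be absent.
struct Wrapper
    x::Union{Nothing, Int}
end

# Dispatch forces both cases to be handled separately.
describe(w::Wrapper) = describe(w.x)
describe(::Nothing) = "no value"
describe(x::Int) = "value $x"

absent = describe(Wrapper(nothing))
present = describe(Wrapper(3))
```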

Anyway, I agree having no null value is a huge improvement, it makes it much less likely to have an “undefined reference” exception in practice. And it’s great to hear that the developers consider getting rid of them in 2.0 :slight_smile:

The type A has effectively opted into undef checks here by having an inner constructor that creates an incompletely initialized instance. The compiler keeps track of this for each type and types that cannot be incompletely initialized aren’t impacted by this. Similarly, users don’t have to worry about undef fields here. Moreover, in general, types should never return constructed objects that have public uninitialized fields at all. If a user has to worry about fields being uninitialized, someone is doing it quite wrong. If a type wants to leave an internal field uninitialized, that’s its business. Similarly, APIs should never expose uninitialized arrays externally.
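
A sketch of the two situations side by side. Partial and Complete are hypothetical names for this example.

```julia
# Opts in: the inner constructor leaves x unset.
mutable struct Partial
    x
    Partial() = new()
end

# Does not opt in: the default constructor always requires x.
mutable struct Complete
    x
end

p = Partial()
partial_defined = isdefined(p, :x)    # false until something assigns p.x

c = Complete(1)
complete_defined = isdefined(c, :x)   # always true for instances of Complete
```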

Compare that with languages where all references can be null: every single object is affected and there’s no way to opt out. That is the billion dollar mistake: making it everyone’s problem whether they want nullability or not.

It’s a bit like saying “Julia doesn’t have null references, unless a type opts out of this”. In other words, it does have null references. I mean, there’s a reason the language has an isdefined function…

And it’s not just structs with “bad” inner constructors, every object reference can be null: (Edit: the example below is misleading, see @StefanKarpinski’s message below)

mutable struct A
    x
end

function f(a::A)
    # No need to check isdefined(x), this cannot throw, right?
    return a.x
end

julia> f(Array{A}(undef, 3)[1])
ERROR: UndefRefError: access to undefined reference

You might say “that’s different! here it’s a bug in the code, one should never access an element of an undef-constructed array before initialization!” Well, of course it’s a bug. Same as all these C/C++ bugs we’re talking about: someone accessed a field that was not properly initialized.

So yes it should never happen. But that’s exactly the point of the billion dollar mistake: instead of “should not”, the type system should guarantee “cannot”. And here it doesn’t.

Again I agree that things in Julia are much better than most languages with null: there’s a strong sense that #undef shouldn’t be part of any API. I think it works well in practice, to the point that I never bother checking for #undef. The compiler however must still add a check…

Your example is a bit off: the undefined access happens upon array access, not inside of f. There is no way for f to ever get an “undef” instance of A, nor is there any way for an instance of A to ever have an undefined x field. If you separate out the array access and the call to f you can see this:

julia> a = Array{A}(undef, 3)[1]
ERROR: UndefRefError: access to undefined reference

It never gets to the call to f and there is actually no need in f to check whether a.x is defined—it has to be. You can even see this in the generated code:

julia> code_native(f, Tuple{A})
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	_julia_f_328                    ## -- Begin function julia_f_328
	.p2align	4, 0x90
_julia_f_328:                           ## @julia_f_328
; ┌ @ REPL[8]:1 within `f`
	.cfi_startproc
## %bb.0:                               ## %top
; │ @ REPL[8]:3 within `f`
; │┌ @ Base.jl:38 within `getproperty`
	movq	(%rdi), %rax
; │└
	retq
	.cfi_endproc
; └
                                        ## -- End function
.subsections_via_symbols

There is no null check, just an unconditional field load.

Saying that “every object reference can be null” is simply incorrect. In fact, no object reference can ever be null/undef: as soon as you have a reference to an object, you know it must be defined. Locations can be undefined, and you can get an error when you try to load a reference from that location, but this is a very different thing. In other words, x.f or x[i] can be undefined but x cannot ever be undefined. That’s a significant difference and I’m not sure why you’re trying to insist that it’s no difference at all. And of course, x.f can only be undefined in a struct if the struct specifically opts into that possibility.
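
The location-versus-value distinction can be checked directly with isassigned. A sketch with an illustrative array:

```julia
# Three locations for String references, none of them filled yet.
slots = Vector{String}(undef, 3)

before = isassigned(slots, 1)   # false: the location is undefined

slots[1] = "a"
after = isassigned(slots, 1)    # true: once defined, it stays defined

s = slots[1]                    # any value you actually obtain is a real String
```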

But notice something else:

You’re still at the REPL, your environment is still alive.

You could have had a try/catch/finally block around it.

You’re not back in the shell with a Segmentation fault (core dumped)
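
A sketch of that recovery path, assuming an array with an undefined slot:

```julia
slots = Vector{String}(undef, 1)

outcome = try
    slots[1]                            # throws UndefRefError
catch err
    err isa UndefRefError || rethrow()  # only handle the error we expect
    "recovered"
end
```

The program keeps running and outcome holds the fallback value.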