There is no Union{Float64,NaNType}, so any float can be a NaN, and programmers often (but not always) have to check with isnan. Is there an argument that NaN is problematic in a similar way to null?
The difference is that within IEEE arithmetic, NaN has well-defined semantics for all operations rather than throwing random errors.
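For example (plain Julia, showing the IEEE-defined behaviour):

julia> NaN + 1.0        # arithmetic with a NaN quietly yields NaN
NaN

julia> NaN == NaN       # NaN is not even equal to itself, hence isnan
false

julia> isnan(0.0 / 0.0) # 0/0 produces a NaN instead of throwing
true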
Maybe you could have a look at https://github.com/JuliaData/SentinelArrays.jl
This; but also people who write math primitives do have to check for NaN all the time and handle it specially.
Due to the way NaN propagation works, they often fall out correctly without special handling (but not always).
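A couple of the cases where it does not simply fall out, for concreteness:

julia> NaN^0            # the NaN silently disappears here (IEEE pow(NaN, 0) == 1)
1.0

julia> NaN > 1.0        # false, so a comparison branch can quietly drop the NaN
false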
Depending on what you are doing, NaNs can propagate so that all your data become NaN and you are stuck backtracking to see where the offending NaN was generated that destroyed everything in its wake. In the past, at least, arithmetic with NaN was far slower than with non-NaNs. In an earlier project I worked on, we had code to replace NaN sentinels before any calculation, even if the calculations didn't result in NaN propagation corrupting any other values.
Well, maybe corrupting is not the best description. What is better?
using Statistics

mean([rand(1, 100) NaN])        # the NaN propagates and poisons the result
NaN
mean([rand(1, 100) -999999])    # the sentinel silently skews the statistic instead
-9900.487109109194
A few random comments.
In a way, missing is an "integer". With integer arrays, one would use missing to represent . . . wait for it . . . missing values. For float arrays, one uses NaN. For strings, one uses the empty string.
Although Julia makes it very easy to use missing wherever one wants via type unions. Somewhere in the documentation it says something like, "don't be afraid of union types, they are fast", i.e., don't be afraid they'll slow down your code if you use them.
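For instance, a union-typed vector works directly, and the missings just have to be skipped explicitly:

v = [1, missing, 3]        # element type is Union{Missing, Int64}
sum(skipmissing(v))        # 4 -- the missing entry is skipped explicitly
count(ismissing, v)        # 1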
Microprocessors are designed to process data. They don't do as well processing no data. The NaN of the FP unit has been pressed into service as a missing value (the standard should be updated to have an explicit missing value in addition to NaNs, following the principle of not using one thing for two different purposes). There is no support on the integer side for missing values. I may still know an Intel Design Fellow; I was thinking of sending him this idea: use two bits behind the sign bit as flags for missing/null/unknown (00=present, 10=missing, 01=null, 11=unknown). Then, for every arithmetic and logic computation, apply tri-value logic rules to these, then set two bits in a flag register with the result. If you could then have a conditional jump based on those register bits, you could have high-performance support for missing/null/unknown values. Of course, software using this feature has its integer size reduced to 61 bits (likely not a hardship for most applications). Since we're unlikely to get 66-bit memory chips anytime soon, it's one way to retrofit missing-value support into the architecture.
Yeah, maybe corrupting is not the best term. Neither result is good, of course. I was thinking of NaN more as something like the following, and it can make finding the original NaN difficult, but at least you know something went wrong. Blindly processing a sentinel like -999999 may result in relatively undetectable bad data.
using Statistics

a = rand(10)
a[3] = NaN   # From somewhere
@show a ./ mean(a)
a ./ mean(a) = [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN]
Also, I have preferred using NaN as a missing value because choosing a sentinel non-NaN value is not obvious in general-purpose code. This is all independent of Julia, and I think we agree, so I'll stop now.
I don't think of missing as an Int, but rather as a placeholder value with language support for three-valued logic (the special support being in the logical operators like & and |). It has neither the semantics of Int nor its memory layout. Also, NaN or an empty String is not necessarily a missing value, since any other value could be used as a placeholder. It's just that NaNs or "" are values a lot of applications don't otherwise look for (quite a big generalization, especially for empty Strings), so they are useful for those cases.
In floating-point arithmetic you sometimes prefer NaN poisoning over Missing poisoning, or maybe rather than preferring one over the other you want them both and want to be able to differentiate them.
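A small illustration of both points, three-valued logic and NaN/missing coexisting in one array:

missing & true              # missing: unknown AND true is unknown
missing & false             # false: unknown AND false is definitely false

x = [1.0, NaN, missing]     # Vector{Union{Missing, Float64}}
ismissing.(x)               # [false, false, true]
isnan.(coalesce.(x, 0.0))   # [false, true, false] -- the two "poisons" stay distinguishable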
Yes, I believe so too. I happen to be looking at a sea ice thickness nc file from some institution that uses -9999 without even advertising it with the FillValue attribute. Shame.
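For what it's worth, once the data is read in, a sentinel like that can at least be converted; a minimal sketch, assuming the values ended up in a Float64 array called thickness (the variable name is made up):

using Statistics

# Hypothetical: `thickness` was read from the nc file as a plain Float64 array.
cleaned = replace(thickness, -9999.0 => missing)   # sentinel -> missing
mean(skipmissing(cleaned))                         # statistics now ignore the gaps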
Oracle using '' instead of NULL is a proper pain. Even worse when you read their docs:
https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements005.htm
Oracle Database currently treats a character value with a length of zero as null. However, this may not continue to be true in future releases, and Oracle recommends that you do not treat empty strings the same as nulls.
Julia doesn't quite avoid it though; it does have null references:
mutable struct A
    x
    A() = new()
end
julia> a = A()
A(#undef)
julia> a.x
ERROR: UndefRefError: access to undefined reference
julia> a.x = 3
3
But it's fair to say that null references are only minimally supported by the language. I don't think there's even a way to set a.x back to #undef in the above example?
Presumably Julia allows these null references to help with otherwise difficult/impossible initialization of certain data structures?
That's quite different because you can't do anything with it: any access of an undef storage location is an immediate error. So while you do have to worry about a location potentially being undef, you can never have an undef value, which is an important distinction. And once a location is defined it can never become undefined again. Getting rid of undef locations would be of interest for Julia 2.0, however. That can probably be handled with a mix of nothing unions and some clever way to specify the initialization of an array such that initialization is guaranteed.
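To make the "nothing unions" idea concrete, here is roughly what it could look like with today's language (just a sketch, not a 2.0 proposal):

# Every slot starts as `nothing` instead of #undef, so reading any slot is
# always well defined and the possible absence shows up in the element type.
slots = Vector{Union{Nothing, String}}(nothing, 3)
slots[1]              # nothing -- not an UndefRefError
slots[2] = "filled"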
For performance reasons, I would like this to happen only if the performance were equivalent. In the case of dynamic programming (which is performance sensitive) you may have algorithms that need X memory but execute C * N steps, where C is a constant and N < X. In such cases, having to initialize all memory may significantly impact the running time of the algorithm. Maybe remove undef from Arrays, but at the same time build Array over a new Buffer/MemoryBlock type which is more primitive.
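A rough way to see the cost being described (exact numbers will of course vary by machine):

n = 10^8
@time Vector{Int}(undef, n);   # allocation only; the memory is never written
@time zeros(Int, n);           # allocation plus a pass writing zero to every element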
I agree the restrictions in Julia are a significant improvement over first-class null references, but still, as I understand the "billion dollar mistake", the idea is to have
- nicer code: checking for "null" is completely unnecessary,
- better performance: the compiler also doesn't need to include a check (not that C or C++ bother anyway, but in Go, for example, modulo static analysis, the compiler will add checks for nil everywhere, which would not be necessary if there were no nil).
Now consider the following code:
mutable struct A
    x
    A() = new()
end

function f(a::A)
    # To be sure we don't throw an exception:
    if !isdefined(a, :x)
        return 0
    end
    ...
end
Does the user need to check if a.x is defined? If they want to be really sure, yes. The type system doesn't guarantee that it's defined.
Say the user doesn't check. Will the compiler itself add a check? Yes.
So in this sense Julia also suffers from this "mistake".
To be clear, the modern conception of "avoiding the billion dollar mistake" is not to avoid null/nil/#undef completely, but to let the programmer distinguish nullable vs non-nullable values using different types (this is supported by Rust, Swift and Kotlin, for example). Where Julia fails is that I cannot declare f to accept only A values with defined fields. I can explicitly accept "undefined" values by writing f(a::Union{Nothing,A}) or declaring the field as Union{Nothing,Any} or whatever, but there's no type that says that a reference cannot be #undef.
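For example, the explicit opt-in described above, with the absence encoded in the field's type rather than left as #undef (B and f2 are hypothetical names):

mutable struct B
    x::Union{Nothing, Int}
    B() = new(nothing)               # always fully initialized; #undef is impossible
end

f2(b::B) = b.x === nothing ? 0 : b.x # the check is demanded by the type, not by convention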
Anyway, I agree having no null value is a huge improvement; it makes it much less likely to have an "undefined reference" exception in practice. And it's great to hear that the developers consider getting rid of them in 2.0.
The type A has effectively opted into undef checks here by having an inner constructor that creates an incompletely initialized instance. The compiler keeps track of this for each type, and types that cannot be incompletely initialized aren't impacted by this. Similarly, users don't have to worry about undef fields here. Moreover, in general, types should never return constructed objects that have public uninitialized fields at all. If a user has to worry about fields being uninitialized, someone is doing it quite wrong. If a type wants to leave an internal field uninitialized, that's its business. Similarly, APIs should never expose uninitialized arrays externally.
Compare that with languages where all references can be null: every single object is affected and there's no way to opt out. That is the billion dollar mistake: making it everyone's problem whether they want nullability or not.
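For comparison, a type without such an inner constructor can never be observed with an undef field, so the question never arises for its users (Complete is a made-up name):

mutable struct Complete
    x::Int
end

c = Complete(1)
isdefined(c, :x)   # always true; there is no way to construct Complete without x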
It's a bit like saying "Julia doesn't have null references, unless a type opts out of this". In other words, it does have null references. I mean, there's a reason the language has an isdefined function…
And it's not just structs with "bad" inner constructors; every object reference can be null: (Edit: the example below is misleading, see @StefanKarpinski's message below)
mutable struct A
    x
end

function f(a::A)
    # No need to check isdefined(a, :x), this cannot throw, right?
    return a.x
end
julia> f(Array{A}(undef, 3)[1])
ERROR: UndefRefError: access to undefined reference
You might say "that's different! Here it's a bug in the code, one should never access an element of an undef-constructed array before initialization!" Well, of course it's a bug. Same as all these C/C++ bugs we're talking about: someone accessed a field that was not properly initialized.
So yes, it should never happen. But that's exactly the point of the billion dollar mistake: instead of "should not", the type system should guarantee "cannot". And here it doesn't.
Again, I agree that things in Julia are much better than most languages with null: there's a strong sense that #undef shouldn't be part of any API. I think it works well in practice, to the point that I never bother checking for #undef. The compiler, however, must still add a check…
Your example is a bit off: the undefined access happens upon array access, not inside of f. There is no way for f to ever get an "undef" instance of A, nor is there any way for an instance of A to ever have an undefined x field. If you separate out the array access and the call to f you can see this:
julia> a = Array{A}(undef, 3)[1]
ERROR: UndefRefError: access to undefined reference
It never gets to the call to f, and there is actually no need in f to check whether a.x is defined; it has to be. You can even see this in the generated code:
julia> code_native(f, Tuple{A})
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 12, 0
.globl _julia_f_328 ## -- Begin function julia_f_328
.p2align 4, 0x90
_julia_f_328: ## @julia_f_328
; ┌ @ REPL[8]:1 within `f`
.cfi_startproc
## %bb.0: ## %top
; │ @ REPL[8]:3 within `f`
; │┌ @ Base.jl:38 within `getproperty`
movq (%rdi), %rax
; │└
retq
.cfi_endproc
; └
## -- End function
.subsections_via_symbols
There is no null check, just an unconditional field load.
Saying that "every object reference can be null" is simply incorrect. In fact, no object reference can ever be null/undef: as soon as you have a reference to an object, you know it must be defined. Locations can be undefined, and you can get an error when you try to load a reference from that location, but this is a very different thing. In other words, x.f or x[i] can be undefined, but x cannot ever be undefined. That's a significant difference, and I'm not sure why you're trying to insist that it's no difference at all. And of course, x.f can only be undefined in a struct if the struct specifically opts into that possibility.
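isassigned makes the location-versus-value distinction concrete:

julia> v = Vector{String}(undef, 1);

julia> isassigned(v, 1)   # the *location* v[1] is undefined
false

julia> s = v[1]           # so an undefined *value* can never be obtained
ERROR: UndefRefError: access to undefined reference

julia> (v[1] = "hi"; isassigned(v, 1))
true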
But notice something else: you're still at the REPL, your environment is still alive. You could have had a try/catch/finally block around it. You're not back in the shell with a Segmentation fault (core dumped).
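For example, reusing the A from above, the failure is an ordinary exception that can be handled:

julia> try
           Array{A}(undef, 3)[1]
       catch err
           err isa UndefRefError   # recover instead of crashing the process
       end
true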