Can you tell me a little bit more about how this new type works? What would you like it to do? Is it basically missing with different default behavior?
Here’s some code to start with along the lines of what I provided above.
Setup code for `MyMissing` and `MyMissingDataFrame`
One of the issues with adding a new missing type is that the current missing is an integral part of three-valued logic (3VL) in Julia. Here’s the 3VL truth table for |:
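For reference, the rows of that table can be reproduced in a Julia session. This is a small sketch using only Base; the names `vals` and `table` are made up for illustration.

```julia
# Three-valued logic for `|` as implemented in Base.
vals = (true, false, missing)

# Build the full 3x3 table as (a, b, a | b) tuples.
table = [(a, b, a | b) for a in vals, b in vals]

# Print one row per combination of operands.
for (a, b, r) in table
    println(rpad(string(a), 8), "| ", rpad(string(b), 8), "= ", r)
end
```

Note that `missing` only shows up in the result where the answer really depends on the unknown operand: `true | missing` is `true`, while `false | missing` is `missing`.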
If we add newmissing, that truth table will explode in size—we would now be dealing with four-valued logic (4VL). That sounds like a nightmare to me.
To clarify, the three values referred to in 3VL are three logical values: true, false, and unknown. Julia follows SQL and R and cheats by representing unknown as missing. (This has some subtle issues, since unknown == unknown should return true, but missing == missing returns missing.) A four-valued logic would have a fourth logical value, like newunknown. But continuing with the missing tradition, a 4VL logical vector would be of type Union{Bool, Missing, NewMissing}.
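The equality subtlety mentioned in the parenthetical is easy to check. A quick sketch, using only Base functions (the variable names are just for illustration):

```julia
# `==` propagates missing, so it cannot answer "are both values missing?"
eq = missing == missing            # evaluates to missing, not true

# `isequal` and `===` treat missing as an ordinary value.
same = isequal(missing, missing)   # true
ident = missing === missing        # true
```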
For instance, these are some functions I’d apply for the types of missing I have in mind.
Ideally, my code would be more readable if I could directly use expressions like x .> 0 instead. In addition, functions like sum, mean, cor, etc. should also ignore the type of missing.
In any case, the implementation may be so complicated that it’d be too hard to do. I imagine there must be lots of special cases that should be thought of, unintended consequences I’m not considering, breaking composability with other packages, etc.
Imho, there is not much difference between using a type providing additional methods vs shadowing existing methods. In both cases, the underlying problem is that there are two reasonable default choices of how missing values are to be treated and we need some way to configure code to select one of them:
# Pick any of these methods to configure your analysis
# 1. Import shadowing methods
using Missings.SkipMissing: mean
# 2. Convert data to new type
df = MyMissingDataFrame(df)
...
# some analysis much later in the file
mean(df.whatever)
In any case, functions with the same name are configured to behave differently. More local approaches are also conceivable, i.e., using macros or blocked definitions as syntactic markers:
let mean = mean ∘ skipmissing
    ...
    mean(df.whatever)
end
# or macro based
@with_missing_skipped begin
    ...
    mean(df.whatever)
end
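The `let` variant can be tried as-is on a plain vector. A minimal sketch, assuming the Statistics standard library; the vector `x` and the result names are made up for illustration, and the macro variant is not shown because `@with_missing_skipped` is hypothetical.

```julia
using Statistics

x = [1.0, missing, 3.0]

# Inside the block, `mean` is a local binding that skips missing values.
local_result = let mean = mean ∘ skipmissing
    mean(x)
end

# Outside the block, the usual propagating behavior is unchanged.
global_result = mean(x)
```

Here `local_result` is 2.0 while `global_result` is still `missing`, so the changed default stays confined to the block.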
In the end, it’s mainly about defaults and how easily – also in terms of readability – these can be changed.
Yes, when I said we should wait for Julia 2.0, I was trying to convey that a new type of missing may be too complex and may affect the fundamentals of Julia. Maybe there’s no clean solution, which would explain why Python and other languages each had to make a decision about it, and why none of them covers all cases.
I don’t see what you mean. 3VL is not a concept in Julia. Missing is just some type, we can introduce another type that behaves similarly. Why would we need 4VL instead of just Union{Bool, NewMissing}?
This is not too hard actually. If I understand your code, the binary operator always returns false if a missing is present, right?
Also, I think you could have written these more succinctly.
ishigheq(a, b) = !ismissing(a) && a >= b
ishigher(a, b) = !ismissing(a) && a > b
islower(a, b) = !ismissing(a) && a < b
isloweq(a, b) = !ismissing(a) && a <= b
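As a quick check of how these behave when broadcast over a column, here is a small sketch. Two of the definitions are repeated so the snippet runs on its own; the vector `x` is made up.

```julia
# Definitions from the post above, repeated so the snippet is self-contained.
ishigher(a, b) = !ismissing(a) && a > b
isloweq(a, b) = !ismissing(a) && a <= b

x = [1, missing, 3]

# Missing entries of x compare as false in both directions.
higher = ishigher.(x, 2)   # [false, false, true]
loweq = isloweq.(x, 2)     # [true, false, false]
```

Since the missing entry is false in both results, `isloweq` is not the elementwise negation of `ishigher`.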
Logical (or boolean) operators |, & and xor are another special case since they only propagate missing values when it is logically required. For these operators, whether or not the result is uncertain, depends on the particular operation. This follows the well-established rules of three-valued logic which are implemented by e.g. NULL in SQL and NA in R.
In other words, the correct propagation of missing has already been defined for the logical operators |, &, xor, and !. Presumably we would also need to expand the truth tables to include newmissing, if it was introduced.
I’m trying really, really hard to disabuse you of this notion. Julia was built to do exactly this. Fundamental to the design of Julia 1 is the ability to extend operators to new types. We do not need a Julia 2 to do this.
This is not required. You can simply require that the user choose one or the other, and never define any methods for mixing the two. The idea being that the type of missing is injected into the data early on (preferably when reading the file) and never changed.
I don’t think we need to define Base.:|(::Missing, ::NewMissing) etc. The answer isn’t obvious and people should specify what they mean when they’re mixing these together.
I’m saying the user should decide what should happen when combining Missing with NewMissing. But normally my data won’t have NewMissing and @adienes’s data won’t have Missing. It’s usually one or the other, so mixing them will be less common.
I think it is important not to pile too many features onto any of the proposals to keep them actionable. The main complaint has been that default behavior should be different, so in my view one first-order solution is a package which provides a type that is a simple replacement of Base.missing and has all the desired semantics. Basically, do in a package what you would have done in Base if you had been in charge before Julia v1. There is nothing special about things defined in Base.
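To make that concrete, here is a rough sketch of what such a package type could look like. Everything here is an assumption for illustration: `NewMissing` and `newmissing` are hypothetical names, and the "comparisons are false" semantics are just one reading of what was asked for earlier in the thread.

```julia
# Hypothetical stand-in for Base.missing; not part of any existing package.
struct NewMissing end
const newmissing = NewMissing()

# Assumed semantics: comparisons involving newmissing are simply false.
# Defining `<` is enough here, because Base's generic `>`, `<=` and `>=`
# fall back to `<` and `==`.
Base.:<(::NewMissing, ::Number) = false
Base.:<(::Number, ::NewMissing) = false

x = Union{Int, NewMissing}[1, newmissing, -2]
positive = x .> 0

# No methods are defined for mixing the two kinds of missing, so the
# user has to say what they mean; this call throws a MethodError.
mixing_fails = try
    missing | newmissing
    false
catch err
    err isa MethodError
end
```

With these two methods, `x .> 0` gives `[true, false, false]` without any wrapper functions. Logical operators, arithmetic, and aggregations would each need their own methods, which is where the real design work would be.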
It sounds like one of the main goals of newmissing would be to make mean(x) and sum(x) “just work”. But in order for these functions to “just work”, they would need special handling of newmissing. For example, a naive implementation of mean might look like this:
function mean(itr)
    sum = 0
    n = 0
    for x in itr
        if x !== newmissing
            sum += x
            n += 1
        end
    end
    sum/n
end
Thus, only some aggregation functions would “just work”. Other aggregation functions that do not have special handling for newmissing would not work. This is basically the same situation as defining smean, ssum, etc, but at least with smean we are not defining a new missing data type.
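For comparison, the wrapper approach is only a couple of lines. A minimal sketch, assuming the Statistics standard library; `smean` and `ssum` are the names used in this thread, not functions from an existing package.

```julia
using Statistics

# Thin wrappers around existing functions; no new missing type is involved.
smean(x) = mean(skipmissing(x))
ssum(x) = sum(skipmissing(x))

x = [1, missing, 3]

m = smean(x)                     # 2.0
s = ssum(x)                      # 4
# Aggregations without a wrapper still compose with skipmissing directly.
top = maximum(skipmissing(x))    # 3
```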
I’m not so sure a new type will help much, because after all CSV.jl or lag will fill empty values with missing in a way I cannot control, so I’d have to call coalesce.(input, newmissing) everywhere anyway.
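A sketch of what that conversion step would look like, assuming a hypothetical `NewMissing` singleton type. One detail that is easy to miss: a custom type has to opt in to being treated as a scalar in broadcasting, otherwise the dotted call fails.

```julia
# Hypothetical type, defined here only so the snippet runs on its own.
struct NewMissing end
const newmissing = NewMissing()

# Treat newmissing as a scalar in broadcast expressions (Base does the
# same for missing); without this, broadcasting tries to iterate it.
Base.broadcastable(x::NewMissing) = Ref(x)

# Data as it might arrive from a reader that fills gaps with missing.
input = [1, missing, 3]

# Replace every missing with newmissing.
converted = coalesce.(input, newmissing)
```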
That’s exactly one of the features I was thinking of. It leads me to think that the implementation of a new type is too complex. It’ll probably also break composability with other packages.
And the other possible solutions look more like patches, not actually reflecting a structural solution.
Is it me, or does this not do anything like skipmissing? For one, it only short-circuits the comparison to a default false if a is missing; a missing b still propagates. For another, it’s a scalar operation, whereas skipmissing removes missing elements from collections.
Creating data indistinguishable from real data for missing a and treating missing b normally is too specific, it just doesn’t seem like something that should be handled by a type, macro, or derivative functions with such a blanket setting. This could instead be accomplished by processing a collection of a values (!ismissing).(A) before doing anything with b, perhaps even lazily by some value substitution wrapper.
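Roughly what I mean, as a sketch with made-up data (`A`, `b`, and the result names are just for illustration; `ishigher` is repeated from earlier in the thread):

```julia
A = [1, missing, 3]
b = 2

# Mask of the entries of A that carry real data.
present = (!ismissing).(A)

# Lazy alternative: skip missing entries of A before comparing with b.
compared = [a > b for a in skipmissing(A)]

# A missing b still propagates through the scalar helper.
ishigher(a, b) = !ismissing(a) && a > b
propagated = ishigher(1, missing)
```

Here `present` is `[true, false, true]`, `compared` has only two elements because the missing entry is dropped rather than replaced by false, and `propagated` is `missing`.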
Should be !ismissing(a) because abool == 0 inverts abool.