A few questions on Julia's missing values, and how they compare to Python and R

I have read a lot about how Julia handles missing values, but what’s still not clear to me is how this compares (and the pros and cons) with how Python and R treat them. Could someone help a total newbie to Julia shed some light?

  1. A big bugbear of mine used to be that pandas didn’t support nullable ints, only nullable floats. This has now been addressed, but the fix is not ideal, as it has introduced differences between pandas.NA and numpy.nan. Is there any comparable confusion in Julia?

  2. Another bugbear used to be that pandas would remove records in group-bys when grouping by a null variable. Say the field “city” contains “New York” 5 times plus 3 null records: a count by city would then show only “New York: 5”. This has been addressed in pandas with the dropna argument, but maybe not in R (not sure). How does Julia handle it?

  3. What are the advantages of having a ‘missing’ which is different from NaN? Is it that it lets you distinguish between data which is missing (e.g. not collected, not known at all) and data which is the result of an invalid calculation? Or is there something else?

  4. Can you think of any other meaningful differences among the 3 languages when it comes to missing values?

I think Julia has missing because it works with any type, and generic algorithms are very important in Julia. NaN, for example, is a special float value, and not every data type has something like it. Float64 has a couple of “unused” bit patterns, and R uses one of them as NA. Integers don’t have that: every bit pattern is a valid integer, so R simply uses the largest negative one as NA, reasoning that users care more about the convenience of having NA than about having the full integer range available. You can see that such approaches are a bit “hacky” and geared more towards convenience than towards correctness or following standards: for every type where you want to support NA, you have to come up with some non-standard interpretation of a special value of that type.

In Julia, arrays where missing values can appear are typed Vector{Union{Missing, SomeType}}, and whether or not an element is missing is stored in a separate vector of type tags. This way it works with any type, but it also increases memory use a bit; it’s a tradeoff in that regard. On the other hand, if you have a Vector{Int} in Julia, you know there can’t possibly be a missing value in it, while in R you always have to check. That can be a speed disadvantage for R in some scenarios.
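A minimal sketch of how this looks at the REPL (variable names are my own, for illustration):

```julia
# A plain Vector{Int} can never hold a missing value:
v = [1, 2, 3]
eltype(v)                 # Int64 (on 64-bit systems)

# Allowing missings widens the element type to a Union:
w = [1, missing, 3]
eltype(w)                 # Union{Missing, Int64}

# missing propagates through arithmetic; skipmissing drops it
# when you want an aggregate over the observed values only:
sum(skipmissing(w))       # 4
```

Note that converting back is explicit: `collect(skipmissing(w))` gives a plain `Vector{Int}` again, restoring the "no missings possible" guarantee.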


Using NaN for missing means you’ve ceded all control over the semantics of your missing-value system to the existing semantics of floating-point numbers. That’s why (1) occurred in pandas: they had to coerce ints to floating point to make use of NaNs, since they had committed to floating point as a core part of their representation of missing values. R also reused part of the NaN bit space to represent NA, and that bit them once Apple built its own chips: Will R Work on Apple Silicon? - The R Blog
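The semantic difference is easy to see side by side. NaN follows IEEE floating-point rules, while missing has its own three-valued logic:

```julia
# NaN is a Float64 value with IEEE semantics:
NaN == NaN              # false: NaN compares unequal to everything
NaN + 1                 # NaN: propagates through float arithmetic only

# missing is its own type with three-valued logic:
missing == 1            # missing, not false: the answer is unknown
missing + 1             # missing: propagates through any type
true & missing          # missing: three-valued Boolean logic
ismissing(missing)      # true: the reliable way to test for it
coalesce(missing, 0)    # 0: replace missing with a default
```

So `missing == x` saying "I don't know" rather than "no" is exactly the distinction point (3) asks about: an unknown value versus a known-invalid numeric result.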


To your points (1) and (2):

  1. There is definitely potential for this type of confusion, as packages are free to define their own missing-like types. In practice I never encounter problems: missing is the de facto standard used in most places. The exception is probably the Queryverse, which I think has a different way of representing missings.

  2. Again this could depend on the packages used, but the de facto standard comparator to pandas here would be DataFrames, where you get:

julia> using DataFrames

julia> df = DataFrame(a = ["NY", "NY", "NY", missing, missing], b = rand(5));

julia> combine(groupby(df, :a), nrow, :b => sum)
2×3 DataFrame
 Row │ a        nrow   b_sum
     │ String?  Int64  Float64
─────┼────────────────────────
   1 │ NY           3  1.31219
   2 │ missing      2  1.7638
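So missing forms its own group by default. If you want the pandas-style behavior of dropping missing keys, groupby takes a skipmissing keyword argument:

```julia
using DataFrames

df = DataFrame(a = ["NY", "NY", "NY", missing, missing], b = 1:5)

# Default: rows with a missing key form their own group
combine(groupby(df, :a), nrow)                     # 2 groups

# skipmissing=true drops rows whose grouping key is missing,
# analogous to pandas' (historical default) dropna=True:
combine(groupby(df, :a; skipmissing=true), nrow)   # 1 group ("NY" only)
```

This makes the choice explicit at the call site rather than a global default, so counts are never silently dropped.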