Trapped in an int -> missing -> float loop

Let’s say that I have a dataframe like

using DataFrames
df = DataFrame(a=[1, 4, missing], b=[1, 4, 5], c=[1., 2, 3])
3×3 DataFrame
 Row │ a        b      c
     │ Int64?   Int64  Float64
─────┼─────────────────────────
   1 │       1      1      1.0
   2 │       4      4      2.0
   3 │ missing      5      3.0

and I want to extract all the data as a matrix of doubles, but I can’t.

Matrix{Float64}(df)
ERROR: ArgumentError: cannot convert a DataFrame containing missing values to Matrix{Float64} (found for column a)

The problem seems to be related to the fact that one cannot (or rather, I cannot find a way to) do either of these:

x = [1, 2, missing];
x[3] = NaN
ERROR: InexactError: Int64(NaN)
 Float64.(x)
ERROR: MethodError: no method matching Float64(::Missing)

So we can’t replace the missing with a float because the vector is of integer type, and we can’t convert to float because of the missing.

Is there any way out of this trap (other than making column copies and loop-with-ifs my way out of this)?

Is Matrix(df) not working?

Sorry, forgot a detail. I need to get rid of the missings, and Matrix(df) keeps them:

Matrix(df)
3×3 Matrix{Union{Missing, Float64}}:
 1.0       1.0  1.0
 4.0       4.0  2.0
  missing  5.0  3.0

Use coalesce.

Hmm, how?

coalesce(Matrix(df), missing)
3×3 Matrix{Union{Missing, Float64}}:
 1.0       1.0  1.0
 4.0       4.0  2.0
  missing  5.0  3.0

And again, sorry, that was not yet the full info: I need to replace the missings with NaN because the result is intended to be sent to a C library.

OK, contrived, but I can do

replace!(Matrix(df), missing => NaN)
3×3 Matrix{Union{Missing, Float64}}:
   1.0  1.0  1.0
   4.0  4.0  2.0
 NaN    5.0  3.0
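
A side note on the element type: `replace!` works in place, so the array keeps its `Union{Missing, Float64}` element type even after every missing is gone. As far as I know, the non-mutating `replace` allocates a new array and narrows the element type. A minimal sketch in plain Julia, where `A` is only a stand-in for `Matrix(df)`:

```julia
# Stand-in for Matrix(df): a matrix whose element type still allows missing.
A = Union{Missing, Float64}[1.0 1.0 1.0; 4.0 4.0 2.0; missing 5.0 3.0]

# In-place replacement cannot change the array's element type.
inplace = replace!(copy(A), missing => NaN)

# Non-mutating replacement builds a new array with a narrowed element type.
narrowed = replace(A, missing => NaN)

println(eltype(inplace))
println(eltype(narrowed))
```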

Try
coalesce.(df, NaN)
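
The dot is what makes the difference. Plain `coalesce` looks at its arguments as whole values, and an array is never `missing`, so the array comes back untouched; the broadcast form applies `coalesce` to each element. A small sketch with a made-up vector `v`:

```julia
v = Union{Missing, Float64}[1.0, 4.0, missing]

whole = coalesce(v, NaN)          # v itself is not missing, so v is returned as-is
elementwise = coalesce.(v, NaN)   # coalesce applied to each element in turn

println(whole === v)
println(elementwise)
```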

Thanks, that’s better, as the result is directly a plain matrix:

Matrix(coalesce.(df, NaN))
3×3 Matrix{Float64}:
   1.0  1.0  1.0
   4.0  4.0  2.0
 NaN    5.0  3.0
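
For the plain vector from the original question the same idea applies, with one wrinkle: the non-missing entries are Int64, so `coalesce.(x, NaN)` on its own mixes integers with a float, and I believe the result can end up with an abstract element type. Wrapping it in `Float64.(...)` is one way to force a dense `Vector{Float64}`, which is what a C library expecting doubles would need. A minimal sketch:

```julia
x = [1, 2, missing]                 # Vector{Union{Missing, Int64}}

# The two dotted calls fuse into one pass: Float64(coalesce(xi, NaN)) per element.
y = Float64.(coalesce.(x, NaN))

println(typeof(y))
println(y)
```

On the DataFrame route, asking for the type explicitly with `Matrix{Float64}(coalesce.(df, NaN))` should give the same guarantee.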

This is why I’ve said that “missing” is an integer.

Whereas NaN is often used as the missing value for floating point.

They represent the same concept but don’t interoperate.

I don’t know if it would be useful or harmful to define conversions like:

Float64(missing) == NaN
(and similar for NaN64, NaN32, NaN16)

Int64(NaN) == missing
(and similar for Int32, Int16)

isnan(missing) == true

ismissing(NaN) == true
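
For comparison, here is what Base does today for the last two lines, next to a pair of explicit helpers that give the proposed behaviour without adding methods to `Float64` or `Int64`. The names `missing_to_nan` and `nan_to_missing` are made up for this sketch; they are not part of Base or any package.

```julia
# Current behaviour in Base: neither predicate treats the other sentinel as its own.
println(isnan(missing))      # propagates: the answer is itself missing
println(ismissing(NaN))      # NaN is an ordinary Float64, so this is false

# Hypothetical helpers (names invented for this sketch).
missing_to_nan(x) = ismissing(x) ? NaN : Float64(x)
nan_to_missing(x::AbstractFloat) = isnan(x) ? missing : x

v = [1, 2, missing]
w = missing_to_nan.(v)       # missing becomes NaN, integers become floats
back = nan_to_missing.(w)    # NaN becomes missing again

println(w)
println(back)
```

Keeping the conversion in named functions makes the lossy step visible at the call site, which a silent `Float64(missing) == NaN` would not.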

Interesting idea. I started this comment disagreeing with you, but as I was typing I realized I was wrong.

NaN and missing are indeed the same concept.

NaN is the result of 0/0 and Inf/Inf. In real life we’d solve such problems using L’Hôpital’s rule and get the numeric value, but the computer doesn’t always get to do this so the numeric value is unknown.

NaN is a misnomer: not a number. But it actually is a number; just a number whose value we don’t know.

Which is exactly what missing means when used in numeric contexts.

Probably not helpful to rehash this here, but there’s been a lot of discussion on this, both when missing first came around and since then, and the reason it exists is precisely that it is meant to be distinct from NaN (and nothing). Here’s an old SO question:

But if you look around you’ll certainly find loads more.
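
A few quick checks show where the two diverge in practice. This is a sketch using only Base, and the sample values in `data` are made up:

```julia
# NaN is a Float64 value; comparing it yields an ordinary Bool.
println(typeof(NaN))
println(NaN == NaN)                  # false, per IEEE 754

# missing has its own type, and comparisons propagate it (three-valued logic).
println(typeof(missing))
println(missing == missing)          # missing, not true or false
println(isequal(missing, missing))

# Skipping one sentinel does not skip the other.
data = [1.0, missing, NaN, 3.0]
println(sum(skipmissing(data)))      # the NaN still poisons the sum
println(sum(filter(!isnan, collect(skipmissing(data)))))
```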
