How can I return the number of missings / uniques of a numerical type column?

Hi how’s it going?

I’m trying to filter columns based on their number of unique values and number of missing values, and I just noticed that when calling describe on a DataFrame, the nunique and nmissing statistics don’t seem to work for numerical columns.

What’s the best solution to this?

using DataFrames
df = DataFrame(:x=>[1.0,2.0,3.0],:y=>["1","2",missing])
describe(df)[!,[:variable,:nunique,:nmissing,:eltype]]

returns

I don’t think the problem is that it doesn’t work on numeric columns. Rather, it displays nothing instead of zero when the column has no missing values, or when the column type does not allow missing values. What do you get in the two cases below?

df = DataFrame(:x=>[1.0,2.0,missing],:y=>["1","2",missing])
describe(df)[!,[:variable,:nunique,:nmissing,:eltype]]

and

df = DataFrame(:x=>Union{Float64,Missing}[1.0,2.0,3.0],:y=>["1","2",missing])
describe(df)[!,[:variable,:nunique,:nmissing,:eltype]]

In both examples, the nunique column shows up empty, and the nmissing column shows up as 1 and 0, respectively. How come it doesn’t return the number of unique float values?

I am not sure of the rationale, but it is clearly intended behavior. Probably because it leads to confusion when very similar values (often displayed with the same string) are considered different. If you want the number of unique values for a <: Real column, you probably have to compute it yourself with the epsilon most appropriate for your case, instead of relying on code that is blind to orders of magnitude.
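As a rough illustration of counting "unique" floats with a tolerance, here is a minimal sketch. The function name nunique_tol and the default tol are made up for this example and are not part of DataFrames. It sorts the non-missing values and starts a new group whenever the gap to the previous value exceeds tol, so a long chain of closely spaced values collapses into one group.

```julia
# Hypothetical helper (not a DataFrames function): count values as distinct
# only when they differ from the previous sorted value by more than `tol`.
function nunique_tol(x; tol = 1e-8)
    vals = sort!(collect(skipmissing(x)))   # drop missings, sort a copy
    isempty(vals) && return 0
    n = 1
    for i in 2:length(vals)
        if vals[i] - vals[i - 1] > tol
            n += 1                          # gap larger than tol: new group
        end
    end
    return n
end

x = [1.0, 1.0 + 1e-12, 2.0, missing]
n_exact = length(unique(skipmissing(x)))    # exact comparison
n_tol   = nunique_tol(x)                    # tolerance-based comparison
```

Here n_exact is 3 and n_tol is 2, because 1.0 and 1.0 + 1e-12 fall within the tolerance. Whether an absolute tolerance like this or a relative one is appropriate depends on the scale of your data.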

Read the documentation for describe with ?describe.

nunique is nothing for <: Real because in the vast majority of cases where columns are of type <: Real this is an expensive operation and we want describe to be fast. There is an issue for this here.

nmissing is nothing when the column doesn’t allow missing values because we want to distinguish “allows missing values but there are none of them” from “does not allow missing values”. This is covered in the documentation from ?describe.
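The distinction between the two cases can also be checked directly on a column, without describe. A small sketch using only Base functions (allows_missing is a name chosen for this example):

```julia
# Hypothetical helper: does the column's element type permit `missing`?
allows_missing(col) = Missing <: eltype(col)

a = [1.0, 2.0, 3.0]                          # Vector{Float64}
b = Union{Float64, Missing}[1.0, 2.0, 3.0]   # allows missing, contains none

a_allows = allows_missing(a)       # false: the type cannot hold missing
b_allows = allows_missing(b)       # true
b_count  = count(ismissing, b)     # 0: allowed, but none present
```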

Why are you using describe for this task? Why not use countmap, length(unique(x)), or count(ismissing, x) to perform this operation?
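For the original goal of filtering columns, something along these lines might work. It is only a sketch: the helper names nunique and nmiss and the threshold min_unique are chosen for illustration, and the rule (no missing values, at least two distinct values) is an assumption about what you want to keep.

```julia
using DataFrames

df = DataFrame(:x => [1.0, 2.0, 3.0], :y => ["1", "2", missing])

# Illustrative helpers built from Base functions
nunique(col) = length(unique(skipmissing(col)))   # distinct non-missing values
nmiss(col)   = count(ismissing, col)              # number of missing entries

# Per-column summary, computed for every column regardless of its type
stats = DataFrame(
    variable = names(df),
    nunique  = [nunique(col) for col in eachcol(df)],
    nmissing = [nmiss(col) for col in eachcol(df)],
)

# Assumed filter rule: no missing values and at least `min_unique` distinct values
min_unique = 2
keep = [n for n in names(df)
        if nmiss(df[!, n]) == 0 && nunique(df[!, n]) >= min_unique]
filtered = df[:, keep]
```

With this df, only x is kept, since y contains a missing value. Note that unique on a large floating-point column can be slow, which is the cost the reply above refers to.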
