How can I return the number of missings / uniques of a numerical type column?

Julia1 · June 19, 2020, 1:25pm

Hi, how’s it going?

I’m trying to filter columns based on their number of unique values and number of missing values, and I just noticed that calling describe() on a DataFrame doesn’t report these for numerical columns.

What’s the best solution to this?

using DataFrames

df = DataFrame(:x=>[1.0,2.0,3.0],:y=>["1","2",missing])
describe(df)[!,[:variable,:nunique,:nmissing,:eltype]]

returns

Henrique_Becker · June 19, 2020, 1:44pm

I don’t think the problem is that it fails on numeric columns. Rather, it displays nothing instead of zero when the column has no missing values, or when the column type does not allow missing values. What do you get in the two cases below?

df = DataFrame(:x=>[1.0,2.0,missing],:y=>["1","2",missing])
describe(df)[!,[:variable,:nunique,:nmissing,:eltype]]

and

df = DataFrame(:x=>Union{Float64,Missing}[1.0,2.0,3.0],:y=>["1","2",missing])
describe(df)[!,[:variable,:nunique,:nmissing,:eltype]]

Julia1 · June 19, 2020, 1:54pm

In both examples, the nunique column shows up empty, and the nmissing column shows up as 1 and 0, respectively. How come it doesn’t return the number of unique float values?

Henrique_Becker · June 19, 2020, 2:11pm

I am not sure of the rationale, but it is clear that it is intended behavior. Probably because it leads to confusion when very similar values (often displayed with the same string) are considered different. If you want to get the number of unique values for a <: Real column you probably have to do it yourself with the epsilon most adequate for your case, instead of relying on code that is blind to orders of magnitude.
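As a rough illustration of counting unique values "with an epsilon", here is a minimal sketch. The function name nunique_approx and the default atol are hypothetical, not part of DataFrames.jl, and the grouping rule (a value starts a new group when it is more than atol above the start of the current group) is just one possible choice. NaN values are not handled specially.

```julia
# Hypothetical helper: count distinct values in `col`, treating values that lie
# within `atol` of the start of the current group as the same value.
# Missing values are skipped; NaN is not handled specially.
function nunique_approx(col; atol = 1e-8)
    vals = sort!(collect(skipmissing(col)))
    isempty(vals) && return 0
    n = 1
    anchor = vals[1]          # first value of the current group
    for v in vals
        if v - anchor > atol  # far enough from the group start: new group
            n += 1
            anchor = v
        end
    end
    return n
end

exact  = length(unique([0.1 + 0.2, 0.3]))   # exact comparison sees two values
approx = nunique_approx([0.1 + 0.2, 0.3])   # tolerance merges them into one
```

The tolerance has to be chosen per column, since a value that is negligible for data on the order of 1e6 may be significant for data on the order of 1e-6.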

pdeffebach · June 19, 2020, 2:56pm

Read the documentation for describe with ?describe.

nunique is nothing for <: Real because in the vast majority of cases where columns are of type <: Real this is an expensive operation and we want describe to be fast. There is an issue for this here.

nmissing is nothing when the column doesn’t allow missing values because we want to distinguish “allows missing values but there are none of them” from “does not allow missing values”. This is covered in the documentation from ?describe.
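The distinction described above can be reproduced by hand from the column's element type. This is only a sketch of the idea, not the code describe uses; allows_missing and nmissing_or_nothing are hypothetical names.

```julia
# Hypothetical helpers mirroring the distinction described above.

# True when the column's element type can hold `missing`.
allows_missing(col) = Missing <: eltype(col)

# `nothing` when the column cannot hold missings, otherwise the count.
nmissing_or_nothing(col) = allows_missing(col) ? count(ismissing, col) : nothing

a = [1.0, 2.0, 3.0]                         # Vector{Float64}: missings not allowed
b = Union{Float64, Missing}[1.0, 2.0, 3.0]  # missings allowed, none present
c = [1.0, 2.0, missing]                     # missings allowed, one present

results = (nmissing_or_nothing(a), nmissing_or_nothing(b), nmissing_or_nothing(c))
```

Here a yields nothing, while b and c yield 0 and 1, matching the two cases tested earlier in the thread.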

Why are you using describe for this task? Why not use countmap, length(unique(x)), or count(ismissing, x) to perform this operation?
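For the original filtering task, a minimal sketch along these lines might look as follows. The helper names nunique and nmissing, the stats table, and the thresholds are all illustrative assumptions; only unique, skipmissing, count, ismissing, eachcol, names, and select come from Base or DataFrames.jl.

```julia
using DataFrames

df = DataFrame(:x => [1.0, 2.0, 3.0], :y => ["1", "2", missing])

# Hypothetical helpers; they work for any element type, including <: Real.
nunique(col)  = length(unique(skipmissing(col)))  # distinct non-missing values
nmissing(col) = count(ismissing, col)             # number of missing entries

# One row per column of df.
stats = DataFrame(
    variable = names(df),
    nunique  = [nunique(col) for col in eachcol(df)],
    nmissing = [nmissing(col) for col in eachcol(df)],
)

# Example filter with made-up thresholds: keep columns that have
# at least 3 distinct values and no missing entries.
keep = stats.variable[(stats.nunique .>= 3) .& (stats.nmissing .== 0)]
filtered = select(df, keep)
```

Note that unique on a large floating-point column can be slow, which is the cost concern mentioned above for describe.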
