How to single out valid data from DataArray?

jonalm · June 7, 2017, 2:34pm

Hi there,

I am working with DataArrays and would like to calculate the mean of the DataArray contents while ignoring NA, NaN, and Inf values. This seems like a basic use case, so I apologize in advance for failing to find it in the documentation. My solution is to build a bit mask that marks the valid elements, but this approach (see below) left me puzzled by some weird behavior. Hence my two questions:

How can I best select the valid data from a DataArray?
What is going on in the script below? Why isn’t isfinite(x) & !isna(x) giving consistent results?

julia> using DataFrames
WARNING: Method definition describe(AbstractArray) in module StatsBase at /Users/jon.alm.eriksen/.julia/v0.5/StatsBase/src/scalarstats.jl:573 overwritten in module DataFrames at /Users/jon.alm.eriksen/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:407.

julia> x = @data [1, NA, NaN, Inf]
4-element DataArrays.DataArray{Float64,1}:
   1.0
    NA
 NaN
 Inf

julia> isfinite(x) & !isna(x)
4-element DataArrays.DataArray{Bool,1}:
  true
 false
  true
  true

julia> isfinite(x) & !isna(x)
4-element DataArrays.DataArray{Bool,1}:
  true
 false
 false
  true

julia> isfinite(x) & !isna(x)
4-element DataArrays.DataArray{Bool,1}:
  true
 false
 false
 false

Best
Jon

nalimilan · June 7, 2017, 5:00pm

To skip missing values, you can use dropna as indicated in the DataFrames docs. If you additionally need to skip infinite values, you can do mean(filter(isfinite, dropna(x))) or (to eliminate one copy) mean(Base.Iterators.filter(isfinite, dropna(x))). There are probably more efficient ways but at least this works.
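For readers on Julia 1.x, the same drop-then-filter idea can be sketched with Base only. This is an assumption about a later Julia version, not what the thread used: `missing` stands in for `NA`, `skipmissing` for `dropna`, and `mean` comes from the `Statistics` standard library.

```julia
using Statistics  # provides `mean` on Julia 1.x

# Assumption: `missing` plays the role of the thread's NA
x = [1.0, missing, NaN, Inf]

# skipmissing drops missing values lazily; Iterators.filter then keeps
# only the finite ones, so no intermediate array is allocated
valid = Iterators.filter(isfinite, skipmissing(x))

m = mean(valid)
```

Because both wrappers are lazy, this corresponds to the "eliminate one copy" variant rather than the plain `filter` call.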
As for the inconsistent results, I think you’re hitting this embarrassing issue.

jonalm · June 8, 2017, 12:01pm

Thanks @nalimilan. The filter command works fine if you want the values and don’t care about the bit mask (which perfectly solves the mean problem in the original question). But I can think of many reasons for wanting a bit mask. Consider e.g. a DataFrame where you want to select a subset of rows based on a series of conditions.

Consider the dataframe:

df = DataFrame(A=1:4, B=1:4)
df[1,:A] = NA
df[3,:B] = NA

I want all rows which satisfy:

select = !isna(df[:A]) & (df[:A] .> 2) & (isna(df[:B]) | (df[:B] .> 1))
dfselect = df[select, :]

Is there a workaround for achieving this which isn’t affected by the embarrassing issue?

I’ve tried

filter(row -> !isna(row[:A]) && (row[:A] > 2) && (isna(row[:B]) || (row[:B] > 2)), eachrow(df))

but it outputs an object of type Filter which I don’t know how to handle.

PS:
I believe the issue you’re referring to makes the solution proposed in the Stack Overflow question "Julia DataFrames.jl - filter data with NA's (NAException)" invalid.
Am I correct?

Jon

nalimilan · June 8, 2017, 12:47pm

You can use find rather than filter, it will return the indices and should work around the bug. We should really find a fix.
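As a rough illustration of the index-based approach on Julia 1.x, where `findall` replaced `find`, the row condition from the previous post can be written as a scalar predicate and broadcast over the columns. The plain vectors `A` and `B` and the helper `keep` are hypothetical stand-ins for the DataFrame columns, and `missing` stands in for `NA`.

```julia
# Assumption: plain vectors stand in for df[:A] and df[:B]
A = [missing, 2, 3, 4]
B = [1, 2, missing, 4]

# Hypothetical helper: the short-circuit operators make sure a comparison
# is never evaluated on a missing value
keep(a, b) = !ismissing(a) && a > 2 && (ismissing(b) || b > 1)

# Broadcasting gives a Bool mask; findall turns it into row indices
idx = findall(keep.(A, B))
```

The resulting indices could then be used to index the rows of a table.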

nalimilan · June 8, 2017, 9:25pm

Also, it appears that using the new syntax isfinite.(x) .& .!isna.(x) on Julia 0.6 does not trigger the bug. So if you can start using that version (stable release is coming soon), that’s another solution.
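On Julia 1.x the dot syntax can build the same kind of mask with `missing` in place of `NA`. The sketch below is an assumption about a later version rather than the 0.6 code above, and `is_valid_value` is a hypothetical helper name.

```julia
# Assumption: `missing` plays the role of the thread's NA
x = [1.0, missing, NaN, Inf]

# Hypothetical helper: an element is valid if it is not missing and finite.
# The && short-circuits, so isfinite is never called on missing.
is_valid_value(v) = !ismissing(v) && isfinite(v)

# Broadcast the predicate to get a Bool mask, then index with it
mask = is_valid_value.(x)
valid = x[mask]
```

Keeping the mask as a separate variable makes it reusable for selecting the matching elements of other arrays.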
