How to single out valid data from DataArray?

question

#1

Hi there,

I am working with DataArrays and would like to calculate the mean of a DataArray’s contents while ignoring NA, NaN, and Inf values. This seems like a basic use case, so I apologize in advance for failing to find it in the documentation. My solution is to build a bit mask that indicates the valid elements, but this approach (see below) left me puzzled by some weird behavior. Hence my two questions:

  1. How can I best select the valid data from a DataArray?
  2. What is going on in the script below? Why isn’t isfinite(x) & !isna(x) giving consistent results?

julia> using DataFrames
WARNING: Method definition describe(AbstractArray) in module StatsBase at /Users/jon.alm.eriksen/.julia/v0.5/StatsBase/src/scalarstats.jl:573 overwritten in module DataFrames at /Users/jon.alm.eriksen/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:407.

julia> x = @data [1, NA, NaN, Inf]
4-element DataArrays.DataArray{Float64,1}:
   1.0
    NA
 NaN
 Inf

julia> isfinite(x) & !isna(x)
4-element DataArrays.DataArray{Bool,1}:
  true
 false
  true
  true

julia> isfinite(x) & !isna(x)
4-element DataArrays.DataArray{Bool,1}:
  true
 false
 false
  true

julia> isfinite(x) & !isna(x)
4-element DataArrays.DataArray{Bool,1}:
  true
 false
 false
 false

Best
Jon


#2

  1. To skip missing values, you can use dropna as indicated in the DataFrames docs. If you additionally need to skip infinite values, you can do mean(filter(isfinite, dropna(x))) or (to eliminate one copy) mean(Base.Iterators.filter(isfinite, dropna(x))). There are probably more efficient ways but at least this works.

  2. As for the inconsistent results, I think you’re hitting this embarrassing issue.
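
If a mask is needed rather than just the filtered values, one possible sketch is to build it one element at a time with the scalar isna and isfinite, so that it does not rely on the vectorized & at all. This assumes the same Julia 0.5-era DataFrames/DataArrays setup as in the question; the names mask and valid are only illustrative, and whether this sidesteps the issue mentioned above in every case has not been verified here.

```julia
using DataFrames  # assumption: Julia 0.5-era DataFrames, which re-exports DataArrays

x = @data [1, NA, NaN, Inf]

# Build the mask element by element. `&&` short-circuits, so isfinite
# is never called on an NA entry.
mask = Bool[!isna(v) && isfinite(v) for v in x]

# Keep only the valid entries; dropna returns them as a plain Vector.
valid = dropna(x[mask])

println(mean(valid))
```

The mask is a plain Vector{Bool}, so it can be reused to index other arrays of the same length.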


#3

Thanks @nalimilan. The filter command works fine if you only want the values and don’t care about the bit mask, so it perfectly solves the mean problem in the original question. But I can think of many reasons for wanting a bit mask. Consider e.g. a DataFrame where you want to select a subset of rows based on a series of conditions.

Consider the dataframe:

df = DataFrame(A=1:4, B=1:4)
df[1,:A] = NA
df[3,:B] = NA

I want all rows which satisfy:

select = !isna(df[:A]) & (df[:A] .> 2) & (isna(df[:B]) | (df[:B] .> 1))
dfselect = df[select, :]

Is there a workaround for achieving this which isn’t affected by the embarrassing issue?

I’ve tried

filter(row -> !isna(row[:A]) && (row[:A] > 2) && (isna(row[:B]) || (row[:B] > 2)), eachrow(df))

but it outputs an object of type Filter which I don’t know how to handle.

PS:
I believe the issue you’re referring to makes the solution proposed in this Stack Overflow question invalid: https://stackoverflow.com/questions/31329808/julia-dataframes-jl-filter-data-with-nas-naexception
Am I correct?

Jon


#4

You can use find rather than filter; it returns the indices of the matching rows and should work around the bug. We should really find a fix.
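
A minimal sketch of what that could look like for the DataFrame from post #3, assuming a Julia 0.5-era DataFrames where find accepts a predicate and an iterable. The helper name keep is made up for illustration, and the threshold on B follows the select expression (> 1).

```julia
using DataFrames  # assumption: Julia 0.5-era DataFrames, where NA and isna are available

df = DataFrame(A=1:4, B=1:4)
df[1, :A] = NA
df[3, :B] = NA

# Scalar predicate on a single row. && and || short-circuit, so no
# comparison is ever evaluated on an NA value.
keep(row) = !isna(row[:A]) && row[:A] > 2 && (isna(row[:B]) || row[:B] > 1)

idx = find(keep, eachrow(df))   # indices of the rows that satisfy the predicate
dfselect = df[idx, :]           # select rows explicitly with `idx, :`
```

With this data, rows 3 and 4 satisfy the condition. Note the explicit row indexing df[idx, :], since a single index on a DataFrame selects columns.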


#5

Also, it appears that using the new syntax isfinite.(x) .& .!isna.(x) on Julia 0.6 does not trigger the bug. So if you can start using that version (stable release is coming soon), that’s another solution.