I am working with DataArrays and would like to calculate the mean of a DataArray's contents while ignoring NA, NaN, and Inf values. This seems like a basic use case, so I apologize in advance for failing to find it in the documentation. My solution is to build a bit mask that indicates the valid elements, but this approach (see below) left me puzzled by some weird behavior. Hence my two questions:
How can I best select the valid data from a DataArray?
What is going on in the script below? Why isn’t isfinite(x) & !isna(x) giving consistent results across repeated calls?
julia> using DataFrames
WARNING: Method definition describe(AbstractArray) in module StatsBase at /Users/jon.alm.eriksen/.julia/v0.5/StatsBase/src/scalarstats.jl:573 overwritten in module DataFrames at /Users/jon.alm.eriksen/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:407.
julia> x = @data [1, NA, NaN, Inf]
4-element DataArrays.DataArray{Float64,1}:
1.0
NA
NaN
Inf
julia> isfinite(x) & !isna(x)
4-element DataArrays.DataArray{Bool,1}:
true
false
true
true
julia> isfinite(x) & !isna(x)
4-element DataArrays.DataArray{Bool,1}:
true
false
false
true
julia> isfinite(x) & !isna(x)
4-element DataArrays.DataArray{Bool,1}:
true
false
false
false
To skip missing values, you can use dropna as indicated in the DataFrames docs. If you additionally need to skip infinite values, you can do mean(filter(isfinite, dropna(x))) or (to eliminate one copy) mean(Base.Iterators.filter(isfinite, dropna(x))). There are probably more efficient ways but at least this works.
Thanks @nalimilan. The filter command works fine if you only want the values and don’t care about the bit mask, so it solves the mean problem in the original question perfectly. But I can think of many reasons for wanting a bit mask. Consider e.g. a DataFrame where you want to select a subset of rows based on a series of conditions.
Consider the dataframe:
df = DataFrame(A=1:4, B=1:4)
df[1,:A] = NA
df[3,:B] = NA
Also, it appears that using the new syntax isfinite.(x) .& .!isna.(x) on Julia 0.6 does not trigger the bug. So if you can start using that version (stable release is coming soon), that’s another solution.
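If a bit mask is still needed on Julia 0.5, one possible workaround is to build it element by element, which does not go through the vectorized isfinite and & methods on DataArrays at all. This is only a minimal sketch, assuming the same setup as in the question (DataFrames loaded so that @data, NA, isna and dropna are available); the names mask and valid_mean are just illustrative.

```julia
using DataFrames  # assumed to make @data, NA, isna and dropna available, as in the question

x = @data [1, NA, NaN, Inf]

# Build a plain Array{Bool} one element at a time. `&&` short-circuits,
# so isfinite is never called on an NA element.
mask = Bool[!isna(v) && isfinite(v) for v in x]

# Boolean indexing keeps only the valid elements; dropna then converts
# the result to an ordinary Array before the mean is taken.
valid_mean = mean(dropna(x[mask]))
```

Because mask is an ordinary Bool array, it can be combined with other conditions using & before indexing, which is the use case for wanting a mask rather than just the filtered values.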