How to get ranges of valid data in vectors or dataframes?

Hi,

using DataFrames

df = DataFrame(col1 = [1, NaN, 3, 4, NaN, 6, 7, 8, 9, 10, NaN, NaN], col2 = [missing, 2, 3, 4, 5, 6, missing, 8, 9, 10, 11, 12])
12×2 DataFrame
 Row │ col1     col2    
     │ Float64  Int64?  
─────┼──────────────────
   1 │     1.0  missing 
   2 │   NaN          2
   3 │     3.0        3
   4 │     4.0        4
   5 │   NaN          5
   6 │     6.0        6
   7 │     7.0  missing 
   8 │     8.0        8
   9 │     9.0        9
  10 │    10.0       10
  11 │   NaN         11
  12 │   NaN         12

So for either column, I would like to get the maximum ranges of contiguous valid data, e.g. the lines (indexes or actual lines) range 6 to 10 for column 1.
Or for missings 2 to 6 and 8 to 11 for column 2.
I started to think about running through the lines etc. but I guess that there’s a more native way to do that.

Here would be an example of how I would do it

function max_valid_run(v)
    max_len = 0
    curr_len = 0
    for x in v
        if x !== missing && !isnan(x)
            curr_len += 1
            max_len = max(max_len, curr_len)
        else
            curr_len = 0
        end
    end
    return max_len
end

mapcols(col -> max_valid_run(col), df)
1×2 DataFrame
 Row │ col1   col2  
     │ Int64  Int64 
─────┼──────────────
   1 │     5     5

Probably is a more efficient way of checking the columns but have to pay some care as isnan(missing) gives missing as opposed to false so !isnan(x) && x !== missing would throw a TypeError

1 Like

Same idea, compressed to one liner:

julia> mapcols(c->maximum(diff(findall(
  (x->ifelse(ismissing(x),true,isnan(x))).(c))))-1, df)
1×2 DataFrame
 Row │ col1   col2  
     │ Int64  Int64 
─────┼──────────────
   1 │     5      5
1 Like

The code below outputs vectors of unit ranges as requested:

using DataFrames

function max_valid_ranges(v)
    ix = findall(in((NaN, missing)), v)
    (1 ∉ ix) && pushfirst!(ix, 0)
    n = length(v)
    (n ∉ ix) && append!(ix, n + 1)
    d = diff(ix)
    iy = findall(==(maximum(d)), d)
    return [(ix[i]+1):(ix[i]+d[i]-1) for i in iy]
end

df = DataFrame(col1 = [1, NaN, 3, 4, NaN, 6, 7, 8, 9, 10, NaN, NaN], col2 = [missing, 2, 3, 4, 5, 6, missing, 8, 9, 10, 11, 12])

max_valid_ranges(df.col1)   # [6:10]
max_valid_ranges(df.col2)   # [2:6; 8:12]
2 Likes

Thanks a lot !
I’ve never used Julia’s mathematical syntax, I guess that the time has come !

I assume you’re referring to the .

(1 ∉ ix) reads pretty nice and is my preferred syntax for this by a lot, but note that you can instead write it as !in(1, ix) if you prefer or need to avoid non-ASCII in code.

Any Unicode operator or function name in Julia base language will always have an equivalent pure ASCII version of it that works the same. So the mathematical syntax is really nice to have, but always optional.

to your point,

const ∈ = in
∉(x, itr) = !∈(x, itr)
∉(itr) = Fix2(∉, itr)

these are defined in Base

Not xor(), you must use (sigh…)