How to get ranges of valid data in vectors or dataframes?

Ju_ska · June 18, 2025, 9:49am

Hi,

using DataFrames

df = DataFrame(col1 = [1, NaN, 3, 4, NaN, 6, 7, 8, 9, 10, NaN, NaN], col2 = [missing, 2, 3, 4, 5, 6, missing, 8, 9, 10, 11, 12])

12×2 DataFrame
 Row │ col1     col2    
     │ Float64  Int64?  
─────┼──────────────────
   1 │     1.0  missing 
   2 │   NaN          2
   3 │     3.0        3
   4 │     4.0        4
   5 │   NaN          5
   6 │     6.0        6
   7 │     7.0  missing 
   8 │     8.0        8
   9 │     9.0        9
  10 │    10.0       10
  11 │   NaN         11
  12 │   NaN         12

So for either column, I would like to get the maximum ranges of contiguous valid data, e.g. the lines (indexes or actual lines) range 6 to 10 for column 1.
Or for missings 2 to 6 and 8 to 11 for column 2.
I started to think about running through the lines etc. but I guess that there’s a more native way to do that.

lukketotte · June 18, 2025, 2:26pm

Here would be an example of how I would do it

function max_valid_run(v)
    max_len = 0
    curr_len = 0
    for x in v
        if x !== missing && !isnan(x)
            curr_len += 1
            max_len = max(max_len, curr_len)
        else
            curr_len = 0
        end
    end
    return max_len
end

mapcols(col -> max_valid_run(col), df)

1×2 DataFrame
 Row │ col1   col2  
     │ Int64  Int64 
─────┼──────────────
   1 │     5     5

Probably is a more efficient way of checking the columns but have to pay some care as isnan(missing) gives missing as opposed to false so !isnan(x) && x !== missing would throw a TypeError

Dan · June 19, 2025, 12:15am

Same idea, compressed to one liner:

julia> mapcols(c->maximum(diff(findall(
  (x->ifelse(ismissing(x),true,isnan(x))).(c))))-1, df)
1×2 DataFrame
 Row │ col1   col2  
     │ Int64  Int64 
─────┼──────────────
   1 │     5      5

rafael.guerra · June 19, 2025, 7:38am

The code below outputs vectors of unit ranges as requested:

using DataFrames

function max_valid_ranges(v)
    ix = findall(in((NaN, missing)), v)
    (1 ∉ ix) && pushfirst!(ix, 0)
    n = length(v)
    (n ∉ ix) && append!(ix, n + 1)
    d = diff(ix)
    iy = findall(==(maximum(d)), d)
    return [(ix[i]+1):(ix[i]+d[i]-1) for i in iy]
end

df = DataFrame(col1 = [1, NaN, 3, 4, NaN, 6, 7, 8, 9, 10, NaN, NaN], col2 = [missing, 2, 3, 4, 5, 6, missing, 8, 9, 10, 11, 12])

max_valid_ranges(df.col1)   # [6:10]
max_valid_ranges(df.col2)   # [2:6; 8:12]

Ju_ska · June 23, 2025, 12:55pm

Thanks a lot !
I’ve never used Julia’s mathematical syntax, I guess that the time has come !

digital_carver · June 23, 2025, 3:48pm

I assume you’re referring to the ∉.

(1 ∉ ix) reads pretty nice and is my preferred syntax for this by a lot, but note that you can instead write it as !in(1, ix) if you prefer or need to avoid non-ASCII in code.

Any Unicode operator or function name in Julia base language will always have an equivalent pure ASCII version of it that works the same. So the mathematical syntax is really nice to have, but always optional.

jling · June 23, 2025, 4:13pm

to your point,

const ∈ = in
∉(x, itr) = !∈(x, itr)
∉(itr) = Fix2(∉, itr)

these are defined in Base

devel-chm · June 23, 2025, 6:59pm

Not xor(), you must use ⊻ (sigh…)

Topic		Replies	Views
Finding DataFrame rows with `missing` values in specific columns? General Usage dataframes , missing-values	12	1077	February 7, 2022
Check if value is missing in DataFrame indexed by values (not by numerical indices) General Usage dataframes	1	1134	September 11, 2018
Iterate over all numeric columns in DataFrames Data	21	4849	February 11, 2018
How to single out valid data from DataArray? Data question	4	2241	June 8, 2017
Get the maximum value by row in a DataFrame containing missing values Data statistics , benchmark , dataframes , missing-values	10	152	August 30, 2024

How to get ranges of valid data in vectors or dataframes?

Related topics