Creating a function to produce stratum indicators efficiently

Hi all,

I am trying to create a function that takes as inputs df, key, and S, where

  • df: Data frame

  • key: Variable to group on (namely subject ID)

  • S: Number of strata

and creates a column in df named stratum that specifies which row belongs to which stratum (s = 1, ..., S).

For example, I want to produce something like:

 Row │ ID     rᵢ     y.           v      x1       x2         x0       stratum    
─────┼───────────────────────────────────────────────────────────────────
   1 │     1      1  1.5831     true      0.0  -0.522896      1.0      1
   2 │     1      2  4.01085    true      0.0  -0.522896      1.0      2
   3 │     1      3  4.39671    true      0.0  -0.522896      1.0      3
   4 │     1      4  4.84606    true      0.0  -0.522896      1.0      4
   5 │     1      5  4.99947    true      0.0  -0.522896      1.0      4
   6 │     1      6  5.26577   false      0.0  -0.522896      1.0      4
   7 │     2      1  0.113026   true      0.0  -0.780132      1.0      1
   8 │     2      2  0.849384   true      0.0  -0.780132      1.0      2
   9 │     2      3  2.25784    true      0.0  -0.780132      1.0      3
  10 │     2      4  3.01167    true      0.0  -0.780132      1.0      4
  11 │     2      5  4.98009    true      0.0  -0.780132      1.0      4
  12 │     2      6  5.24923    true      0.0  -0.780132      1.0      4
  13 │     2      7  5.25211    true      0.0  -0.780132      1.0      4
  14 │     2      8  5.27893    true      0.0  -0.780132      1.0      4
  15 │     2      9  5.36605    true      0.0  -0.780132      1.0      4
  16 │     2     10  5.72365   false      0.0  -0.780132      1.0      4
  17 │     3      1  0.362733   true      0.0   0.768787      1.0      1
  18 │     3      2  4.03361    true      0.0   0.768787      1.0      2
  19 │     3      3  7.0183     true      0.0   0.768787      1.0      3
  20 │     3      4  9.27818    true      0.0   0.768787      1.0      4
  21 │     3      5  9.70474    true      0.0   0.768787      1.0      4
  22 │     3      6  9.84579   false      0.0   0.768787      1.0      4

Essentially, stratum matches rᵢ up to 4, and stratum = 4 for all rᵢ > 4. This corresponds to an input of S = 4.

So far, I have something like this coded up:

function createStrataByEvent(df :: DataFrame, key, S :: Int)

    for s in 1:S
        @chain df begin
            groupby(Symbol(key))
            @transform :stratum = ???
        end
    end
end

But I am not sure how to create :stratum. At first glance, it seems like this can be achieved with an ifelse statement; however, I am not sure how to code the else portion, if this is the way forward.

Would appreciate any guidance on this.

Thanks,
Eric

There may be a more DataFrames-y way to do it, but here’s one way that works:

julia> function create_strata_by_event!(df::DataFrame, key, S::Int)
         gdf =  groupby(df, Symbol(key))
         for subdf in gdf
           subdf.stratum = clamp.(1:nrow(subdf), 1, S)
         end
       end
create_strata_by_event! (generic function with 1 method)

Note that this modifies the input dataframe itself to add a stratum column to it; so I’ve renamed the function according to Julia convention, including the bang ! at the end to indicate this mutation.
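On the ifelse idea from the question: clamp is doing that job here. As a small sketch using only Base Julia (the variable names below are just for illustration), these three spellings should give the same result for the row numbers within one subject:

```julia
S = 4
r = 1:6    # row numbers within one subject, like the rᵢ column above

via_clamp  = clamp.(r, 1, S)         # limit each value to the range 1..S
via_min    = min.(r, S)              # same thing, since r never goes below 1
via_ifelse = ifelse.(r .<= S, r, S)  # keep r if r <= S, otherwise use S
```

So the "else" branch is simply S itself. I used clamp because it states the intent (keep values within 1..S) directly.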

Thanks for the quick response.

Also, what is meant by “Julia convention”? I’m new to programming in Julia, so I’m unfamiliar with the standard practice (this is also my first time seeing the ! used in action, even though I see it in several .jl functions).

For the !, see: Append ! to names of functions that modify their arguments

For the rest of the name, see: Use naming conventions consistent with Julia base/

functions are lowercase (maximum, convert) and, when readable, with multiple words squashed together (isequal, haskey). When necessary, use underscores as word separators.

The original name createStrataByEvent uses the CamelCase naming convention, which is generally not the preferred way to name functions in Julia. So I changed it to lowercase with underscores _ separating the words (sometimes called “snake case”).
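To see the ! convention in Base itself, sort and sort! make a handy pair. A minimal sketch (the variable names are just for illustration):

```julia
v = [3, 1, 2]

w = sort(v)              # no bang: returns a sorted copy, v is left alone
v_after_sort = copy(v)   # snapshot of v at this point, still unsorted

sort!(v)                 # bang: sorts v in place
```

After this runs, w and v are both sorted, while v_after_sort still holds the original order. Note that the ! is only a naming convention; the language does not enforce it.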

By the way, an option that uses @chain, but is based on the same overall logic as the above, is:

julia> function create_strata_by_event(df::DataFrame, key, S::Int)
         @chain df begin
           groupby(Symbol(key))
           combine(:, Symbol(key) => (ids -> clamp.(1:length(ids), 1, S)) => :stratum)
         end
       end
create_strata_by_event (generic function with 1 method)

Note that in this case, this function doesn’t change the original dataframe df, but returns a new dataframe with the stratum column added. So in this case we’re choosing not to end the function name with a bang !.
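For completeness, here is a rough sketch of the same non-mutating idea written with transform instead of combine, run on a small made-up table. The name add_strata and the toy data are my own for illustration, not something from the posts above, and it assumes DataFrames.jl is installed:

```julia
using DataFrames

# Toy data, made up for illustration: two subjects with 6 and 3 rows.
toy = DataFrame(ID = [1, 1, 1, 1, 1, 1, 2, 2, 2],
                r  = [1, 2, 3, 4, 5, 6, 1, 2, 3])

# Hypothetical helper name; returns a new data frame and leaves df untouched.
function add_strata(df::DataFrame, key, S::Int)
    gdf = groupby(df, Symbol(key))
    # Within each group, number the rows 1, 2, ... and cap the number at S.
    return transform(gdf, Symbol(key) => (ids -> clamp.(1:length(ids), 1, S)) => :stratum)
end

out = add_strata(toy, :ID, 4)
```

Here out.stratum follows r up to 4 and stays at 4 afterwards, and toy itself does not gain a stratum column, which is why there is no ! in the name.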