(DataFrames.jl Suggestion) A (public) function that takes the same args as `subset` and returns the matched indices

It would be great if DataFrames.jl had a function (or functions) that worked more or less the same way `subset` does, except that they would return a vector containing the indices of the kept rows instead of a new data frame. This vector would be suitable for subsequent row indexing. (Thankfully, this function more or less exists already as the internal `DataFrames._get_subset_conditions`.) For example, you’d have something like this:

julia> df = allcombinations(DataFrame, Symbol("col 1")=>1:5, Symbol("col 2")=>1:5); df[!, "col 3"] = missings(String, nrow(df)); df
25×3 DataFrame
 Row │ col 1  col 2  col 3   
     │ Int64  Int64  String? 
─────┼───────────────────────
   1 │     1      1  missing 
   2 │     2      1  missing 
   3 │     3      1  missing 
  ⋮  │   ⋮      ⋮       ⋮
  23 │     3      5  missing 
  24 │     4      5  missing 
  25 │     5      5  missing 
              19 rows omitted

julia> #= current way (BitVector) =# (df[!, "col 1"] .% 2 .== 0) .&& (df[!, "col 2"] .% 2 .== 1)
25-element BitVector:
 0
 1
 0
 1
 0
 ⋮
 0
 1
 0
 1
 0

julia> #= current way (indices) =# findall((df[!, "col 1"] .% 2 .== 0) .&& (df[!, "col 2"] .% 2 .== 1))
6-element Vector{Int64}:
  2
  4
 12
 14
 22
 24

julia> subset_conditions(df, selectors...; skipmissing::Bool=false, threads::Bool=true) = DataFrames._get_subset_conditions(df, Ref{Any}(selectors), skipmissing, threads);

julia> #= proposed way =# subset_conditions(df, "col 1" => c -> c .% 2 .== 0, "col 2" => c -> c .% 2 .== 1)
25-element BitVector:
 0
 1
 0
 1
 0
 ⋮
 0
 1
 0
 1
 0

julia> subset_indices(df, selectors...; skipmissing::Bool=false, threads::Bool=true) = findall(DataFrames._get_subset_conditions(df, Ref{Any}(selectors), skipmissing, threads));

julia> #= proposed way =# subset_indices(df, "col 1" => c -> c .% 2 .== 0, "col 2" => c -> c .% 2 .== 1)
6-element Vector{Int64}:
  2
  4
 12
 14
 22
 24

Since these are suitable for indexing, you can do something like this:

julia> df[subset_indices(df, "col 1" => c -> c .% 2 .== 0, "col 2" => c -> c .% 2 .== 1), "col 3"] .= "even,odd"; df
25×3 DataFrame
 Row │ col 1  col 2  col 3    
     │ Int64  Int64  String?  
─────┼────────────────────────
   1 │     1      1  missing  
   2 │     2      1  even,odd
   3 │     3      1  missing  
  ⋮  │   ⋮      ⋮       ⋮
  23 │     3      5  missing  
  24 │     4      5  even,odd
  25 │     5      5  missing  
               19 rows omitted

For a simple example like this, not much is gained, but for more complicated conditions I think it begins to be worth it, especially if you have ByRow transformations that are tricky to express with broadcasting.
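To make the ByRow case concrete, here is a minimal sketch that gets the kept row indices through public API only, without touching `_get_subset_conditions`. It assumes that `subset(...; view=true)` returns a `SubDataFrame` and that `parentindices` on that view gives the row indices into the parent as its first element. `keep` is a hypothetical predicate made up for illustration, not part of DataFrames.jl.

```julia
using DataFrames

# Same 25-row frame as in the example above.
df = allcombinations(DataFrame, Symbol("col 1") => 1:5, Symbol("col 2") => 1:5)

# Hypothetical row-wise predicate: first value even, second value odd.
keep(a, b) = iseven(a) && isodd(b)

# With view=true, subset returns a view into df instead of a copy.
sdf = subset(df, ["col 1", "col 2"] => ByRow(keep); view=true)

# parentindices returns (rows, cols); the first element is the kept rows of df.
idx = parentindices(sdf)[1]
```

If those assumptions hold, `idx` should contain the same six row numbers as the `findall` version above, and it can be used for indexing in the same way (`df[idx, "col 3"] .= ...`). The cost is building a view first, so a dedicated function would still be the more direct route.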

Can you please open an issue in DataFrames.jl for this so that we can keep track of this request?

Here I can comment on why we do not expose such a function currently (but maybe we should change our decision). The reason is that the ecosystem was designed with the principle that functions should return a data frame. This is useful because it makes chaining several operations convenient.
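As a small illustration of that chaining point (a sketch only, with `evens` and `counts` as made-up variable names): because `subset` returns a data frame, its result can be passed straight into the next operation, which would not work if it returned indices.

```julia
using DataFrames

df = allcombinations(DataFrame, Symbol("col 1") => 1:5, Symbol("col 2") => 1:5)

# subset returns a DataFrame, so the result feeds directly into groupby/combine.
evens = subset(df, "col 1" => ByRow(iseven))
counts = combine(groupby(evens, "col 2"), nrow => :n)
```

Here `counts` has one row per value of "col 2", with `n` giving the number of kept rows in each group.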