It would be great if DataFrames.jl had a function (or functions) that worked more or less the same way subset does, except that it would return a vector containing the indices of the kept rows instead of a new data frame. This vector would be suitable for subsequent row indexing. (Thankfully, this function more or less exists already, as the internal DataFrames._get_subset_conditions.) For example, you'd have something like this:
julia> df = allcombinations(DataFrame, Symbol("col 1")=>1:5, Symbol("col 2")=>1:5); df[!, "col 3"] = missings(String, nrow(df)); df
25×3 DataFrame
 Row │ col 1  col 2  col 3
     │ Int64  Int64  String?
─────┼───────────────────────
   1 │     1      1  missing
   2 │     2      1  missing
   3 │     3      1  missing
  ⋮  │   ⋮      ⋮       ⋮
  23 │     3      5  missing
  24 │     4      5  missing
  25 │     5      5  missing
              19 rows omitted
julia> #= current way (BitVector) =# (df[!, "col 1"] .% 2 .== 0) .&& (df[!, "col 2"] .% 2 .== 1)
25-element BitVector:
0
1
0
1
0
⋮
0
1
0
1
0
julia> #= current way (indices) =# findall((df[!, "col 1"] .% 2 .== 0) .&& (df[!, "col 2"] .% 2 .== 1))
6-element Vector{Int64}:
2
4
12
14
22
24
julia> subset_conditions(df, selectors...; skipmissing::Bool=false, threads::Bool=true) = DataFrames._get_subset_conditions(df, Ref{Any}(selectors), skipmissing, threads);
julia> #= proposed way =# subset_conditions(df, "col 1" => c -> c .% 2 .== 0, "col 2" => c -> c .% 2 .== 1)
25-element BitVector:
0
1
0
1
0
⋮
0
1
0
1
0
julia> subset_indices(df, selectors...; skipmissing::Bool=false, threads::Bool=true) = findall(DataFrames._get_subset_conditions(df, Ref{Any}(selectors), skipmissing, threads));
julia> #= proposed way =# subset_indices(df, "col 1" => c -> c .% 2 .== 0, "col 2" => c -> c .% 2 .== 1)
6-element Vector{Int64}:
2
4
12
14
22
24
Since these are suitable for indexing, you can do something like this:
julia> df[subset_indices(df, "col 1" => c -> c .% 2 .== 0, "col 2" => c -> c .% 2 .== 1), "col 3"] .= "even,odd"; df
25×3 DataFrame
 Row │ col 1  col 2  col 3
     │ Int64  Int64  String?
─────┼────────────────────────
   1 │     1      1  missing
   2 │     2      1  even,odd
   3 │     3      1  missing
  ⋮  │   ⋮      ⋮       ⋮
  23 │     3      5  missing
  24 │     4      5  even,odd
  25 │     5      5  missing
               19 rows omitted
For a simple example like this, not much is gained, but for more complicated functions I think it begins to be worth it, especially if you have ByRow transformations that are tricky to express with broadcasting.
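As a rough sketch of the ByRow case, the block below reuses the subset_indices helper from above with a row-wise predicate that branches on its arguments. The predicate keep_row is a made-up example, and the helper still leans on the internal DataFrames._get_subset_conditions, which is not public API and may change between releases, so treat this as an illustration rather than a recommended implementation.

```julia
using DataFrames

# Helper from earlier in this post. It wraps the internal function
# DataFrames._get_subset_conditions (not public API; may change).
subset_indices(df, selectors...; skipmissing::Bool=false, threads::Bool=true) =
    findall(DataFrames._get_subset_conditions(df, Ref{Any}(selectors), skipmissing, threads))

df = allcombinations(DataFrame, Symbol("col 1") => 1:5, Symbol("col 2") => 1:5)

# Hypothetical row-wise predicate: which test applies depends on the row itself.
keep_row(a, b) = a < b ? iseven(a) : isodd(b)

# Same selector syntax as subset, but the result is a vector of row indices.
idx = subset_indices(df, ["col 1", "col 2"] => ByRow(keep_row))

# Sanity check: the indices pick out exactly the rows that subset keeps.
same_rows = isequal(df[idx, :], subset(df, ["col 1", "col 2"] => ByRow(keep_row)))
```

The selector is written once, exactly as it would be for subset, and idx can then be reused for reads or in-place updates such as df[idx, "col 3"] .= "...".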