Subsetting boolean columns and multiple assignments

Hello, I have two related questions both related to the following code:

df = DataFrame(a = [true, true, false])
subset1 = subset(df, :a)
subset2 = subset(df, :a => ByRow(!=(true)))

I want to be able to do something like

subset1, subset2 = subset(df, :a, return_both = true)

in which the second one is just whichever rows are left over after the subset so I don’t have to do another subset. I’m not sure how the internals of subset are handled but is it possible to somehow return only the index of the rows that match the condition? In this case you could then just run a single subset function, get the indices of subset1 and just do a set exclusion to get the indices of subset2.

As an extension, this would allow you to then perform another subset on subset2 to get subset3 in a similar way without having to look at the whole data frame again.

If this is possible I’d love to know how to do it, if not how would you recommend I perform these operations efficiently?

Finally, is there a nicer way of subsetting with the negation of a boolean column than this?

subset(df, :a => ByRow(!=(true)))

Perhaps something like this?

subset(df, !:a)

No, there is nothing built-in for subset with this. I think the way to do this would be findall and have a slightly more complicated expression outside of the chaining pipeline. DataFramesMeta.jl’s @with macro will help with this.

Yes! Please see DataFramesMeta which provides exactly the syntax you describe, with the help of metaprogramming.

2 Likes

These answers comes from @bkamins and @ericphanson from Slack, posting here for permanence:

For the first part, getting indices can be done with parentindices resulting in the following example function:

function subset_returning_both(df::AbstractDataFrame, args...; skipmissing::Bool = false, view::Bool = false)
    subtestdf = subset(df, args...; skipmissing, view)
    indices = 1:size(df)[1]
    matching_indices = parentindices(subtestdf)[1]
    nonmatching_indices = setdiff(indices, matching_indices)
    return df[matching_indices, :], df[nonmatching_indices, :]
end

Alternatively, one could transform to get a new column with the Boolean values and then perform a groupby to get the groups. This method is much faster.

For the second part, it should instead read

subset(df, :a => ByRow(!isequal(true)))

which I must say is still a little verbose but at least now it’s following best practices.

2 Likes

As mentioned in slack, I think transform and groupby is the best solution here. Grouped DataFrames seems like the most natural built-in object for this kind of thing.

2 Likes

I think you can do just

subset(df, :a => ByRow(!))

if it really is Boolean and doesn’t contain missings for example.

2 Likes

Some small comments to the answers already given.

df should strictly be a DataFrame here (parentindices always refers to source data frame).

This solution answers the question directly (as you ask if and how subset can produce the indices),
however in general subset is designed to be used in piping, so it does not normally produce a bare vector of indices and what @pdeffebach suggests (i.e. to add a column with a condition and then groupby it) is preferable assuming that you can afford to add an additional column to a data frame (which usually should be OK).

Now regarding the difference between ByRow(!isequal(true)) and ByRow(!) observe that subset, if skipmissing=true would treat missing as false, but just negating missing with ! is still missing, so it would be dropped twice. To show what I mean consider:

julia> .!([true, false, missing])
3-element Vector{Union{Missing, Bool}}:
 false
  true
      missing

julia> (!isequal(true)).([true, false, missing])
3-element BitVector:
 0
 1
 1
3 Likes