in which the second one is just whichever rows are left over after the subset so I don’t have to do another subset. I’m not sure how the internals of subset are handled but is it possible to somehow return only the index of the rows that match the condition? In this case you could then just run a single subset function, get the indices of subset1 and just do a set exclusion to get the indices of subset2.
As an extension, this would allow you to then perform another subset on subset2 to get subset3 in a similar way without having to look at the whole data frame again.
If this is possible I’d love to know how to do it, if not how would you recommend I perform these operations efficiently?
Finally, is there a nicer way of subsetting with the negation of a boolean column than this?
No, there is nothing built-in for subset with this. I think the way to do this would be findall and have a slightly more complicated expression outside of the chaining pipeline. DataFramesMeta.jl’s @with macro will help with this.
Yes! Please see DataFramesMeta which provides exactly the syntax you describe, with the help of metaprogramming.
Alternatively, one could transform to get a new column with the Boolean values and then perform a groupby to get the groups. This method is much faster.
For the second part, it should instead read
subset(df, :a => ByRow(!isequal(true)))
which I must say is still a little verbose but at least now it’s following best practices.
As mentioned in slack, I think transform and groupby is the best solution here. Grouped DataFrames seems like the most natural built-in object for this kind of thing.
df should strictly be a DataFrame here (parentindices always refers to source data frame).
This solution answers the question directly (as you ask if and how subset can produce the indices),
however in general subset is designed to be used in piping, so it does not normally produce a bare vector of indices and what @pdeffebach suggests (i.e. to add a column with a condition and then groupby it) is preferable assuming that you can afford to add an additional column to a data frame (which usually should be OK).
Now regarding the difference between ByRow(!isequal(true)) and ByRow(!) observe that subset, if skipmissing=true would treat missing as false, but just negating missing with ! is still missing, so it would be dropped twice. To show what I mean consider: