Dataframe filtering based on conditions comparing one column to other columns

Hi,

I am trying to apply a condition to a selection of columns based on one column and the resulting dataframe.

Here is an example:

df
10×5 DataFrame
 Row │ x1         x2         x3         x4         x5       
     │ Float64    Float64    Float64    Float64    Float64  
─────┼──────────────────────────────────────────────────────
   1 │ 0.0981372  0.0804932  0.513072   0.450144   0.52068
   2 │ 0.630371   0.968476   0.556494   0.493968   0.347883
   3 │ 0.64709    0.643473   0.670181   0.765159   0.609178
   4 │ 0.394128   0.437874   0.782004   0.15423    0.58172
   5 │ 0.373165   0.783987   0.0135377  0.587961   0.328194
   6 │ 0.0741134  0.889961   0.0378584  0.345939   0.567646
   7 │ 0.311853   0.122679   0.0823076  0.615964   0.271999
   8 │ 0.880035   0.553531   0.491427   0.505312   0.50748
   9 │ 0.285774   0.187106   0.547379   0.193307   0.894698
  10 │ 0.0776693  0.0266613  0.658087   0.0714121  0.667334

How can I apply a filter (.< df.x5) on the first 2 columns and get all the columns as a result?

I tried this but it is giving me a boolean of the first 2 columns.

df[:, Not(Between(:x3,end))] .> df.x5

10×2 DataFrame
 Row │ x1     x2    
     │ Bool   Bool  
─────┼──────────────
   1 │ false  false
   2 │  true   true
   3 │  true   true
   4 │ false  false
   5 │  true   true
   6 │ false   true
   7 │  true  false
   8 │  true   true
   9 │ false  false
  10 │ false  false

I would like a resulting dataframe containing all the columns from x1 to x5 with the filtered row values

I hope this is clear enough …

Thanks!

There are loads of ways of doing this, maybe the simplest one is:

df[(df.x1 .> df.x5) .& (df.x2 .> df.x5), :]
1 Like

Thanks,

If I have more columns to do the filtering on, is there a way to set a condition that would exclude a certain number of columns (e.g. Not(:x3,end)) from the filtering while keeping them in the output?

Here’s one way:

julia> using DataFrames

julia> df = DataFrame(rand(8,5), :auto)

8×5 DataFrame
 Row │ x1         x2        x3        x4         x5       
     │ Float64    Float64   Float64   Float64    Float64  
─────┼────────────────────────────────────────────────────
   1 │ 0.39997    0.315291  0.564286  0.181416   0.272942
   2 │ 0.406449   0.136713  0.869279  0.856068   0.653893
   3 │ 0.642072   0.371682  0.581247  0.94918    0.106684
   4 │ 0.0550675  0.778223  0.953143  0.0357677  0.562967
   5 │ 0.0294628  0.146155  0.480069  0.237095   0.586552
   6 │ 0.621742   0.456891  0.635841  0.308975   0.630643
   7 │ 0.267329   0.634341  0.269092  0.169279   0.629037
   8 │ 0.864646   0.624525  0.553297  0.32189    0.245451

julia> columns = Cols(:x5, Not(Between(:x3, ncol(df))));

julia> myfilter(x, v...) = all(<(x), v);

julia> filter(columns => myfilter, df) # Or subset(df, columns => ByRow(myfilter))

3×5 DataFrame
 Row │ x1         x2        x3        x4        x5       
     │ Float64    Float64   Float64   Float64   Float64  
─────┼───────────────────────────────────────────────────
   1 │ 0.406449   0.136713  0.869279  0.856068  0.653893
   2 │ 0.0294628  0.146155  0.480069  0.237095  0.586552
   3 │ 0.621742   0.456891  0.635841  0.308975  0.630643
1 Like

Other solutions:

filter(df) do row
    all(<(row.x5), row[Not(Between(:x3, end))])
end
columns = names(df, Not(Between(:x3, ncol(df))))
subset(df, (vcat.(columns, "x5") .=> .<)...)
2 Likes

Thanks! - went for the “filter - row” solution.

Just out of curiosity - could you think of something with DataFramesMeta? using @rsubset maybe?

Note that the filter with do block is the slowest solution I think, but maybe it’s fine for your use case.

I don’t think DataFramesMeta helps in this case (its support for multi-column arguments is a bit rudimentary for now). The best I can come up with is

columns = Cols(:x5, Not(Between(:x3, ncol(df))))
myfilter(x, v...) = all(<(x), v)
@subset df $(columns => ByRow(myfilter))

which you could write in one line at the cost of readability…

Maybe @pdeffebach has a better idea?

1 Like