Filtering dataframe row by row procedurally?

Suppose I have a dataframe that I want to filter using, let’s say, the interquartile range, and I want to do it for each column (so if one element is filtered out, its whole row is filtered out with it), but I don’t know which columns the df has, or even how many. How would you do that?

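Here’s an easy solution without any special functions. Note that, despite its name, iqr_filter flags values outside the 1st and 99th percentiles of each column rather than using the interquartile range itself: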


julia> using DataFrames, Statistics

julia> df = DataFrame(rand(100, 10), :auto);

julia> function iqr_filter(x)
           q01, q99 = quantile(x, (0.01, 0.99))
           @. (x < q01) | (x > q99)
       end;

julia> to_filter = fill(false, nrow(df));

julia> for col in eachcol(df)
           to_filter .= to_filter .| iqr_filter(col)
       end;

julia> mean(to_filter)
0.17

julia> df[.!to_filter, :]  # keep only the rows that were not flagged
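
For comparison, since the question asks about the interquartile range specifically, here is a minimal sketch of the same row-wise idea using Tukey-style fences (Q1 - k*IQR, Q3 + k*IQR). The names within_iqr and iqr_rows, the default k = 1.5, and the toy data are my own assumptions for illustration, not something from the thread.

```julia
using DataFrames, Statistics

# True where a value lies inside [Q1 - k*IQR, Q3 + k*IQR].
# k = 1.5 is a common convention, assumed here for illustration.
function within_iqr(x; k = 1.5)
    q1, q3 = quantile(x, (0.25, 0.75))
    fence = k * (q3 - q1)
    return (q1 - fence) .<= x .<= (q3 + fence)
end

# Keep a row only if every column is inside its own fences.
# Works for any number of columns, whatever their names are.
function iqr_rows(df; k = 1.5)
    keep = trues(nrow(df))
    for col in eachcol(df)
        keep .&= within_iqr(col; k = k)
    end
    return df[keep, :]
end

# Toy data: the last row has an obvious outlier in column a.
df_toy = DataFrame(a = [1.0, 2.0, 3.0, 4.0, 100.0],
                   b = [10.0, 11.0, 12.0, 13.0, 14.0])
filtered = iqr_rows(df_toy)
```

This assumes every column is numeric; with mixed column types you would loop over something like eachcol(df[!, names(df, Real)]) instead.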

This solution uses chaining and DataFramesMeta.jl transformations, but it is probably over-complicated:

julia> using DataFramesMeta

julia> @chain df begin
           # Add an id column to model real-world usage
           @transform :id = string.("id_", eachindex($1))
           # Selects only numeric columns, could do something
           # different
           @aside begin
               numeric_cols = names(_, Number)
           end
           # Re-do logic from above, see AsTable docs
           @transform :to_filter = begin
               nt = AsTable(numeric_cols)
               x = fill(false, length(nt[1]))
               for col in nt
                   x .= x .| iqr_filter(col)
               end
               x
           end
           # Keep only the rows that were not flagged
           @rsubset !(:to_filter)
       end

And here’s a complicated solution :smiley:

subset(df, (names(df) .=> [x -> quantile(df[!, c], 0.05) .< x .< quantile(df[!, c], 0.95) for c ∈ names(df)])...)
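
A slightly shorter variant of the same idea, as a sketch: subset already passes each whole column to the function, so the quantiles can be computed from the argument instead of indexing df again. The helper name in_band, the toy data, and the choice to restrict to names(df_demo, Real) are my own assumptions for illustration.

```julia
using DataFrames, Statistics

# Receives a whole column and returns one Bool per row.
in_band(x) = quantile(x, 0.05) .< x .< quantile(x, 0.95)

# Toy data with a non-numeric column that should be left alone.
df_demo = DataFrame(a = collect(1.0:20.0),
                    b = collect(20.0:-1.0:1.0),
                    label = string.("id_", 1:20))

# Pick numeric columns without knowing their names in advance.
num = names(df_demo, Real)

# One `column => in_band` pair per numeric column; a row is kept
# only if it passes the condition for every one of them.
trimmed = subset(df_demo, (num .=> in_band)...)
```

Swapping in_band for an IQR-based predicate would give an interquartile-range filter with the same one-line call.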