Suppose I have a DataFrame that I want to filter using, let’s say, the interquartile range, and I want to do it for each column (so if one element is filtered out, its whole row is filtered out with it), but I don’t know which columns the df has, or even how many. How would you do that?
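Here’s an easy solution, without any special functions. Note that, despite its name, iqr_filter flags values outside the 1st and 99th percentiles rather than using IQR-based fences: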
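julia> using DataFrames, Statistics
julia> df = DataFrame(rand(100, 10), :auto);
julia> function iqr_filter(x)
           # Flag values outside the 1st-99th percentile range
           q01, q99 = quantile(x, (0.01, 0.99))
           @. (x < q01) | (x > q99)
       end;
julia> to_filter = fill(false, nrow(df));
julia> for col in eachcol(df)
           # A row is flagged if any of its columns is flagged
           to_filter .= to_filter .| iqr_filter(col)
       end;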
julia> mean(to_filter)
0.17
julia> # to_filter marks the rows to drop, so keep the others
julia> df[.!to_filter, :]
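If a literal interquartile-range rule is wanted instead of fixed percentile cutoffs, the same loop can be reused with a different mask. The following is only a sketch: the helper name `within_iqr`, the multiplier `k = 1.5` (the usual Tukey fence convention) and the small example table are assumptions, not part of the original answer.

```julia
using DataFrames, Statistics

# Hypothetical helper: true where x lies inside [Q1 - k*IQR, Q3 + k*IQR]
function within_iqr(x; k = 1.5)
    q1, q3 = quantile(x, (0.25, 0.75))
    spread = q3 - q1
    return @. (q1 - k * spread <= x) & (x <= q3 + k * spread)
end

# Small made-up table; column a contains one obvious outlier
df = DataFrame(a = [1.0, 2.0, 3.0, 4.0, 100.0],
               b = [10.0, 11.0, 12.0, 13.0, 14.0])

# Keep a row only if it is inside the fences in every column
keep = trues(nrow(df))
for col in eachcol(df)
    keep .&= within_iqr(col)
end

df_clean = df[keep, :]
```

Here the mask marks the rows to keep, so it can be used for indexing directly, without negation.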
This solution uses chaining and DataFramesMeta.jl transformations, but is probably over-complicated:
julia> using DataFramesMeta
julia> @chain df begin
           # Add an id column to model real-world usage
           @transform :id = string.("id_", eachindex($1))
           # Select only the numeric columns; could do something different
           @aside begin
               numeric_cols = names(_, Number)
           end
           # Re-do the logic from above, see the AsTable docs
           @transform :to_filter = begin
               nt = AsTable(numeric_cols)
               x = fill(false, length(nt[1]))
               for col in nt
                   x .= x .| iqr_filter(col)
               end
               x
           end
           # :to_filter marks the rows to drop, so keep the others
           @rsubset !(:to_filter)
       end
And here’s a complicated solution, which keeps only the rows lying strictly between the 5th and 95th percentiles in every column:
subset(df, (names(df) .=> [x -> quantile(df[!, c], 0.05) .< x .< quantile(df[!, c], 0.95) for c ∈ names(df)])...)
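The one-liner above can be unpacked a little. As a sketch, the per-column function can compute the quantiles from its own argument, and `names(df, Real)` can restrict the filter to numeric columns. The helper name `in_band` and the extra `id` column are assumptions added for illustration.

```julia
using DataFrames, Statistics

# Hypothetical helper: true where x is strictly between its 5th and 95th percentiles
in_band(x) = quantile(x, 0.05) .< x .< quantile(x, 0.95)

df = DataFrame(rand(100, 10), :auto)
df.id = string.("id_", 1:nrow(df))   # non-numeric column, left out of the filter

# Build one `column => function` pair per numeric column and splat them into subset;
# subset keeps a row only if every condition is true for it
num_cols = names(df, Real)
kept = subset(df, (num_cols .=> in_band)...)
```

Because the columns are discovered at run time with `names`, this works without knowing in advance which columns the data frame has, or how many.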