DataFrame Filtering

In following example I want to filter few things specific from both column at same time. I can get end result that I showed below by filtering separately and then I can combine to get end DataFrame but it doesn’t feel right. I’m pretty sure there must be some way to get same result by just one filtering step where I can filter multiple things together.

julia> x = DataFrame(a = repeat([ "s", "p", "t", "q"] , inner = 4, outer = 1), b = rand(1:10, 16))
16×2 DataFrame
 Row │ a       b     
     │ String  Int64 
─────┼───────────────
   1 │ s           4
   2 │ s          10
   3 │ s           4
   4 │ s           3
   5 │ p           4
   6 │ p           7
   7 │ p           1
   8 │ p           6
   9 │ t           9
  10 │ t           9
  11 │ t           3
  12 │ t           4
  13 │ q           6
  14 │ q           9
  15 │ q           5
  16 │ q           3

end result I want after filtering.

5×2 DataFrame
 Row │ a       b     
     │ String  Int64 
─────┼───────────────
   1 │ s           3
   2 │ p           7
   3 │ t           9
   4 │ t           9
   5 │ q           6

What’s your criterion for selecting the rows of the dataframe?

According to above example. something like this

filter(column -> column.a == "s" && column.b <4, x)
filter(column -> column.a == "p" && column.b >6, x)
filter(column -> column.a == "t" && column.b == 9, x)
filter(column -> column.a == "q" && column.b == 6, x)

I want to combine all these filters together as single filtering step.

x[(x.a .== "s" .&& x.b .< 4) .|| (x.a .== "p" .&& x.b .> 6) .|| ...]
2 Likes

I’m getting this Error.

julia> x[(x.a .== "s" .&& x.b .< 4) .|| (x.a .== "p" .&& x.b .> 6) .|| (x.a .== "t" .&& x.b .== 9) .|| (x.a .== "q" .&& x.b .== 6)]
ERROR: MethodError: no method matching getindex(::DataFrame, ::BitVector)

Closest candidates are:
  getindex(::DataFrame, ::AbstractVector{T}, ::Colon) where T
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/dataframe/dataframe.jl:605
  getindex(::DataFrame, ::AbstractVector{T}, ::Union{Colon, Regex, All, Between, Cols, InvertedIndex, AbstractVector}) where T
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/dataframe/dataframe.jl:579
  getindex(::DataFrame, ::AbstractVector, ::Union{AbstractString, Signed, Symbol, Unsigned})
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/dataframe/dataframe.jl:530
  ...

You need a column selector too:

x[ <logical expression> , :]

Using filter you may write

filter(row -> begin
                row.a == "s" && row.b <4   || 
                row.a == "p" && row.b >6   ||
                row.a == "t" && row.b == 9 ||
                row.a == "q" && row.b == 6
              end, x)
5 Likes

@rsubset from DataFramesMeta may be helpful here. I would still write these filters on multiple lines, so they are easier to read though, even if you do technically do it in one step with a begin block.

2 Likes

Thank you everyone for help :slight_smile:

DataFrame is 2-dimensiional, so you need

x[(x.a .== "s" .&& x.b .< 4) .|| (x.a .== "p" .&& x.b .> 6) .|| (x.a .== "t" .&& x.b .== 9) .|| (x.a .== "q" .&& x.b .== 6), :]

(notice the , :] at the end)

1 Like

Can these strategies also work for loop scenario? For instance lets say I have 200 rows from where I need to select based on some condition and it will really crazy to write down each filter criteria individually.

something like this can be helpful.

for i in 1:200 
     new = filter(row -> row.a == y[i] && row.b == z[i], x )
end 

#where y and z contain values of rows for column a and b respectively for filtering. 

Like this?

va = ["s", "p", "t", "q"]
vb = [4,6,9,6]

function myfilter(va,vb, x) 
    mapreduce(vcat, eachindex(va,vb)) do i
        filter(row -> row.a == va[i] && row.b == vb[i], x )
    end
end

myfilter(va,vb,x)

I’ve put the mapreduce into a function, otherwise you end up with massive compilation due to the anonymous functions inside filter every time that code block executes.

You can also programmatically create a predicate function that will require only one call to filter.

# this creates an anonymous predicate to be passed into filter
make_filter(x,y) = row->any(s->row.a==s[1] && row.b==s[2], zip(x,y))

flt = make_filter(va, vb)

filter(flt, x)
5×2 DataFrame
 Row │ a       b     
     │ String  Int64 
─────┼───────────────
   1 │ s           4
   2 │ p           6
   3 │ p           6
   4 │ t           9
   5 │ q           6
1 Like