How to filter InMemoryDatasets

ufechner7 · June 28, 2022, 7:29pm

Example:

using InMemoryDatasets
ds = Dataset(x1 = 1, x2 = 1:10, x3 = repeat(1:2, 5))
res = modify!(ds, :x2 => byrow(isodd)  => :ODD)

The output is:

julia> include("test/filter.jl")
10×4 Dataset
 Row │ x1        x2        x3        ODD      
     │ identity  identity  identity  identity 
     │ Int64?    Int64?    Int64?    Bool?    
─────┼────────────────────────────────────────
   1 │        1         1         1      true
   2 │        1         2         2     false
   3 │        1         3         1      true
   4 │        1         4         2     false
   5 │        1         5         1      true
   6 │        1         6         2     false
   7 │        1         7         1      true
   8 │        1         8         2     false
   9 │        1         9         1      true
  10 │        1        10         2     false

How can I filter res to keep only the rows where ds.ODD == true?

ufechner7 · June 28, 2022, 8:04pm

Ok, found a solution:

res[res[!, :ODD] .== true, :]

A bit strange (but nice) that I can use column numbers or column names as an index…
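
For comparison, here is a minimal sketch that puts the boolean-mask indexing from the post above next to the package's own `filter`. It assumes that `filter(ds, cols)` without a `by` keyword keeps the rows where the column equals `true` (that is, that `by` defaults to `isequal(true)`); the names `by_mask` and `by_filter` are made up for illustration.

```julia
using InMemoryDatasets

ds = Dataset(x1 = 1, x2 = 1:10, x3 = repeat(1:2, 5))
modify!(ds, :x2 => byrow(isodd) => :ODD)

# Boolean-mask indexing, as in the post above.
by_mask = ds[ds[!, :ODD] .== true, :]

# The package's filter; assumption: `by` defaults to isequal(true),
# so only rows with ODD == true are kept.
by_filter = filter(ds, :ODD)

println(nrow(by_mask))
println(nrow(by_filter))
```

If the assumption holds, both calls should return the same five rows.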

rocco_sprmnt21 · June 29, 2022, 6:38am

You could try one of these ways.
Here is the related doc:

julia> filter(ds, [:x2,:x3], by =[>(5),isodd])
2×3 Dataset
 Row │ x1        x2        x3       
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        1         7         1
   2 │        1         9         1

julia> filter(ds, [:x2,:x3], type=any, by =[>(5),isodd])
8×3 Dataset
 Row │ x1        x2        x3       
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        1         1         1
   2 │        1         3         1
   3 │        1         5         1
   4 │        1         6         2
   5 │        1         7         1
   6 │        1         8         2
   7 │        1         9         1
   8 │        1        10         2

julia> filter(ds, 2:3, type=any, by =[>(5),iseven])
7×3 Dataset
 Row │ x1        x2        x3       
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        1         2         2
   2 │        1         4         2
   3 │        1         6         2
   4 │        1         7         1
   5 │        1         8         2
   6 │        1         9         1
   7 │        1        10         2
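
To make the row selection in the three calls above easier to follow, here is a rough sketch in plain Julia, without InMemoryDatasets. It only mirrors what the printed results suggest: each function in `by` is applied to its matching column, and `type` (default `all`, or `any`) decides how the per-column results are combined. The vectors and the `keep_*` names are stand-ins made up for illustration, not part of the package.

```julia
# Plain vectors standing in for the Dataset columns x2 and x3.
x2 = collect(1:10)
x3 = repeat(1:2, 5)

# Default (type = all): keep a row when every condition holds.
keep_all = (x2 .> 5) .& isodd.(x3)

# type = any: keep a row when at least one condition holds.
keep_any = (x2 .> 5) .| isodd.(x3)

# type = any with iseven on x3, as in the third call.
keep_any_even = (x2 .> 5) .| iseven.(x3)

println(x2[keep_all])       # [7, 9]
println(x2[keep_any])       # [1, 3, 5, 6, 7, 8, 9, 10]
println(x2[keep_any_even])  # [2, 4, 6, 7, 8, 9, 10]
```

The selected `x2` values match the 2-, 8- and 7-row results shown above.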

rocco_sprmnt21 · June 29, 2022, 9:23am

Could the following comparison be of interest to you?
I don’t know whether the result obtained in this specific case generalizes to the real cases you are interested in.

using SplitApplyCombine, TypedTables, BenchmarkTools
t = Table(x1 = fill(1,10), x2 = collect(1:10), x3 = repeat(1:2, 5))
@btime filterview(r->r.x2>(5)&&isodd(r.x2),rows(t))


julia> @btime filterview(r->r.x2>(5)&&isodd(r.x2),rows(t))
  429.146 ns (9 allocations: 448 bytes)
Table with 3 columns and 2 rows:
     x1  x2  x3
   ┌───────────
 1 │ 1   7   1
 2 │ 1   9   1


# while the same operation with IMD (InMemoryDatasets)


julia> @btime filter(ds, [:x2,:x3], by =[>(5),isodd])
  7.050 μs (67 allocations: 5.28 KiB)
2×3 Dataset
 Row │ x1        x2        x3       
     │ identity  identity  identity
     │ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 │        1         7         1
   2 │        1         9         1



# and with DF (DataFrames)

julia> @btime subset(df, :x2=> ByRow(x-> x>5 && isodd(x)) )
  13.400 μs (163 allocations: 8.62 KiB)
2×3 DataFrame
 Row │ x1     x2     x3    
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      7      1
   2 │     1      9      1

monopolynomial · July 1, 2022, 3:51am

filterview returns a view. Adding view=true to filter gives better performance:

@btime filter(ds, [:x2,:x3], by =[>(5),isodd],view=true)

But I find mapformats more interesting:

ds = Dataset(x1 = 1, x2 = 1:10, x3 = repeat(1:2, 5))
setformat!(ds,:x2=>isodd)
filter(ds,:x2,mapformats=true,view=true)
removeformat!(ds,:x2)

I ran your benchmark with a somewhat larger data set and the results are different:

ds = Dataset(x1 = 1, x2 = 1:10, x3 = repeat(1:2, 5))
repeat!(ds,10^5)
t=Table(ds)
@btime filterview(r->r.x2>(5)&&isodd(r.x2),rows(t))
  286.256 ms (5000011 allocations: 169.49 MiB)
@btime filter(ds, [:x2,:x3], by =[>(5),isodd], view = true)
  884.615 μs (125 allocations: 2.60 MiB)

aplavin · July 1, 2022, 11:04am

I get 200 times better performance:

julia> using Tables, TypedTables, SplitApplyCombine, BenchmarkTools
julia> tbl = (x1 = repeat([1], 10), x2 = 1:10, x3 = repeat(1:2, 5)) |> rowtable
julia> tbl_L = repeat(tbl, 10^5);
julia> @btime filterview(r->r.x2>(5)&&isodd(r.x3), rows($(Table(tbl_L))));
  1.257 ms (9 allocations: 1.65 MiB)

rocco_sprmnt21 · July 1, 2022, 1:14pm

In fact, I had already done some tests with bigger tables.

Admittedly, my first comparison was wrong, because the conditions in the two queries were different.
But even when “competing” on the same terms, filterview comes out ahead on my PC:

julia> t = Table(x1 = fill(1,10^5), x2 = collect(1:10^5), x3 = collect(1:2:2*10^5));     

julia> @btime filterview(r->r.x2>(5) && isodd(r.x2),rows(t));
  70.200 μs (11 allocations: 407.53 KiB)

julia> @btime filterview(r->r.x2>(5) && isodd(r.x3),rows(t));
  154.900 μs (11 allocations: 798.16 KiB)

julia> ds = Dataset(x1 = 1, x2 = 1:10^5, x3 = 1:2:2*10^5);

julia> @btime filter(ds, [:x2,:x3], by =[>(5),isodd],view=true);
  263.700 μs (42 allocations: 893.73 KiB)

The view version of DataFrames is the fastest of the 3:


julia> df = DataFrame(x1 = 1, x2 = 1:10^5, x3 = 1:2:2*10^5);

julia> @btime subset(df, [:x2,:x3]=> ByRow((x,y)-> x>5 && isodd(y)), view=true); 
  131.400 μs (460 allocations: 143.00 KiB)

DataFrames · July 23, 2022, 5:14am

t=Table(ds) creates a table whose columns allow missing values, and this might be the reason for the poor performance of filterview.
I guess that, in general, InMemoryDatasets should be the fastest one, because it uses parallel computation, whereas DataFrames and SplitApplyCombine are single-threaded.
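
One way to test the missing-columns hypothesis might be to narrow the column element types before building the Table and then rerun the benchmark. The sketch below uses only Base Julia and a made-up vector `col` typed like the `Int64?` columns shown in the Dataset output; whether this actually closes the gap in the benchmark above is not verified here.

```julia
# A column typed like the Dataset columns (Int64?), with no actual missing values.
col = Union{Missing, Int}[1, 2, 3, 4]

# Narrow the element type; this throws if a missing value is present.
narrow = convert(Vector{Int}, col)

println(eltype(col))
println(eltype(narrow))

# The values are unchanged; only the element type differs.
println(narrow == col)
```

The narrowed vectors could then be passed as columns to `Table(...)` in place of the originals.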
