Example:
using InMemoryDatasets
ds = Dataset(x1 = 1, x2 = 1:10, x3 = repeat(1:2, 5))
res = modify!(ds, :x2 => byrow(isodd) => :ODD)
The output is:
julia> include("test/filter.jl")
10×4 Dataset
Row │ x1 x2 x3 ODD
│ identity identity identity identity
│ Int64? Int64? Int64? Bool?
─────┼────────────────────────────────────────
1 │ 1 1 1 true
2 │ 1 2 2 false
3 │ 1 3 1 true
4 │ 1 4 2 false
5 │ 1 5 1 true
6 │ 1 6 2 false
7 │ 1 7 1 true
8 │ 1 8 2 false
9 │ 1 9 1 true
10 │ 1 10 2 false
How can I filter res to keep only the rows where ODD == true?
Ok, found a solution:
res[res[!, :ODD] .== true, :]
A bit strange (but nice) that I can use column numbers or column names as an index…
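A minimal sketch of two other ways to get the same rows, assuming that filter keeps the rows where the selected column is true when no by is given (the mapformats call further down the thread relies on the same default). The names odd_rows and odd_rows_direct are only illustrative.

```julia
using InMemoryDatasets

ds = Dataset(x1 = 1, x2 = 1:10, x3 = repeat(1:2, 5))
modify!(ds, :x2 => byrow(isodd) => :ODD)

# Assumption: with no `by`, filter keeps rows whose :ODD value is true.
odd_rows = filter(ds, :ODD)

# Same selection without the helper column: test :x2 directly.
odd_rows_direct = filter(ds, [:x2], by = [isodd])
```

Both results should contain the five rows with odd x2; the second form avoids creating the ODD column at all.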
You could try one of the following ways (see the related docs):
julia> filter(ds, [:x2,:x3], by =[>(5),isodd])
2×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 7 1
2 │ 1 9 1
julia> filter(ds, [:x2,:x3], type=any, by =[>(5),isodd])
8×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 1 1
2 │ 1 3 1
3 │ 1 5 1
4 │ 1 6 2
5 │ 1 7 1
6 │ 1 8 2
7 │ 1 9 1
8 │ 1 10 2
julia> filter(ds, 2:3, type=any, by =[>(5),iseven])
7×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 2 2
2 │ 1 4 2
3 │ 1 6 2
4 │ 1 7 1
5 │ 1 8 2
6 │ 1 9 1
7 │ 1 10 2
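To make the difference between the default and type=any concrete, here is a sketch in plain Julia (no packages) of the row-wise logic these three calls appear to follow: each predicate in by is applied to its own column, and the per-row results are combined with all or any. The helper row_mask is hypothetical and is not part of InMemoryDatasets.

```julia
x2 = collect(1:10)
x3 = repeat(1:2, 5)

# Hypothetical helper: apply predicate k to column k for each row,
# then combine the per-column results with `all` or `any`.
row_mask(reduce_fn, preds, cols) =
    [reduce_fn(p(c[i]) for (p, c) in zip(preds, cols)) for i in eachindex(cols[1])]

rows_all      = findall(row_mask(all, (>(5), isodd),  (x2, x3)))  # [7, 9]
rows_any      = findall(row_mask(any, (>(5), isodd),  (x2, x3)))  # [1, 3, 5, 6, 7, 8, 9, 10]
rows_any_even = findall(row_mask(any, (>(5), iseven), (x2, x3)))  # [2, 4, 6, 7, 8, 9, 10]
```

The three index vectors match the 2, 8 and 7 rows shown in the outputs above.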
Could the following comparison be of interest to you?
I don’t know whether the result obtained in this specific case generalizes to the real cases you are interested in.
using SplitApplyCombine, TypedTables, BenchmarkTools
t = Table(x1 = fill(1,10), x2 = collect(1:10), x3 = repeat(1:2, 5))
@btime filterview(r->r.x2>(5)&&isodd(r.x2),rows(t))
julia> @btime filterview(r->r.x2>(5)&&isodd(r.x2),rows(t))
429.146 ns (9 allocations: 448 bytes)
Table with 3 columns and 2 rows:
x1 x2 x3
┌───────────
1 │ 1 7 1
2 │ 1 9 1
#while the same operation with IMD
julia> @btime filter(ds, [:x2,:x3], by =[>(5),isodd])
7.050 μs (67 allocations: 5.28 KiB)
2×3 Dataset
Row │ x1 x2 x3
│ identity identity identity
│ Int64? Int64? Int64?
─────┼──────────────────────────────
1 │ 1 7 1
2 │ 1 9 1
# and with DF
julia> @btime subset(df, :x2=> ByRow(x-> x>5 && isodd(x)) )
13.400 μs (163 allocations: 8.62 KiB)
2×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 7 1
2 │ 1 9 1
filterview returns a view; adding view=true to the InMemoryDatasets call gives better performance:
@btime filter(ds, [:x2,:x3], by =[>(5),isodd],view=true)
But I find mapformats more interesting:
ds = Dataset(x1 = 1, x2 = 1:10, x3 = repeat(1:2, 5))
setformat!(ds,:x2=>isodd)
filter(ds,:x2,mapformats=true,view=true)
removeformat!(ds,:x2)
I ran your benchmark with a somewhat larger data set, and the results are different:
ds = Dataset(x1 = 1, x2 = 1:10, x3 = repeat(1:2, 5))
repeat!(ds,10^5)
t=Table(ds)
@btime filterview(r->r.x2>(5)&&isodd(r.x2),rows(t))
286.256 ms (5000011 allocations: 169.49 MiB)
@btime filter(ds, [:x2,:x3], by =[>(5),isodd], view = true)
884.615 μs (125 allocations: 2.60 MiB)
I get about 200 times better performance for filterview:
julia> tbl = (x1 = repeat([1], 10), x2 = 1:10, x3 = repeat(1:2, 5)) |> rowtable
julia> tbl_L = repeat(tbl, 10^5);
julia> using TypedTables
julia> @btime filterview(r->r.x2>(5)&&isodd(r.x3), rows($(Table(tbl_L))));
1.257 ms (9 allocations: 1.65 MiB)
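One difference between the benchmark calls in this thread is that this one interpolates its argument with $, while the others pass a global variable directly. A small sketch of that pattern, assuming only BenchmarkTools; the vector v is illustrative and unrelated to the tables above. BenchmarkTools recommends interpolating non-constant globals so that the lookup of the global is not part of what gets timed.

```julia
using BenchmarkTools

v = collect(1:10^6)

# `v` is a non-constant global here, so its type is not known to the compiler.
@btime count(isodd, v)

# Interpolating with `$` passes the value itself into the benchmark.
@btime count(isodd, $v)
```

Whether this explains any of the gaps reported here is not established; it is simply one thing worth holding constant when comparing the numbers.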
In fact, I had already done some tests with bigger tables, although my first comparison was wrong because the conditions in the two queries were different.
But even when “competing” on equal terms, filterview comes first on my PC:
julia> t = Table(x1 = fill(1,10^5), x2 = collect(1:10^5), x3 = collect(1:2:2*10^5));
julia> @btime filterview(r->r.x2>(5) && isodd(r.x2),rows(t));
70.200 μs (11 allocations: 407.53 KiB)
julia> @btime filterview(r->r.x2>(5) && isodd(r.x3),rows(t));
154.900 μs (11 allocations: 798.16 KiB)
julia> ds = Dataset(x1 = 1, x2 = 1:10^5, x3 = 1:2:2*10^5);
julia> @btime filter(ds, [:x2,:x3], by =[>(5),isodd],view=true);
263.700 μs (42 allocations: 893.73 KiB)
The view version of DataFrames is the fastest of the three:
julia> df = DataFrame(x1 = 1, x2 = 1:10^5, x3 = 1:2:2*10^5);
julia> @btime subset(df, [:x2,:x3]=> ByRow((x,y)-> x>5 && isodd(y)), view=true);
131.400 μs (460 allocations: 143.00 KiB)
t=Table(ds) creates a table whose columns allow missing values, and this might be the reason for the poor performance of filterview.
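A plain-Julia sketch of what "columns that allow missing" means at the type level, and one way to obtain a strictly typed copy when no missing values are actually present. The vectors below are illustrative only; whether this accounts for the timing gap is the poster's guess, not something this snippet measures.

```julia
# A column that allows missing has a Union element type,
# even if it happens to contain no missing values.
with_missing = Union{Missing, Int}[1, 2, 3]

# Converting to a concrete element type works only when no missing is present;
# otherwise it throws.
strict = convert(Vector{Int}, with_missing)

eltype(with_missing)  # Union{Missing, Int64}
eltype(strict)        # Int64
```

Benchmarking filterview on both kinds of column would be one way to test the explanation above.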
I guess that, in general, InMemoryDatasets should be the fastest one, because it uses parallel computation, whereas DataFrames and SplitApplyCombine are single-threaded.
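Since the argument rests on multithreading, it may help to report the thread count alongside any timings. A small sketch using only Base Julia; how many threads each package actually uses for these operations is not shown by this snippet.

```julia
# Number of threads the current Julia session was started with.
# Start Julia with e.g. `julia -t 4` or set JULIA_NUM_THREADS to change it.
n_threads = Threads.nthreads()
println("Julia is running with ", n_threads, " thread(s)")
```

With a single thread, any advantage from parallel computation would not show up in the benchmarks.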