Dataframe Filter

Hi!
I was trying to filter data and view the result of some selective columns, but can’t figure out.

I created a dataframe first,

using DataFrames
df = DataFrame(
    name = ["a", "b", "c", "d"], 
    quiz_1 = [14,15,13,14], 
    quiz_2 = [18,16,16,17], 
    presentation = [7,6,6,8], 
    final = [42,46,47,49])
name	quiz_1	quiz_2	presentation	final
String	Int64	Int64	Int64	        Int64
1	a	14	    18	    7	            42
2	b	15	    16	    6	            46
3	c	13	    16	    6	            47
4	d	14	    17	    8	            49

In this dataframe, for example, I want to see only quiz_1 and quiz_2 columns, who scored more than 45 in the final.

I was able to filter using,

filter(row -> row.final > 45, df)
	name	quiz_1	quiz_2	presentation	final
String	    Int64	Int64	Int64	        Int64
1	b	    15	    16	    6	            46
2	c	    13	    16	    6	            47
3	d	    14	    17	    8	            49

Here, all the columns are shown in the output, but I can’t figure out how to get the result only of quiz_1 and quiz_2 columns. Can anyone help me out?

Hi, these are some basic options:
df[df.final.>45,[:quiz_1,:quiz_2]]
filter(row -> row.final > 45, df)[:,[:quiz_1,:quiz_2]]
DataFrame(i[[:quiz_1,:quiz_2]] for i in eachrow(df) if i.final>45)

2 Likes

You can also use DataFramesMeta.jl for dplyr-like syntax

@rsubset df :final > 45 # r stands for row-wise
1 Like

Thanks a lot for sharing the solution!

Thanks for the suggestion! Going through the documentation, pretty dplyr-like syntax indeed.

While the effect is minimal in this example, you can gain a bit of efficiency by passing only the column(s) you’re interested in to filter i.e.

julia> @btime filter(row -> row.final>45, df)
  18.120 μs (25 allocations: 1.88 KiB)

vs

julia> @btime filter(:final => f-> f>45, df)
  15.324 μs (24 allocations: 1.81 KiB)

for multiple

julia> @btime filter(row -> row.name=="b" && row.final>45, df)
  19.188 μs (25 allocations: 1.80 KiB)

vs

julia> @btime filter([:name, :final] => (n,f)-> n=="b" && f>45, df)
  18.529 μs (29 allocations: 1.98 KiB)
2 Likes

FWIW, for the specific OP’s dataframe example the type of solution proposed by AndiMD seems to run faster than all other alternatives shown:

@btime $df[$df.final .> 45, [:quiz_1,:quiz_2]]   # 1.320 μs (26 allocs: 1.7 KiB)
@btime $df[$df.name .== "b" .&& $df.final .> 45,[:quiz_1,:quiz_2]] # 1.610 μs (30 allocs: 1.8 KiB)
2 Likes