Dataframe Filter

janilin · March 25, 2022, 10:14pm

Hi!
I was trying to filter data and view the result of some selective columns, but can’t figure out.

I created a dataframe first,

using DataFrames
df = DataFrame(
    name = ["a", "b", "c", "d"], 
    quiz_1 = [14,15,13,14], 
    quiz_2 = [18,16,16,17], 
    presentation = [7,6,6,8], 
    final = [42,46,47,49])

name	quiz_1	quiz_2	presentation	final
String	Int64	Int64	Int64	        Int64
1	a	14	    18	    7	            42
2	b	15	    16	    6	            46
3	c	13	    16	    6	            47
4	d	14	    17	    8	            49

In this dataframe, for example, I want to see only quiz_1 and quiz_2 columns, who scored more than 45 in the final.

I was able to filter using,

filter(row -> row.final > 45, df)

	name	quiz_1	quiz_2	presentation	final
String	    Int64	Int64	Int64	        Int64
1	b	    15	    16	    6	            46
2	c	    13	    16	    6	            47
3	d	    14	    17	    8	            49

Here, all the columns are shown in the output, but I can’t figure out how to get the result only of quiz_1 and quiz_2 columns. Can anyone help me out?

AndiMD · March 25, 2022, 10:37pm

Hi, these are some basic options:
df[df.final.>45,[:quiz_1,:quiz_2]]
filter(row -> row.final > 45, df)[:,[:quiz_1,:quiz_2]]
DataFrame(i[[:quiz_1,:quiz_2]] for i in eachrow(df) if i.final>45)

pdeffebach · March 26, 2022, 12:02am

You can also use DataFramesMeta.jl for dplyr-like syntax

@rsubset df :final > 45 # r stands for row-wise

janilin · March 26, 2022, 5:27am

Thanks a lot for sharing the solution!

janilin · March 26, 2022, 5:32am

Thanks for the suggestion! Going through the documentation, pretty dplyr-like syntax indeed.

lawless-m · March 26, 2022, 7:17am

While the effect is minimal in this example, you can gain a bit of efficiency by passing only the column(s) you’re interested in to filter i.e.

julia> @btime filter(row -> row.final>45, df)
  18.120 μs (25 allocations: 1.88 KiB)

vs

julia> @btime filter(:final => f-> f>45, df)
  15.324 μs (24 allocations: 1.81 KiB)

for multiple

julia> @btime filter(row -> row.name=="b" && row.final>45, df)
  19.188 μs (25 allocations: 1.80 KiB)

vs

julia> @btime filter([:name, :final] => (n,f)-> n=="b" && f>45, df)
  18.529 μs (29 allocations: 1.98 KiB)

rafael.guerra · March 26, 2022, 7:38am

FWIW, for the specific OP’s dataframe example the type of solution proposed by AndiMD seems to run faster than all other alternatives shown:

@btime $df[$df.final .> 45, [:quiz_1,:quiz_2]]   # 1.320 μs (26 allocs: 1.7 KiB)
@btime $df[$df.name .== "b" .&& $df.final .> 45,[:quiz_1,:quiz_2]] # 1.610 μs (30 allocs: 1.8 KiB)

Topic		Replies	Views
Dataframe filtering based on conditions comparing one column to other columns New to Julia indexing , dataframes	6	2045	November 5, 2021
Is it possible to use filter(:col => myFilter, df) with additional constant input variables? New to Julia question , dataframes	12	788	January 5, 2022
DataFrame Filtering New to Julia question , dataframes	10	1094	September 20, 2023
What is the recommended way to filter rows of a Dataframe? Performance dataframes	6	299	July 23, 2024
DataFrames: obtaining the subset of rows by a set of values New to Julia dataframes	45	24001	April 27, 2024

Dataframe Filter

Related topics