Iterate across two DataFrames using Query.jl

query

#1

Pretty basic question it seems to me, but I can’t figure out how to iterate across multiple dataframes. I’ve been trying to do this with Query.jl

Problem: First Dataframe contains names. Second Dataframe contains names. Count the number of times the names in First DataFrame appear in Second Dataframe.

using Query, DataFrames

df_one = DataFrame(name=["John", "Sally", "Kirk"])
df_two = DataFrame(name=["Sally", "Sally", "Kirk", "John", "Kirk", "Kirk", "John", "Sally", "Sally"])

name_tot = @from i in df_one begin
    @from j in df_two
    @where j.name == i.name
    @let count = length(j.name)
    @select count
    @gather DataFrame
    end

println(name_tot)

I’ve tried a few different variations, but none return the desired result; a DataFrame showing the counts of each name in Second Dataframe.

I’m sure it’s just a noob question and I I’m not understanding something fundamental.


#2

I think you are looking for a group join:

@from i in df_one begin
@join j in df_two on i.name equals j.name into k
@select {i.name, count=length(k)}
@collect DataFrame                               
end

# output

3×2 DataFrames.DataFrame
│ Row │ name    │ count │
├─────┼─────────┼───────┤
│ 1   │ "John"  │ 2     │
│ 2   │ "Sally" │ 4     │
│ 3   │ "Kirk"  │ 3     │

The kind of formulation you used will do an inner join: the two @from clauses generate the cross-product, and then the @where clause will filter it down to what you would have gotten from an inner join right away.


#3

And so I was. Thanks for the solution and such a quick response. It looks amazingly like the examples in the documentation I’ve been starting at for the last hour. Didn’t even consider a group join as the way forward.

Thanks again.


#4

A related follow on… Not sure if it’s a bug or by design, but it would appear new column names can’t be exactly the same as range variables.

df_new = @from i in df_one begin
        @join j in df_two on i.name equals j.name into k
        @let count = length(k)
        @select {i.name, count=count}
        @collect DataFrame                               
        end
   
println(df_new)

#output

3×2 DataFrames.DataFrame
│ Row │ name    │ _2_ │
├─────┼─────────┼─────┤
│ 1   │ "John"  │ 2   │
│ 2   │ "Sally" │ 4   │
│ 3   │ "Kirk"  │ 3   │

I did expect that new column to be called “count” not “2”. Anything to the left of an “=” in a @select statement to be considered a column name?


#5

That looks like a bug to me: https://github.com/davidanthoff/Query.jl/issues/163