Pretty basic question it seems to me, but I can’t figure out how to iterate across multiple dataframes. I’ve been trying to do this with Query.jl
Problem: First Dataframe contains names. Second Dataframe contains names. Count the number of times the names in First DataFrame appear in Second Dataframe.
using Query, DataFrames
df_one = DataFrame(name=["John", "Sally", "Kirk"])
df_two = DataFrame(name=["Sally", "Sally", "Kirk", "John", "Kirk", "Kirk", "John", "Sally", "Sally"])
name_tot = @from i in df_one begin
@from j in df_two
@where j.name == i.name
@let count = length(j.name)
@select count
@gather DataFrame
end
println(name_tot)
I’ve tried a few different variations, but none return the desired result; a DataFrame showing the counts of each name in Second Dataframe.
I’m sure it’s just a noob question and I I’m not understanding something fundamental.
@from i in df_one begin
@join j in df_two on i.name equals j.name into k
@select {i.name, count=length(k)}
@collect DataFrame
end
# output
3×2 DataFrames.DataFrame
│ Row │ name │ count │
├─────┼─────────┼───────┤
│ 1 │ "John" │ 2 │
│ 2 │ "Sally" │ 4 │
│ 3 │ "Kirk" │ 3 │
The kind of formulation you used will do an inner join: the two @from clauses generate the cross-product, and then the @where clause will filter it down to what you would have gotten from an inner join right away.
And so I was. Thanks for the solution and such a quick response. It looks amazingly like the examples in the documentation I’ve been starting at for the last hour. Didn’t even consider a group join as the way forward.