LINQ inner joins and select all in Query.jl

mkarikom · September 13, 2020, 5:54pm

This MWE is supposed to do an inner join (i.a equals j.c || i.b equals j.e) and output all the elements of the first df:

using DataFrames, Query

df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])
df2 = DataFrame(c=[2,4,2], d=["John", "Jim","Sally"],e=[1.,1.,3.])

x = @from i in df1 begin
    @join j in df2 on i.a equals j.c || i.b equals j.e
    @select i
    @collect DataFrame
end

println(x)

[1] top-level scope at /home/au/Downloads/scratch.jl:23
ERROR: syntax: "j.e" is not a valid function argument name
Stacktrace:
[1] top-level scope at /home/au/Downloads/scratch.jl:7

xiaodai · September 14, 2020, 4:31am

I am trying to solve it.

BTW, what attracted you to use Query.jl in the first place? Is it because it sounds like the equivalent of tidyverse in Julia?

xiaodai · September 14, 2020, 4:38am

It doesn’t look possible. Better to ask on the Query.jl GitHub repo.

mkarikom · September 14, 2020, 6:49am

This works (new example data so correct result is not empty):

using DataFrames, Query, Test

df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])
df2 = DataFrame(c=[2,4,3], d=["John", "Jim","Sally"],e=[1.,1.,3.])

x = @from i in df1 begin
    @join j in df2 on (i.a,i.b) equals (j.c,j.e)
    @select i
    @collect DataFrame
end

@test x[1,:] == df1[3,:]

mkarikom · September 14, 2020, 6:58am

Mainly:

- The structured query interface turns my blocks of ugly data-munging code into concise (often reusable) statements.
- By facilitating joins across all kinds of data, it lets me avoid issues stemming from loading huge flat files (I have an obscene amount of these) into memory to get at some random little bit of information.

xiaodai · September 14, 2020, 11:23am

I’d say between DataFrames.jl, DataFramesMeta.jl, Pipe.jl, and DataConvenience.jl (for its export of Lazy.jl’s @> macro), you have everything covered. If I am not wrong, the above is just an inner join, which is a one-liner in DataFrames.jl:

x = innerjoin(df1, df2, on = [:a => :c, :b => :e])
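
For comparison, here is a rough sketch of how both conditions from this thread might look in plain DataFrames.jl. It reuses the example data from the post above; the variable names (`both`, `semi`, `allpairs`, `either`) are made up for illustration. The `on = [...]` form matches rows where both key pairs are equal (an AND), like the tuple join. The `||` version from the original question is written here as a cross join followed by a filter, which is an assumption about the intended semantics and not something Query.jl or DataFrames.jl provides as a dedicated OR join.

```julia
using DataFrames

df1 = DataFrame(a=[1, 2, 3], b=[1.0, 2.0, 3.0])
df2 = DataFrame(c=[2, 4, 3], d=["John", "Jim", "Sally"], e=[1.0, 1.0, 3.0])

# AND condition: a row matches only if a == c and b == e.
both = innerjoin(df1, df2, on = [:a => :c, :b => :e])

# Keep only the columns of df1, roughly what `@select i` does in the query.
only_df1 = select(both, names(df1))

# semijoin returns the rows of df1 that have a match in df2,
# without adding df2's columns and without repeating a df1 row
# when several df2 rows match it.
semi = semijoin(df1, df2, on = [:a => :c, :b => :e])

# OR condition: build every (df1 row, df2 row) pair, then filter.
# This materializes nrow(df1) * nrow(df2) rows, so it only suits small tables.
allpairs = crossjoin(df1, df2)
either = filter(r -> r.a == r.c || r.b == r.e, allpairs)
either_df1 = select(either, names(df1))
```

Like an inner join, the OR version repeats a df1 row once per matching df2 row; call `unique` on the result if each df1 row should appear at most once.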

aplavin · September 14, 2020, 12:59pm

As far as I understand, a major difference of those packages is that they materialize the dataframe after each operation, be it mapping/filtering/etc. Query.jl is lazy in this sense, and only collects the results when needed.
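
A small sketch of what that looks like in practice, assuming Query.jl's standalone `@filter` and `@map` operators and a toy table: the piped query is an iterable description of the work, and a DataFrame is only built when the result is explicitly collected at the end.

```julia
using DataFrames, Query

df1 = DataFrame(a=[1, 2, 3], b=[1.0, 2.0, 3.0])

# Chaining the operators builds a query object, not a DataFrame.
q = df1 |> @filter(_.a > 1) |> @map({_.a, _.b})

# The result is materialized only when it is collected into a sink.
result = q |> DataFrame
```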

xiaodai · September 14, 2020, 2:59pm

That might be true. But performance-wise, I couldn’t get good group-by performance from Query.jl, and I think that’s a limitation of its design. So personally, I don’t see myself using it, but I want to understand why people use it.
