LINQ inner joins and select all in Query.jl

This MWE is supposed to

  1. do an inner join (i.a equals j.c || i.b equals j.e) and
  2. output all the elements of the first df:
using DataFrames, Query

df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])
df2 = DataFrame(c=[2,4,2], d=["John", "Jim", "Sally"], e=[1.,1.,3.])

x = @from i in df1 begin
    @join j in df2 on i.a equals j.c || i.b equals j.e
    @select i
    @collect DataFrame
end

println(x)

[1] top-level scope at /home/au/Downloads/scratch.jl:23
ERROR: syntax: "j.e" is not a valid function argument name
Stacktrace:
[1] top-level scope at /home/au/Downloads/scratch.jl:7

I am trying to solve it.

BTW, what attracted you to Query.jl in the first place? Is it because it sounds like the equivalent of tidyverse in Julia?

It doesn’t look like this is possible. Better to ask on the Query.jl GitHub repo.

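This works (new example data so that the correct result is not empty). Note that joining on tuples matches rows where both keys are equal, so it is an AND rather than an OR: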

using DataFrames, Query, Test

df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])
df2 = DataFrame(c=[2,4,3], d=["John", "Jim", "Sally"], e=[1.,1.,3.])

x = @from i in df1 begin
    @join j in df2 on (i.a,i.b) equals (j.c,j.e)
    @select i
    @collect DataFrame
end

@test x[1,:] == df1[3,:]

Mainly:

  1. the structured query interface turns my blocks of ugly data-munging code into concise (often reusable) statements
  2. by facilitating joins across all kinds of data, it lets me avoid issues stemming from loading huge flat files (I have an obscene amount of these) into memory to get at some random little bit of information

I’d say that between DataFrames.jl, DataFramesMeta.jl, Pipe.jl, and DataConvenience.jl (for its export of Lazy.jl’s @> macro), you have everything covered. If I am not wrong, the above is just an inner join, which is a one-liner in DataFrames.jl:

x = innerjoin(df1, df2, on = [:a => :c, :b => :e])
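
One caveat: `innerjoin` with two `on` pairs keeps only rows where both keys match, which is not the OR condition from the original question. A minimal sketch of the OR variant in plain DataFrames.jl, assuming the tables are small enough that a full cross join is acceptable, could look like this (the names `all_pairs`, `matched` and `x_or` are just illustrative):

```julia
using DataFrames

# Data from the original question.
df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])
df2 = DataFrame(c=[2,4,2], d=["John", "Jim", "Sally"], e=[1.,1.,3.])

# Build every (df1 row, df2 row) combination.
# This costs nrow(df1) * nrow(df2) rows, so it only suits small inputs.
all_pairs = crossjoin(df1, df2)

# Keep the combinations where either key matches.
matched = filter(r -> r.a == r.c || r.b == r.e, all_pairs)

# Keep only the columns that came from df1.
x_or = select(matched, :a, :b)
```

As with any inner join, a row of `df1` shows up once per matching row of `df2`, so the result can contain repeated `df1` rows.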

As far as I understand, a major difference is that those packages materialize the DataFrame after each operation, be it mapping, filtering, etc. Query.jl is lazy in this sense, and only collects the results when needed.
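
To illustrate what I mean by lazy, here is a small sketch, assuming I understand the Query.jl behavior correctly: a query written without `@collect` is only a description of the work, and the rows are produced when it is handed to a sink such as `DataFrame` (the names `q` and `result` are just for this example):

```julia
using DataFrames, Query

df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])

# Without @collect, the query is only described here; no result table is built yet.
q = @from i in df1 begin
    @where i.a > 1
    @select i
end

# Materialize once, at the end, by handing the lazy query to a sink.
result = q |> DataFrame
```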

That might be true. But performance-wise, I couldn’t get good group-by performance from Query.jl, and I think it’s a limitation of its design. So personally, I don’t see myself using it, but I want to understand why people use it.