do an inner join (i.a equals j.c || i.b equals j.e) and
output all the elements of the first df:
using DataFrames, Query
df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])
df2 = DataFrame(c=[2,4,2], d=["John", "Jim", "Sally"], e=[1.,1.,3.])
x = @from i in df1 begin
    @join j in df2 on i.a equals j.c || i.b equals j.e
    @select i
    @collect DataFrame
end
println(x)
[1] top-level scope at /home/au/Downloads/scratch.jl:23
ERROR: syntax: "j.e" is not a valid function argument name
Stacktrace:
[1] top-level scope at /home/au/Downloads/scratch.jl:7
This works (new example data so correct result is not empty):
using DataFrames, Query, Test
df1 = DataFrame(a=[1,2,3], b=[1.,2.,3.])
df2 = DataFrame(c=[2,4,3], d=["John", "Jim", "Sally"], e=[1.,1.,3.])
x = @from i in df1 begin
    @join j in df2 on (i.a,i.b) equals (j.c,j.e)
    @select i
    @collect DataFrame
end
@test x[1,:] == df1[3,:]
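Note that the tuple key above matches only rows where both i.a == j.c and i.b == j.e hold, so it is an AND join rather than the OR asked for originally. As far as I can tell, `@join ... on ... equals ...` takes a single equality between key expressions, so an OR of two conditions cannot be written there. One possible workaround, sketched here with plain DataFrames.jl (`crossjoin`, `filter`, `select`) rather than Query.jl, is to build all row pairs and filter them. The names `allpairs` and `matched` are just placeholders, and the cross join materializes nrow(df1) * nrow(df2) rows, so this only suits small tables.

```julia
using DataFrames

# Data from the original (failing) example in this thread.
df1 = DataFrame(a=[1, 2, 3], b=[1.0, 2.0, 3.0])
df2 = DataFrame(c=[2, 4, 2], d=["John", "Jim", "Sally"], e=[1.0, 1.0, 3.0])

# Every (df1 row, df2 row) pair. The column names do not clash here,
# so no renaming is needed.
allpairs = crossjoin(df1, df2)

# Keep the pairs that satisfy either key condition (the OR join).
matched = filter(r -> r.a == r.c || r.b == r.e, allpairs)

# Output only the columns of the first data frame.
x = select(matched, names(df1))
println(x)
```

A row of df1 appears once per matching row of df2, as it would in an inner join, so duplicates are expected in the output.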
The structured query interface turns my blocks of ugly data-munging code into concise (and often reusable) statements.
By facilitating joins across all kinds of data, it lets me avoid issues stemming from loading huge flat files (I have an obscene amount of these) into memory just to get at some random little bit of information.
I’d say that between DataFrames.jl, DataFramesMeta.jl, Pipe.jl, and DataConvenience.jl (for its export of Lazy.jl’s @> macro), you have everything covered. If I am not wrong, the above is just an inner join, which is a one-liner in DataFrames.jl:
x = innerjoin(df1, df2, on = [:a => :c, :b => :e])
As far as I understand, a major difference is that those packages materialize the data frame after each operation, be it mapping, filtering, etc. Query.jl is lazy in this sense and only collects the results when needed.
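To make the eager vs. lazy distinction concrete, here is a small analogy using only Base Julia. It is not how Query.jl or DataFrames.jl are implemented, just a sketch of the general idea, and it assumes Julia 1.6 or later for `Iterators.map`.

```julia
data = collect(1:10)

# Eager: each step allocates a full intermediate array.
eager_filtered = filter(iseven, data)         # first Vector materialized here
eager_result = map(v -> v^2, eager_filtered)  # second Vector materialized here

# Lazy: the steps are only described; nothing is computed yet.
lazy_pipeline = Iterators.map(v -> v^2, Iterators.filter(iseven, data))

# The work happens once, when the result is collected.
lazy_result = collect(lazy_pipeline)

println(eager_result == lazy_result)
```

Both versions give the same values; the lazy one just skips the intermediate array, which matters more as the pipeline grows longer or the data gets larger.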
That might be true. But performance-wise, I couldn’t get good group-by performance from Query.jl, and I think that is a limitation of its design. So personally, I don’t see myself using it, but I want to understand why people use it.