Accessing full dataframe after a join within a Query.jl query

This is transferring my question from https://github.com/davidanthoff/Query.jl/issues/104#issuecomment-291908459 as suggested by @davidanthoff. I have two dataframes that look like this:

df1 = DataFrame(id=[1,2,3], rel_freq=[0.5, 1, 1.5], class=["a", "a", "c"])
df2 = DataFrame(id=[1,2,3,4], rel_freq=[0.75, 0.75, 1.75, 1], class=["a", "a", "c", "c"])
println(df1)
println(df2)

#output

3×3 DataFrames.DataFrame
│ Row │ id │ rel_freq │ class │
├─────┼────┼──────────┼───────┤
│ 1   │ 1  │ 0.5      │ "a"   │
│ 2   │ 2  │ 1.0      │ "a"   │
│ 3   │ 3  │ 1.5      │ "c"   │
4×3 DataFrames.DataFrame
│ Row │ id │ rel_freq │ class │
├─────┼────┼──────────┼───────┤
│ 1   │ 1  │ 0.75     │ "a"   │
│ 2   │ 2  │ 0.75     │ "a"   │
│ 3   │ 3  │ 1.75     │ "c"   │
│ 4   │ 4  │ 1.0      │ "c"   │

I want to calculate pvalues using a Mann Whitney test between the diffs of the rel_freqs for all classes and compare them to class c.

##Desired approach (single query)

The following is my attempt to do this in one query:

x = @from i in df1 begin
    @join j in df2 on i.id equals j.id
    @select {i.id, log2fc=log2(get(j.rel_freq)/get(i.rel_freq)), i.class} into k
    @let negcontrol = @from l in k begin # try to build a list of log2fc of the negative controls
        @where contains(l.class, "c")
        @select l.log2fc
        @collect
    end
    @group k by k.gene into g
    @let test = MannWhitneyUTest(map(j->get(j.log2fc), g), negcontrol) #compare this group versus neg controls
    @select {
        gene=g.key, 
        avg_freq=mean(j->j.log2fc, g),
        pval= -log10(pvalue(test)) # calculate p value for this group versus neg controls
    }
    @collect DataFrame
end

The main problem seems to be that I select all rows of the joined dataframe that contain “c” for class and then create an array of the log2fc column for all of them to use in the Mann-Whitney test.

Working approach (multiple separate queries)

Here’s code that gives the correct result, but doesn’t do so in a single query:

x = @from i in df1 begin
    @join j in df2 on i.id equals j.id
    @select {i.id, log2fc=log2(get(j.rel_freq)/get(i.rel_freq)), i.class}
    @collect DataFrame
end

y = @from i in x begin
    @where contains(get(i.class), "c")
    @select i.log2fc
    @collect 
end

z = @from i in x begin
    @group i by i.class into g
    @where !contains(get(g[1].class), "c")
    @let test = MannWhitneyUTest(map(j->j.log2fc, g), y)
    @select {
        gene = get(g.key), 
        med_log2fc = median(map(j->j.log2fc, g)),
        pval = -log10(pvalue(test)),
    }
    @collect DataFrame
end

Desired output

z
# output

1×3 DataFrames.DataFrame
│ Row │ gene │ med_log2fc │ pval      │
├─────┼──────┼────────────┼───────────┤
│ 1   │ "a"  │ 0.0849625  │ -0.124939 │

Are you using Query master, or the last released version?

I’m on v0.3.1 on top of julia 0.5.1

I just upgraded to the latest release.

I don’t think this can be done in one query, and your three query version is almost certainly the most efficient way to do this.

Your line that starts with @let negcontrol = @from l in k begin will certainly not work: k is the new range variable that will refer to the current row, so using k as the source for another query won’t work. A query of this form:

@from i in X begin
@select i into k
...

will be rewritten as

temp1 = @from i in X begin
    @select i
end
@from k in temp1 begin 
...
end