This is transferring my question from https://github.com/davidanthoff/Query.jl/issues/104#issuecomment-291908459 as suggested by @davidanthoff. I have two dataframes that look like this:
df1 = DataFrame(id=[1,2,3], rel_freq=[0.5, 1, 1.5], class=["a", "a", "c"])
df2 = DataFrame(id=[1,2,3,4], rel_freq=[0.75, 0.75, 1.75, 1], class=["a", "a", "c", "c"])
println(df1)
println(df2)
#output
3×3 DataFrames.DataFrame
│ Row │ id │ rel_freq │ class │
├─────┼────┼──────────┼───────┤
│ 1 │ 1 │ 0.5 │ "a" │
│ 2 │ 2 │ 1.0 │ "a" │
│ 3 │ 3 │ 1.5 │ "c" │
4×3 DataFrames.DataFrame
│ Row │ id │ rel_freq │ class │
├─────┼────┼──────────┼───────┤
│ 1 │ 1 │ 0.75 │ "a" │
│ 2 │ 2 │ 0.75 │ "a" │
│ 3 │ 3 │ 1.75 │ "c" │
│ 4 │ 4 │ 1.0 │ "c" │
I want to calculate pvalues using a Mann Whitney test between the diffs of the rel_freqs
for all classes and compare them to class c
.
##Desired approach (single query)
The following is my attempt to do this in one query:
x = @from i in df1 begin
@join j in df2 on i.id equals j.id
@select {i.id, log2fc=log2(get(j.rel_freq)/get(i.rel_freq)), i.class} into k
@let negcontrol = @from l in k begin # try to build a list of log2fc of the negative controls
@where contains(l.class, "c")
@select l.log2fc
@collect
end
@group k by k.gene into g
@let test = MannWhitneyUTest(map(j->get(j.log2fc), g), negcontrol) #compare this group versus neg controls
@select {
gene=g.key,
avg_freq=mean(j->j.log2fc, g),
pval= -log10(pvalue(test)) # calculate p value for this group versus neg controls
}
@collect DataFrame
end
The main problem seems to be that I select all rows of the joined dataframe that contain “c” for class and then create an array of the log2fc
column for all of them to use in the Mann-Whitney test.
Working approach (multiple separate queries)
Here’s code that gives the correct result, but doesn’t do so in a single query:
x = @from i in df1 begin
@join j in df2 on i.id equals j.id
@select {i.id, log2fc=log2(get(j.rel_freq)/get(i.rel_freq)), i.class}
@collect DataFrame
end
y = @from i in x begin
@where contains(get(i.class), "c")
@select i.log2fc
@collect
end
z = @from i in x begin
@group i by i.class into g
@where !contains(get(g[1].class), "c")
@let test = MannWhitneyUTest(map(j->j.log2fc, g), y)
@select {
gene = get(g.key),
med_log2fc = median(map(j->j.log2fc, g)),
pval = -log10(pvalue(test)),
}
@collect DataFrame
end
Desired output
z
# output
1×3 DataFrames.DataFrame
│ Row │ gene │ med_log2fc │ pval │
├─────┼──────┼────────────┼───────────┤
│ 1 │ "a" │ 0.0849625 │ -0.124939 │