I have a data frame that is formatted like the example below.
Row │ Pairwise p_adj boo Factor IntBoo
│ String Float64 Bool String Int64
─────┼────────────────────────────────────────────────────────────────────────────
1 │ Backwoods Blueberries Coville-12… 0.999599 false Gall Weight 0
2 │ Backwoods Blueberries Coville-12… 1.0 false Gall Length 0
3 │ Backwoods Blueberries Coville-12… 0.804792 false Gall Width 0
4 │ Backwoods Blueberries Coville-12… 2.63132e-8 true BodySize 1
5 │ Backwoods Blueberries Coville-12… 0.0822928 false Egg Count 0
6 │ Backwoods Blueberries Coville-12… 3.66398e-11 true Egg Volume 1
7 │ Backwoods Blueberries Coville-12… 0.993763 false Sex Ratio 0
8 │ Backwoods Blueberries Coville-12… 4.27341e-6 true Parasitoid 1
9 │ Backwoods Blueberries Jersey-128… 0.996053 false Gall Weight 0
10 │ Backwoods Blueberries Jersey-128… 0.583841 false Gall Length 0
What I want is for Julia to count up the number of significant findings (boo=true), how many total pairwise comparisons and a list of the factors that were significant.
Example;
Pairwise sig_fact tot_fact Factors that are significant
Backwoods Blueberries Coville-128th Northland 3 5 BodySize, Egg volume, Parasitoid
I have the first part done but am having a hard time getting the list of names for the significant factors. The intent is to export to an Excel spreadsheet then included in a Word Document
Mike Sergeant
combine(groupby(df, :Pairwise), [:Factor, :boo] => ((f, b) -> Ref(f[b])) => :Factors)
with MWE:
using DataFrames, Random;
pairWise = vcat(fill("p1", 5), fill("p2", 2));
boo = rand(Bool, 7);
factors = rand(["f1", "f2", "f3"], 7);
df = DataFrame(:Pairwise => pairWise, :boo => boo, :Factor => factors)
which gives
7×3 DataFrame
Row │ Pairwise boo Factor
│ String Bool String
─────┼─────────────────────────
1 │ p1 true f1
2 │ p1 false f2
3 │ p1 true f1
4 │ p1 false f3
5 │ p1 true f3
6 │ p2 true f3
7 │ p2 false f2
combine(groupby(df, :Pairwise), [:Factor, :boo] => ((f, b) -> Ref(f[b])) => :Factors)
2×2 DataFrame
Row │ Pairwise Factors
│ String Array…
─────┼──────────────────────────────
1 │ p1 ["f1", "f1", "f3"]
2 │ p2 ["f3"]
(I did not enforce that the factors are unique for each pairwise value).
With DataFramesMeta I think this is pretty elegant
julia> df_collaped = @chain df begin
@by :Pairwise begin
:num_factors = length(:boo)
:num_sig_factors = sum(:boo)
:sig_factors = Ref(:Factor[:boo])
end
end
Personally it helps me if I define the summary function first and then apply it, e.g.
join_if(x, c) = join(x[c], ", ")
vec_if(x, c) = [x[c]] # or Ref(x[c])? Both seemed to work.
Then split-apply-combine in some package:
using DataFrames
combine(
groupby(df, :Pairwise),
:Factor => length => :tot_fact,
:boo => sum => :sig_fact,
[:Factor, :boo] => join_if => "Factors that are significant"
)
using XLSX, DataFrames
df = DataFrame(XLSX.readtable("dataframe manipulation.xlsx", "Foglio1";infer_eltypes=true))
df.IntBoo = Bool.(df.IntBoo)
gdf = groupby(df, :Pairwise)
combine(gdf, :IntBoo => sum =>:n_sig_fac,[:IntBoo,:Factor]=>((y,x)->[x[y]])=>:sig_fac,
:Factor=>(x->length(unique(first.(split.(x)))))=>:tot_fac)
julia> combine(gdf, :IntBoo => sum =>:n_sig_fac,[:IntBoo,:Factor]=>((y,x)->[x[y]])=>:sig_fac,
:Factor=>(x->length(unique(first.(split.(x)))))=>:tot_fac)
2×4 DataFrame
Row │ Pairwise n_sig_fac sig_fac tot_fac
│ String Int64 Array… Int64
─────┼────────────────────────────────────────────────────────────────────────────────────────
1 │ BackwoodsBlueberriesCoville-12… 3 ["BodySize", "Egg Volume", "Para… 5
2 │ BackwoodsBlueberriesJersey-128… 0 String[] 1
julia> df.boo = df.boo .== "true"
10-element BitVector:
0
0
0
1
0
1
0
1
0
0
julia> tot_fac(x)=length(unique(first.(split.(x))))
tot_fac (generic function with 1 method)
julia> combine(gdf, [:boo,:Factor]=>((y,x)->[[x[y]] sum(y) tot_fac(x)])=>[:sig_fac,:n_sig_fac,:tot_fac])
2×4 DataFrame
Row │ Pairwise sig_fac n_sig_fac tot_fac
│ String Any Any Any
─────┼────────────────────────────────────────────────────────────────────────────────────────
1 │ BackwoodsBlueberriesCoville-12… ["BodySize", "Egg Volume", "Para… 3 5
2 │ BackwoodsBlueberriesJersey-128… String[] 0 1
I notice that the type of the 3 columns resulting from the transformation is Any and not Int64 and Array[String].
I think it would be preferable to try the automatic conversion.
It seems like a significant fact to me.
Has anyone formed an opinion on this?
This is not surprising. You collect the new columns in a Vector{Any} here: [[x[y]] sum(y) tot_fac(x)]). If you had used a named tuple with AsTable you would not have had this problem.