Dividing a Dataframe

I have a DataFrame:

 Row │ VarietyA                           VarietyB                       Variety.p.adj 
     │ String                             String                         Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ Backwoods Blueberries Coville      128th Northland                   0.0011105
   2 │ Backwoods Blueberries Jersey       128th Northland                   0.98911
   3 │ Bazan Duke                         128th Northland                   0.559359
   4 │ Bazan Jersey                       128th Northland                   0.746563
   5 │ Brown Site BlueJay                 128th Northland                   0.00315533
   6 │ Brown Site Jersey                  128th Northland                   0.0131208
   7 │ Otis Lake V. corymbosum            128th Northland                   0.183493
   8 │ Russells Blueberry Farm and Book…  128th Northland                   0.0792741
   9 │ Talsma Aurora                      128th Northland                   0.946847
  10 │ Talsma BlueJay                     128th Northland                   0.858454
  11 │ Talsma Liberty                     128th Northland                   0.00410839
  12 │ Wolf Lake V. angustifolium         128th Northland                   0.360707
  13 │ Backwoods Blueberries Jersey       Backwoods Blueberries Coville     0.0715494
  14 │ Bazan Duke                         Backwoods Blueberries Coville     0.0858854
  15 │ Bazan Jersey                       Backwoods Blueberries Coville     0.795704
  16 │ Brown Site BlueJay                 Backwoods Blueberries Coville     1.27e-7
  17 │ Brown Site Jersey                  Backwoods Blueberries Coville     5.15e-7
  18 │ Otis Lake V. corymbosum            Backwoods Blueberries Coville     3.91e-6
  19 │ Russells Blueberry Farm and Book…  Backwoods Blueberries Coville     0.999625
  20 │ Talsma Aurora                      Backwoods Blueberries Coville     2.34e-5

What I would like to do is find all the instances of Variety X (where X is any one of 13 varieties) in VarietyA and VarietyB and put them into a grouped DataFrame, similar to the table below.

VarietyA                                         VarietyB                       Variety.p.adj
Brown Site Jersey                                128th Northland                0.013120833
Brown Site Jersey                                Backwoods Blueberries Coville  5.15E-07
Brown Site Jersey                                Backwoods Blueberries Jersey   0.00564981
Brown Site Jersey                                Bazan Duke                     0.000177299
Brown Site Jersey                                Bazan Jersey                   0.005307901
Brown Site Jersey                                Brown Site BlueJay             0.999987571
Otis Lake V. corymbosum                          Brown Site Jersey              0.925359246
Russells Blueberry Farm and Book Barn Northland  Brown Site Jersey              0.000111835
Talsma Aurora                                    Brown Site Jersey              0.082719439
Talsma BlueJay                                   Brown Site Jersey              0.008174087
Talsma Liberty                                   Brown Site Jersey              1.26E-06
Wolf Lake V. angustifolium                       Brown Site Jersey              4.94E-05

The idea of this is to put all of one variety together to make it easier to interpret the results.

Thanks
Mike

But in this table, neither VarietyA nor VarietyB is unique. Do you want to sort the DataFrame or group it?

Perhaps filter([:VarietyA, :VarietyB] => (a, b) -> a == "Brown Site Jersey" || b == "Brown Site Jersey", df). Then you could sort with a custom lt that prefers “Brown Site Jersey” in each column.
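As a rough sketch of the filter-then-sort idea: the version below uses a `by` key inside `order` rather than a custom `lt`, which is a swapped-in but simpler way to push the chosen variety to the top of VarietyA. The four rows are copied from the tables in the question; the names `target`, `sub` and `sorted` are just placeholders for this example.

```julia
using DataFrames

# Stand-in data: four rows taken from the tables in the question.
df = DataFrame(
    "VarietyA" => ["Talsma Aurora", "Brown Site Jersey", "Bazan Duke", "Brown Site Jersey"],
    "VarietyB" => ["Brown Site Jersey", "Bazan Duke", "128th Northland", "128th Northland"],
    "Variety.p.adj" => [0.082719439, 0.000177299, 0.559359, 0.0131208],
)

target = "Brown Site Jersey"

# Keep rows where the target appears in either column.
sub = filter([:VarietyA, :VarietyB] => (a, b) -> a == target || b == target, df)

# `a != target` is false for the target, and false sorts before true,
# so rows with the target in VarietyA come first; ties are broken by VarietyB.
sorted = sort(sub, [order(:VarietyA, by = a -> a != target), :VarietyB])
```

Rows where the target only appears in VarietyB end up at the bottom, which is roughly the layout of the example table in the question.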

You won’t be able to group for each of your 13 varieties, because in a GroupedDataFrame, a row can only belong to a single group.
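One possible workaround, since groups cannot overlap, is to build a plain Dict with one sub-DataFrame per variety, so that each row is stored under both of its varieties. This is only a sketch on a small stand-in table (rows copied from the question); `varieties` and `by_variety` are made-up names, not anything DataFrames provides.

```julia
using DataFrames

# Stand-in data: four rows taken from the table in the question.
df = DataFrame(
    "VarietyA" => ["Brown Site Jersey", "Brown Site Jersey", "Talsma Aurora", "Bazan Duke"],
    "VarietyB" => ["128th Northland", "Backwoods Blueberries Coville", "128th Northland", "128th Northland"],
    "Variety.p.adj" => [0.0131208, 5.15e-7, 0.946847, 0.559359],
)

# Every variety that appears in either column.
varieties = sort(unique(vcat(df.VarietyA, df.VarietyB)))

# One sub-DataFrame per variety; a row is repeated under each variety it mentions.
by_variety = Dict(
    v => subset(df, [:VarietyA, :VarietyB] => ByRow((a, b) -> a == v || b == v))
    for v in varieties
)

by_variety["Brown Site Jersey"]  # all comparisons involving Brown Site Jersey
```

With the real data this would give 13 entries, one per variety, at the cost of storing every comparison twice.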

Oh, it’s an “or” condition. In that case:

mask = @. df.VarietyA == "Brown Site Jersey" || df.VarietyB == "Brown Site Jersey"
df[mask, :]

Will that do?

DataFrames has subset for that.

using DataFrames

df = DataFrame(a = ["apple", "banana", "cherry"], b = ["apple", "orange", "apricot"])
subset(df, [:a, :b] => ByRow((a, b) -> a == b))
1×2 DataFrame
 Row │ a       b      
     │ String  String 
─────┼────────────────
   1 │ apple   apple

Read this as “find the rows in df where column a and column b are the same”.

Either filter or subset works fine for this case. filter works row by row, so it doesn’t require ByRow here; subset works on whole columns, so it is more general. The DataFrames documentation for filter recommends subset:

This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the subset function instead as it is consistent with other DataFrames.jl functions (as opposed to filter).
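To make the row-versus-column difference concrete, here is a small sketch on the same toy table as above. The three result names are placeholders; the point is only that filter receives one value per column for each row, while subset receives whole columns unless the function is wrapped in ByRow.

```julia
using DataFrames

df = DataFrame(a = ["apple", "banana", "cherry"], b = ["apple", "orange", "apricot"])

# filter: the function is called once per row with scalar values.
by_filter = filter([:a, :b] => (a, b) -> a == b, df)

# subset with ByRow: same row-wise function, wrapped explicitly.
by_subset = subset(df, [:a, :b] => ByRow((a, b) -> a == b))

# subset without ByRow: the function sees whole columns, so it must broadcast.
by_subset_cols = subset(df, [:a, :b] => (a, b) -> a .== b)

by_filter == by_subset == by_subset_cols  # all three keep only the "apple" row
```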

See also Filter and Subset - Julia Data Science

With DataFramesMeta.jl you have:

df_sub = @rsubset df :VarietyA == "Brown Site Jersey" || :VarietyB == "Brown Site Jersey"