Dividing a Dataframe

SergeantMike67 · August 15, 2025, 3:07pm

I have a dataframe

 Row │ VarietyA                           VarietyB                       Variety.p.adj 
     │ String                             String                         Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ Backwoods Blueberries Coville      128th Northland                   0.0011105
   2 │ Backwoods Blueberries Jersey       128th Northland                   0.98911
   3 │ Bazan Duke                         128th Northland                   0.559359
   4 │ Bazan Jersey                       128th Northland                   0.746563
   5 │ Brown Site BlueJay                 128th Northland                   0.00315533
   6 │ Brown Site Jersey                  128th Northland                   0.0131208
   7 │ Otis Lake V. corymbosum            128th Northland                   0.183493
   8 │ Russells Blueberry Farm and Book…  128th Northland                   0.0792741
   9 │ Talsma Aurora                      128th Northland                   0.946847
  10 │ Talsma BlueJay                     128th Northland                   0.858454
  11 │ Talsma Liberty                     128th Northland                   0.00410839
  12 │ Wolf Lake V. angustifolium         128th Northland                   0.360707
  13 │ Backwoods Blueberries Jersey       Backwoods Blueberries Coville     0.0715494
  14 │ Bazan Duke                         Backwoods Blueberries Coville     0.0858854
  15 │ Bazan Jersey                       Backwoods Blueberries Coville     0.795704
  16 │ Brown Site BlueJay                 Backwoods Blueberries Coville     1.27e-7
  17 │ Brown Site Jersey                  Backwoods Blueberries Coville     5.15e-7
  18 │ Otis Lake V. corymbosum            Backwoods Blueberries Coville     3.91e-6
  19 │ Russells Blueberry Farm and Book…  Backwoods Blueberries Coville     0.999625
  20 │ Talsma Aurora                      Backwoods Blueberries Coville     2.34e-5

What I would like to do is find all the instances of Variety X (where X is any one of 13 varieties) in VarietyA and VarietyB and put them into a grouped dataframe, similar to the table below.

VarietyA	VarietyB	Variety.p.adj
Brown Site Jersey	128th Northland	0.013120833
Brown Site Jersey	Backwoods Blueberries Coville	5.15E-07
Brown Site Jersey	Backwoods Blueberries Jersey	0.00564981
Brown Site Jersey	Bazan Duke	0.000177299
Brown Site Jersey	Bazan Jersey	0.005307901
Brown Site Jersey	Brown Site BlueJay	0.999987571
Otis Lake V. corymbosum	Brown Site Jersey	0.925359246
Russells Blueberry Farm and Book Barn Northland	Brown Site Jersey	0.000111835
Talsma Aurora	Brown Site Jersey	0.082719439
Talsma BlueJay	Brown Site Jersey	0.008174087
Talsma Liberty	Brown Site Jersey	1.26E-06
Wolf Lake V. angustifolium	Brown Site Jersey	4.94E-05

The idea of this is to put all of one variety together to make it easier to interpret the results.

Thanks
Mike

jling · August 15, 2025, 3:22pm

But in this table, neither VarietyA nor VarietyB is unique? Do you want to sort the dataframe or group it?

Jeff_Emanuel · August 15, 2025, 3:28pm

Perhaps filter([:VarietyA, :VarietyB] => (a, b) -> a == "Brown Site Jersey" || b == "Brown Site Jersey", df). Then you could sort with a custom lt that prefers “Brown Site Jersey” in each column.

You won’t be able to group for each of your 13 varieties, because in a GroupedDataFrame, a row can only belong to a single group.
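
One possible workaround for the single-group limitation, sketched here under assumptions: stack two copies of the table, one keyed by VarietyA and one keyed by VarietyB, so that every comparison appears once under each of its two varieties, and then group on the new key. The toy data, the `Variety` key column, and the `p_adj` column name are all made up for illustration (the real column is `Variety.p.adj`).

```julia
using DataFrames

# Toy stand-in for the real table; varieties and p-values are invented.
df = DataFrame(
    VarietyA = ["B", "C", "C"],
    VarietyB = ["A", "A", "B"],
    p_adj    = [0.01, 0.50, 0.20],
)

# Copy each column into a hypothetical key column `Variety`, then stack the
# two copies so each row is listed under both of its varieties.
long = vcat(
    transform(df, :VarietyA => :Variety),
    transform(df, :VarietyB => :Variety),
)

# Now every row belongs to exactly one group, so groupby works.
gdf = groupby(long, :Variety)

# All comparisons that involve variety "B", whichever column it was in.
b_rows = gdf[(Variety = "B",)]
```

The cost is that the stacked table has twice as many rows as the original, which should be harmless if it is only used for reading off results per variety.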

jling · August 15, 2025, 3:35pm

oh it’s an or condition, in that case:

mask = (df.VarietyA .== "Brown Site Jersey") .| (df.VarietyB .== "Brown Site Jersey")
df[mask, :]

will do?

technocrat · August 15, 2025, 7:06pm

DataFrames has subset for that.

using DataFrames

df = DataFrame(a = ["apple", "banana", "cherry"], b = ["apple", "orange", "apricot"])
subset(df, [:a, :b] => ByRow((a, b) -> a == b))
1×2 DataFrame
 Row │ a       b      
     │ String  String 
─────┼────────────────
   1 │ apple   apple

Read this as “find the rows of df where column a and column b are the same”

Jeff_Emanuel · August 15, 2025, 7:51pm

Either filter or subset works fine for this case, but filter works by row, so it doesn’t require ByRow here. On the other hand, subset works by column, so it is more general. The DataFrames documentation encourages preference for subset:

This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the subset function instead as it is consistent with other DataFrames.jl functions (as opposed to filter).

See also Filter and Subset - Julia Data Science
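
To make the by-row versus by-column difference concrete, here is a small sketch on made-up data (the variety names and the `target` variable are placeholders, not the real table). All three calls should select the same rows.

```julia
using DataFrames

# Made-up data; "B" stands in for a variety such as "Brown Site Jersey".
df = DataFrame(VarietyA = ["B", "C", "C"], VarietyB = ["A", "A", "B"])
target = "B"

# filter passes one row's values at a time, so a plain scalar function works.
by_filter = filter([:VarietyA, :VarietyB] => (a, b) -> a == target || b == target, df)

# subset passes whole columns, so a scalar function has to be wrapped in ByRow...
by_subset = subset(df, [:VarietyA, :VarietyB] => ByRow((a, b) -> a == target || b == target))

# ...or written with broadcasting so that it returns a Boolean vector.
by_column = subset(df, [:VarietyA, :VarietyB] => (a, b) -> (a .== target) .| (b .== target))

by_filter == by_subset == by_column
```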

pdeffebach · August 15, 2025, 9:54pm

with DataFramesMeta.jl you have

df_sub = @rsubset df :VarietyA == "Brown Site Jersey" || :VarietyB == "Brown Site Jersey"
