Group by Multi Column Dataframe to share ID? Maybe?

theboiwholived · October 13, 2022, 9:23pm

Hey ladies and gents!

I’ve been searching high and low for the best way to do a multi column groupby, i think.

Essentially i have a dataframe. with a multitude of columns but only 2 that matter in this contex.

I do a

g = groupby(df, :currencyA);
insertcols!(df, 1, :idA => g.groups)

which gets me a new column of id per currencyA column. something like

idA  currencyA    currency B
1        USD                    JPY
2       EUR                     USD
3       JPY                     EUR

I want to essentially create a second IdB column that follows the Id pattern generated from groupby of currencyA based on matching values in currencyB.

For example :

idA  currencyA    idB     currency B
1       USD                   3             JPY
2      EUR                    1             USD
3      JPY                     2             EUR

so IdB assigns the Id’s from A with the matching currency. Usd remains id 1 vs Eur remains id 2 etc. So i could track these currencies in whatever column by the same Id ref #.

if i were to groupby currencyB all over again japan would have in IdB a index of 1 and the others would follow sequentially.
I need to be able to track the same Id value based on the value of the currency. It probably makes no sense why i need this but unfort i do

REALLY appreciate the help. Been banging my head for a few days now.

pdeffebach · October 13, 2022, 9:28pm

I don’t understand. Whats wrong with grouping again?

digital_carver · October 13, 2022, 9:50pm

You can create a dictionary of currency to ID mapping outside of the dataframe:

julia> df = DataFrame(currencyA = rand(["USD", "EUR", "JPY"], 8), currencyB =  rand(["USD", "EUR", "JPY"], 8))
8×2 DataFrame
 Row │ currencyA  currencyB 
     │ String     String    
─────┼──────────────────────
   1 │ JPY        JPY
   2 │ USD        JPY
   3 │ JPY        JPY
   4 │ USD        EUR
   5 │ JPY        EUR
   6 │ EUR        EUR
   7 │ USD        USD
   8 │ JPY        JPY

julia> curr_ids = Dict(map(enumerate(unique(df.currencyA))) do (idx, currency)
         currency => idx
       end)
Dict{String, Int64} with 3 entries:
  "EUR" => 3
  "JPY" => 1
  "USD" => 2

and then use that to generate the id columns in one go:

julia> transform(df, [:currencyA, :currencyB] .=> ByRow(c -> curr_ids[c]) .=> [:idA, :idB])
8×4 DataFrame
 Row │ currencyA  currencyB  idA    idB   
     │ String     String     Int64  Int64 
─────┼────────────────────────────────────
   1 │ JPY        JPY            1      1
   2 │ USD        JPY            2      1
   3 │ JPY        JPY            1      1
   4 │ USD        EUR            2      3
   5 │ JPY        EUR            1      3
   6 │ EUR        EUR            3      3
   7 │ USD        USD            2      2
   8 │ JPY        JPY            1      1

theboiwholived · October 14, 2022, 3:35am

Amazing ! Cheers legend, thank you so much !

rafael.guerra · October 14, 2022, 7:20am

Just a complement to the nice solution above, which might be easier for some to read:

d = Dict(reverse.(enumerate(unique(df.currencyA))))
df.idA = [d[x] for x in df.currencyA]
df.idB = [d[x] for x in df.currencyB]

rocco_sprmnt21 · October 14, 2022, 9:36am

using some internal …


df = DataFrame(cA = rand(["USD", "EUR", "JPY"], 8), cB =  rand(["USD", "EUR", "JPY"], 8))

g=groupby(df,:cA)

df.idA=groupindices(g)

transform(df, :cB=>ByRow(r->g.keymap[tuple(r)])=>:idB)

Without paying too much attention to efficiency …

df.r=1:nrow(df)

df.idA=groupindices(groupby(sort!(df,:cA),:cA))

df.idB=groupindices(groupby(sort!(df,:cB),:cB))

sort!(df,:r)

jules · October 14, 2022, 11:37am

With DataFrameMacros.jl you can do both columns in one operation, it allows multi-column expressions.

julia> using DataFrames, DataFrameMacros

julia> df = DataFrame(
           currencyA = rand(["USD", "EUR", "JPY"], 8),
           currencyB =  rand(["USD", "EUR", "JPY"], 8)
       )
8×2 DataFrame
 Row │ currencyA  currencyB 
     │ String     String    
─────┼──────────────────────
   1 │ JPY        JPY
   2 │ USD        USD
   3 │ USD        JPY
   4 │ JPY        USD
   5 │ EUR        JPY
   6 │ JPY        USD
   7 │ JPY        USD
   8 │ JPY        JPY

julia> d = Dict(reverse.(enumerate(unique(df.currencyA))))
Dict{String, Int64} with 3 entries:
  "EUR" => 3
  "JPY" => 1
  "USD" => 2

julia> @transform(df, [:idA, :idB] = d[{[:currencyA, :currencyB]}])
8×4 DataFrame
 Row │ currencyA  currencyB  idA    idB   
     │ String     String     Int64  Int64 
─────┼────────────────────────────────────
   1 │ JPY        JPY            1      1
   2 │ USD        USD            2      2
   3 │ USD        JPY            2      1
   4 │ JPY        USD            1      2
   5 │ EUR        JPY            3      1
   6 │ JPY        USD            1      2
   7 │ JPY        USD            1      2
   8 │ JPY        JPY            1      1

Or even just:

julia> @transform(df, [:idA, :idB] = d[{All()}])
8×4 DataFrame
 Row │ currencyA  currencyB  idA    idB   
     │ String     String     Int64  Int64 
─────┼────────────────────────────────────
   1 │ JPY        JPY            1      1
   2 │ USD        USD            2      2
   3 │ USD        JPY            2      1
   4 │ JPY        USD            1      2
   5 │ EUR        JPY            3      1
   6 │ JPY        USD            1      2
   7 │ JPY        USD            1      2
   8 │ JPY        JPY            1      1

Topic		Replies	Views
Reduce DataFrame by unique value General Usage dataframes	5	1043	October 14, 2022
Create new column with first row of each id General Usage dataframes	1	265	February 9, 2022
DataFrame Groupby New to Julia dataframes	2	2143	April 26, 2018
Generating group id column in DataFrames? Data dataframes	17	1309	July 19, 2022
JuliaDB groupby many columns General Usage juliadb	0	361	April 9, 2021

Group by Multi Column Dataframe to share ID? Maybe?

Related topics