Group by Multi Column Dataframe to share ID? Maybe?

Hey ladies and gents!

I’ve been searching high and low for the best way to do a multi column groupby, i think.

Essentially i have a dataframe. with a multitude of columns but only 2 that matter in this contex.

I do a

g = groupby(df, :currencyA);
insertcols!(df, 1, :idA => g.groups)

which gets me a new column of id per currencyA column. something like

idA  currencyA    currency B
1        USD                    JPY
2       EUR                     USD
3       JPY                     EUR

I want to essentially create a second IdB column that follows the Id pattern generated from groupby of currencyA based on matching values in currencyB.

For example :

idA  currencyA    idB     currency B
1       USD                   3             JPY
2      EUR                    1             USD
3      JPY                     2             EUR

so IdB assigns the Id’s from A with the matching currency. Usd remains id 1 vs Eur remains id 2 etc. So i could track these currencies in whatever column by the same Id ref #.

if i were to groupby currencyB all over again japan would have in IdB a index of 1 and the others would follow sequentially.
I need to be able to track the same Id value based on the value of the currency. It probably makes no sense why i need this but unfort i do :confused:

REALLY appreciate the help. Been banging my head for a few days now.

I don’t understand. Whats wrong with grouping again?

You can create a dictionary of currency to ID mapping outside of the dataframe:

julia> df = DataFrame(currencyA = rand(["USD", "EUR", "JPY"], 8), currencyB =  rand(["USD", "EUR", "JPY"], 8))
8Γ—2 DataFrame
 Row β”‚ currencyA  currencyB 
     β”‚ String     String    
─────┼──────────────────────
   1 β”‚ JPY        JPY
   2 β”‚ USD        JPY
   3 β”‚ JPY        JPY
   4 β”‚ USD        EUR
   5 β”‚ JPY        EUR
   6 β”‚ EUR        EUR
   7 β”‚ USD        USD
   8 β”‚ JPY        JPY

julia> curr_ids = Dict(map(enumerate(unique(df.currencyA))) do (idx, currency)
         currency => idx
       end)
Dict{String, Int64} with 3 entries:
  "EUR" => 3
  "JPY" => 1
  "USD" => 2

and then use that to generate the id columns in one go:

julia> transform(df, [:currencyA, :currencyB] .=> ByRow(c -> curr_ids[c]) .=> [:idA, :idB])
8Γ—4 DataFrame
 Row β”‚ currencyA  currencyB  idA    idB   
     β”‚ String     String     Int64  Int64 
─────┼────────────────────────────────────
   1 β”‚ JPY        JPY            1      1
   2 β”‚ USD        JPY            2      1
   3 β”‚ JPY        JPY            1      1
   4 β”‚ USD        EUR            2      3
   5 β”‚ JPY        EUR            1      3
   6 β”‚ EUR        EUR            3      3
   7 β”‚ USD        USD            2      2
   8 β”‚ JPY        JPY            1      1
1 Like

Amazing ! Cheers legend, thank you so much ! :slight_smile:

1 Like

Just a complement to the nice solution above, which might be easier for some to read:

d = Dict(reverse.(enumerate(unique(df.currencyA))))
df.idA = [d[x] for x in df.currencyA]
df.idB = [d[x] for x in df.currencyB]
1 Like

using some internal …


df = DataFrame(cA = rand(["USD", "EUR", "JPY"], 8), cB =  rand(["USD", "EUR", "JPY"], 8))

g=groupby(df,:cA)

df.idA=groupindices(g)

transform(df, :cB=>ByRow(r->g.keymap[tuple(r)])=>:idB)

Without paying too much attention to efficiency …

df.r=1:nrow(df)

df.idA=groupindices(groupby(sort!(df,:cA),:cA))

df.idB=groupindices(groupby(sort!(df,:cB),:cB))

sort!(df,:r)

With DataFrameMacros.jl you can do both columns in one operation, it allows multi-column expressions.

julia> using DataFrames, DataFrameMacros

julia> df = DataFrame(
           currencyA = rand(["USD", "EUR", "JPY"], 8),
           currencyB =  rand(["USD", "EUR", "JPY"], 8)
       )
8Γ—2 DataFrame
 Row β”‚ currencyA  currencyB 
     β”‚ String     String    
─────┼──────────────────────
   1 β”‚ JPY        JPY
   2 β”‚ USD        USD
   3 β”‚ USD        JPY
   4 β”‚ JPY        USD
   5 β”‚ EUR        JPY
   6 β”‚ JPY        USD
   7 β”‚ JPY        USD
   8 β”‚ JPY        JPY
julia> d = Dict(reverse.(enumerate(unique(df.currencyA))))
Dict{String, Int64} with 3 entries:
  "EUR" => 3
  "JPY" => 1
  "USD" => 2

julia> @transform(df, [:idA, :idB] = d[{[:currencyA, :currencyB]}])
8Γ—4 DataFrame
 Row β”‚ currencyA  currencyB  idA    idB   
     β”‚ String     String     Int64  Int64 
─────┼────────────────────────────────────
   1 β”‚ JPY        JPY            1      1
   2 β”‚ USD        USD            2      2
   3 β”‚ USD        JPY            2      1
   4 β”‚ JPY        USD            1      2
   5 β”‚ EUR        JPY            3      1
   6 β”‚ JPY        USD            1      2
   7 β”‚ JPY        USD            1      2
   8 β”‚ JPY        JPY            1      1

Or even just:

julia> @transform(df, [:idA, :idB] = d[{All()}])
8Γ—4 DataFrame
 Row β”‚ currencyA  currencyB  idA    idB   
     β”‚ String     String     Int64  Int64 
─────┼────────────────────────────────────
   1 β”‚ JPY        JPY            1      1
   2 β”‚ USD        USD            2      2
   3 β”‚ USD        JPY            2      1
   4 β”‚ JPY        USD            1      2
   5 β”‚ EUR        JPY            3      1
   6 β”‚ JPY        USD            1      2
   7 β”‚ JPY        USD            1      2
   8 β”‚ JPY        JPY            1      1
1 Like