Compare data from two dataframes

Sandy45 · August 16, 2024, 6:27am

I have two dataframes as shown below and I need to create csv file or json output like shown below.

I need to get dominant row for each product for each date from two dataframes then compare them to generate below output.

Ex: df has product “ABCDEF” with categories “123” and “456” for month of May 2024 and “total” value for them respectively is “15” and “25” since 25 is higher my dominant category for product “ABCDEF” is “456”.

In similar lines df2 has product “ABCDEF” with categories “232” and “456” for month of May 2024 and “total” value for them respectively is “38” and “25” since 38 is higher my dominant category for product “ABCDEF” is “232”.

As we have dominant row from both dataframes we need to compare these two rows to find some differences respective to product, category and total fields. We need to display output as shown below with three rows for each product for each month.

product,Changetype,From_052024,To_052024
ABCDEF,Category,456(25),232(38)
ABCDEF,product,expired,new4
ABCDEF,total,63,40

In the first row above we will show dominant category from df and it’s count(value from total column) in the brackets under from section similarly under to section we show dominant category from df2 and it’s count in the brackets.

In the second section we are comparing category in df and df2 if they are not equal then we will mark “from_” column as expired and “to_” column as new.

In third row we are printing “total” for that “product” irrespective of “category” for that month.

(or)

JSON response can be below format:

[
  {
    "product": "ABCDEF",
    "Changetype": "Category",
    "From_May-2024": "456(25)",
    "To_May-2024": "232(28)",
    "From_Sep-2024": "xxx(xxx)",
    "To_Sep-2024": "xxx(xx)"
  },
  {
    "product": "ABCDEF",
    "Changetype": "product",
    "From_May-2024": "expired",
    "To_May-2024": "new",
    "From_Sep-2024": "xxx",
    "To_Sep-2024": "xxx"
  },
  {
    "product": "ABCDEF",
    "Changetype": "total",
    "From_May-2024": "63",
    "To_May-2024": "40",
    "From_Sep-2024": "xxx",
    "To_Sep-2024": "xxx"
  }
]

julia> df
5×4 DataFrame
 Row │ product  category  dte     total 
     │ String   String    Int64   Int64 
─────┼──────────────────────────────────
   1 │ ABCDEF   123       202405     15 
   2 │ XYZDEF   256       202403     20 
   3 │ ABCDEF   456       202405     25 
   4 │ WERTYU   789       202404     16 
   5 │ XYZDEF   261       202403     18

julia> df2
5×4 DataFrame
 Row │ product  category  dte     total 
     │ String   String    Int64   Int64 
─────┼──────────────────────────────────
   1 │ ABCDEF   232       202405     38
   2 │ XYZDEF   256       202403     20
   3 │ ABCDEF   456       202405     25
   4 │ WERTYU   789       202404     16
   5 │ XYZDEF   272       202403     48

ufechner7 · August 16, 2024, 6:34am

What is your question? What did you try so far?

Sandy45 · August 16, 2024, 6:46am

My apologies as I forgot to put up note. I know that we could achieve this using dictionaries by writing multiple for loops and trying to check if this is the efficient approach or if we have any better solution to this. I am trying to avoid writing loops as my df can be huge and would slow down my app response.

nilshg · August 17, 2024, 1:57pm

I’m not sure I fully understand your question but I think you are looking for leftjoin followed by a groupby over product.

Topic		Replies	Views
Is there a package to do DataFrames comparisons? General Usage dataframes	4	1305	August 3, 2021
Comparison 2 data sets Finance and Economics testing , data	6	806	April 5, 2022
DataFrame query General Usage dataframes	7	319	August 27, 2022
Differences in Two DataFrames General Usage dataframes	5	899	November 11, 2022
Dataframes difficult joins Data data , dataframes , data_structures	3	600	February 10, 2021

Compare data from two dataframes

Related topics