Add colum to dataframe with values based on various conditions

daharn · August 30, 2022, 6:00pm

I have the following structure:

A DataFrame containing parameters and results from numeric simulations.
A Vector{Dict} object. Each Dict contains keys equivalent to some of the columns of the DataFrame and the corresponding values.
A mapping of sort that assigns some result to each case in the list of Dicts. This mapping is hard-coded, so I manually assign a value for each case

MWE:

using DataFrames

df = DataFrame(:a => [1,2,3,1], :b => [5,6,7,8], :c =>[9,10,11,9])
cases = [Dict(:a => 1, :c => 9), Dict(:a => 3, :c => 11)]
mapping = Dict(case => value for (case,value) in zip(cases,["result1","result2"]))

What I would like to have in the end is:

df_final = DataFrame(:a => [1,2,3,1], :b => [5,6,7,8], :c =>[9,10,11,9], :result => ["result1", missing,"result2","result1"])

4×4 DataFrame
 Row │ a      b      c      value   
     │ Int64  Int64  Int64  String? 
─────┼──────────────────────────────
   1 │     1      5      9  result1
   2 │     2      6     10  missing 
   3 │     3      7     11  result2
   4 │     1      8      9  result1

So essentially I would like to loop through the list of cases, find all rows of df where all parameters are identical to the current case (this will be several rows) and then set the value of the result column accordingly. It would be nice if I wouldn’t have to explicitly code down all of the parameters to check for.

A very ugly hack to achieve this would be

for (case,value) in mapping
    @view(df[vec(all(hcat((df[!,k] .== case[k] for k in keys(case))...),dims=2)),:]).value .= value
end

but I would much rather get there with DataFrames or DataFramesMeta syntax.

I could also try to create a GroupedDataFrame first based on the keys from the case Dict:

gdf = groupby(df, [keys(cases[1])...])

But then I still have to map each group to its corresponding Dict/result, if it exists, which is kind of the wrong way round of the loop.

Does anyone have advice on how to best achieve this?

bkamins · August 30, 2022, 6:09pm

Do you want this?

julia> df2 = DataFrame(cases)
2×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      9
   2 │     3     11

julia> df2.value = ["result1", "result2"]
2-element Vector{String}:
 "result1"
 "result2"

julia> leftjoin(df, df2, on=[:a, :c])
4×4 DataFrame
 Row │ a      b      c      value
     │ Int64  Int64  Int64  String?
─────┼──────────────────────────────
   1 │     1      5      9  result1
   2 │     3      7     11  result2
   3 │     1      8      9  result1
   4 │     2      6     10  missing

daharn · August 30, 2022, 6:14pm

Wow, I knew there would be something a lot more nicer out there! Sorry I obviously haven’t paid enough attention to the joining methods.

~~Two~~ One little caveat though: Can the row order of df be maintained? ~~and can the value of on be implicitly derived from cases?~~ → it can by on=[keys(cases[1])...]

bkamins · August 30, 2022, 6:26pm

Currently not. This is something we will likely add in 1.5 release. For now add :id column numbered 1,2,… and sort the data frame afterwards.

The point is that both left and right tables in general can have duplicates and then row order is not uniquely defined. We will need to make a decision which row order to use in such cases, but it has not been implemented yet.

The on value can be derived from the cases as you have written.

bkamins · August 30, 2022, 6:32pm

One more comment. If you are sure that right table does not have duplicates you can do:

julia> leftjoin!(df, df2, on=[:a, :c])
4×4 DataFrame
 Row │ a      b      c      value
     │ Int64  Int64  Int64  String?
─────┼──────────────────────────────
   1 │     1      5      9  result1
   2 │     2      6     10  missing
   3 │     3      7     11  result2
   4 │     1      8      9  result1

which maintains row order (it does in-place update so if you want a fresh data frame do leftjoin!(copy(df), df2, on=[:a, :c]))

daharn · August 31, 2022, 9:19am

Perfect, thanks a lot for this very elegant solution!

Topic		Replies	Views
Change column to row using conditions New to Julia dataframes	4	456	June 23, 2022
Efficient way to add a value of column to all rows conditioned on another column by group Data dataframes , dataframesmeta	3	972	November 20, 2021
Combining elements from multiple rows by conditionals into columns with DataFramesMeta General Usage dataframes , dataframesmeta	10	352	July 5, 2023
Creating one column from multiple DataFrames General Usage dataframes	3	391	March 18, 2020
Take data from one DataFrame to another based on a conditional General Usage dataframes	5	1202	August 15, 2022

Add colum to dataframe with values based on various conditions

Related topics