Add colum to dataframe with values based on various conditions

I have the following structure:

  • A DataFrame containing parameters and results from numeric simulations.
  • A Vector{Dict} object. Each Dict contains keys equivalent to some of the columns of the DataFrame and the corresponding values.
  • A mapping of sort that assigns some result to each case in the list of Dicts. This mapping is hard-coded, so I manually assign a value for each case

MWE:

using DataFrames

df = DataFrame(:a => [1,2,3,1], :b => [5,6,7,8], :c =>[9,10,11,9])
cases = [Dict(:a => 1, :c => 9), Dict(:a => 3, :c => 11)]
mapping = Dict(case => value for (case,value) in zip(cases,["result1","result2"]))

What I would like to have in the end is:

df_final = DataFrame(:a => [1,2,3,1], :b => [5,6,7,8], :c =>[9,10,11,9], :result => ["result1", missing,"result2","result1"])

4Γ—4 DataFrame
 Row β”‚ a      b      c      value   
     β”‚ Int64  Int64  Int64  String? 
─────┼──────────────────────────────
   1 β”‚     1      5      9  result1
   2 β”‚     2      6     10  missing 
   3 β”‚     3      7     11  result2
   4 β”‚     1      8      9  result1

So essentially I would like to loop through the list of cases, find all rows of df where all parameters are identical to the current case (this will be several rows) and then set the value of the result column accordingly. It would be nice if I wouldn’t have to explicitly code down all of the parameters to check for.

A very ugly hack to achieve this would be

for (case,value) in mapping
    @view(df[vec(all(hcat((df[!,k] .== case[k] for k in keys(case))...),dims=2)),:]).value .= value
end

but I would much rather get there with DataFrames or DataFramesMeta syntax.

I could also try to create a GroupedDataFrame first based on the keys from the case Dict:

gdf = groupby(df, [keys(cases[1])...])

But then I still have to map each group to its corresponding Dict/result, if it exists, which is kind of the wrong way round of the loop.

Does anyone have advice on how to best achieve this?

Do you want this?

julia> df2 = DataFrame(cases)
2Γ—2 DataFrame
 Row β”‚ a      c
     β”‚ Int64  Int64
─────┼──────────────
   1 β”‚     1      9
   2 β”‚     3     11

julia> df2.value = ["result1", "result2"]
2-element Vector{String}:
 "result1"
 "result2"

julia> leftjoin(df, df2, on=[:a, :c])
4Γ—4 DataFrame
 Row β”‚ a      b      c      value
     β”‚ Int64  Int64  Int64  String?
─────┼──────────────────────────────
   1 β”‚     1      5      9  result1
   2 β”‚     3      7     11  result2
   3 β”‚     1      8      9  result1
   4 β”‚     2      6     10  missing

Wow, I knew there would be something a lot more nicer out there! Sorry I obviously haven’t paid enough attention to the joining methods.

Two One little caveat though: Can the row order of df be maintained? and can the value of on be implicitly derived from cases? β†’ it can by on=[keys(cases[1])...]

Currently not. This is something we will likely add in 1.5 release. For now add :id column numbered 1,2,… and sort the data frame afterwards.

The point is that both left and right tables in general can have duplicates and then row order is not uniquely defined. We will need to make a decision which row order to use in such cases, but it has not been implemented yet.

The on value can be derived from the cases as you have written.

One more comment. If you are sure that right table does not have duplicates you can do:

julia> leftjoin!(df, df2, on=[:a, :c])
4Γ—4 DataFrame
 Row β”‚ a      b      c      value
     β”‚ Int64  Int64  Int64  String?
─────┼──────────────────────────────
   1 β”‚     1      5      9  result1
   2 β”‚     2      6     10  missing
   3 β”‚     3      7     11  result2
   4 β”‚     1      8      9  result1

which maintains row order (it does in-place update so if you want a fresh data frame do leftjoin!(copy(df), df2, on=[:a, :c]))

1 Like

Perfect, thanks a lot for this very elegant solution!