Strategy for recoding

edvin · May 25, 2023, 10:02am

I have a number of similar (but slightly different variables) that I need to recode in order to harmonize over several different years. The numbers correspond to an occupational scheme, so they have no inherent meaning other then identification.

The following code does what I want, but it becomes very verbose as I need slight variations for each variable (x in this case). Does anyone have suggestions on how I could do this in a better/less verbose (but still clear) way?

df = DataFrame(x1 = vcat(1:15, missing),
               x2 = [11, 12, 21, 35, 36, 45, 46, 56, 57, 58, 59, 71, 72, 87, 88, 99])

       transform!(df, :x1 => ByRow(passmissing(x ->
              x == 1       ? 11 :
              x == 2       ? 22 :
              x == 3       ? 33 :
              x == 4       ? 36 :
              x == 5       ? 45 :
              x == 6       ? 56 :
              x ∈ [9, 10, 11]  ? 71 :
              x ∈ [7, 8]    ? 79 :
              Integer(x))) => :x1)

       transform!(df, :x2 => ByRow(passmissing(x ->
              x ∈ [11, 12] ? 11 :
              x ∈ [21, 22] ? 22 :
              x ∈ [35, 36] ? 36 :
              x ∈ [45, 46] ? 45 :
              x ∈ [56, 57, 58, 59, 60] ? 56 :
              x ∈ [71, 72, 73, 74] ? 71 :
              x ∈ [86, 87, 88, 89] ? 79 :
              x == 99 ? missing :
              Integer(x))) => :x2)

Thanks!

nilshg · May 25, 2023, 10:15am

I tend to keep these mappings seperate in a dictionary, and then get the values I need like this:

julia> codes = Dict(1 => 11, 2 => 22, 3 => 33, 4 => 36, 5 => 45, 6 => 56, ([9, 10, 11] .=> 71)..., ([7, 8] .=> 79)...)
Dict{Int64, Int64} with 11 entries:
  5  => 45
  7  => 79
  8  => 79
  1  => 11
  4  => 36
  6  => 56
  2  => 22
  10 => 71
  11 => 71
  9  => 71
  3  => 33

julia> get.((codes, ), df.x1, missing)
16-element Vector{Union{Missing, Int64}}:
 11
 22
 33
 36
 45
 56
 79
 79
 71
 71
 71
   missing
   missing
   missing
   missing
   missing

rocco_sprmnt21 · May 25, 2023, 1:50pm

It’s not clear to me if you’re looking for some pattern that can better synthesize the sequence of transforms or whatever.
Something like this achieves the same result as the example.
If there are no “regularities” in the transforms you use, your best bet is what Nilshg suggests

transform!(df, :x1 => ByRow(passmissing(x ->
            x ∈ [1,2,3] ? 11x :   
            x ∈ [4,5] ? 9x :      
              x == 6 ? 56 :
              x ∈ [9, 10, 11]  ? 71 :
              x ∈ [7, 8]    ? 79 :
              Integer(x))) => :x1)

edvin · May 26, 2023, 11:49am

Thanks for the answer! Why is the splat needed to make this work?

nilshg · May 26, 2023, 12:04pm

Because we want to turn [7, 8] .=> 79 into multiple arguments to the Dict constructor, in order to mix “normal” arguments like 1 => 11 with the vector arguments that use broadcasting. Consider:

julia> Dict([7, 8] .=> 79)
Dict{Int64, Int64} with 2 entries:
  7 => 79
  8 => 79

julia> Dict(5, [7, 8] .=> 79)
ERROR: MethodError: no method matching Dict(::Int64, ::Vector{Pair{Int64, Int64}})

Topic		Replies	Views
Replace values in a DataFrame column based on a Dict of corresponding values New to Julia question , dataframes	5	1571	June 4, 2021
Ifelse and datawrangling New to Julia control , dataframes	6	576	June 10, 2022
Rewriting map over Dict in julia 1.0 New to Julia	4	3016	April 22, 2020
Recode multiple columns in DataFrame as integer General Usage dataframes	8	467	February 19, 2022
Dict from dataframe General Usage dictionary , dataframes	5	1317	July 15, 2022

Strategy for recoding

Related topics