~TidierData.jl~ - Apply Transform to Many Columns

I have a dataset with a bunch of columns with string data that have the format “(1.0, 2.0)”. All of these columns end with the postfix “CI”. I’d like to apply a function that splits the numbers into the low and high value and parses them as floats. Then I’d like to replace the original columns with $(original_name)_low and $(original_name)_low.

I know how I would do this with DataFrames.jl DSL, but I’d really like to achieve this with TiderData.jl because it makes my code more accessible for my R peers.

Any chance it’s possible without hard coding all the column names?

2 Likes

here is one possible solution where you turn the string into a tuple or an array and then use @unnest_wider.
@kdpsingh may be able to offer others that are better too

using TidierData, TidierStrings
# this function assumes theyre always (lo, hi)
function str_to_array(s::String)
        s_clean = strip(s, ['(', ')'])
        return parse.(Float64, split(s_clean, ","))
end

df = DataFrame(
    a_CI = ["(1.0, 2.0)", "(2.5, 3.5)", "(3.2, 4.1)"],
    b_CI = ["(0.1, 1.1)", "(0.5, 1.5)", "(1.2, 2.1)"],
)

@chain df begin
  @mutate(across(ends_with("CI"), (x -> str_to_array.(x))))
  @unnest_wider(ends_with("function"))
  @rename_with(x-> str_replace(x, "function1" ,"low"))
  @rename_with(x-> str_replace(x, "function2" ,"high"))
end
3×6 DataFrame
 Row │ a_CI        b_CI        a_CI_low  a_CI_high  b_CI_low  b_CI_high 
     │ String      String      Float64   Float64    Float64   Float64   
─────┼──────────────────────────────────────────────────────────────────
   1 │ (1.0, 2.0)  (0.1, 1.1)       1.0        2.0       0.1        1.1
   2 │ (2.5, 3.5)  (0.5, 1.5)       2.5        3.5       0.5        1.5
   3 │ (3.2, 4.1)  (1.2, 2.1)       3.2        4.1       1.2        2.1

this line could be used instead of two @rename_withs like above. I could possibly add a str_replace_many to TidierStrings for the future as well.

@chain df begin
  @mutate(across(ends_with("CI"), (x -> str_to_array.(x))))
  @unnest_wider(ends_with("function"))
  @rename_with(x-> replace(x, "function1" => "low", "function2" => "high"))
end
2 Likes

Now that you have a solution with TidierData I hope it’s not annoying to post a solution with (disclaimer: my package) DataFrameMacros, it’s a lot of fun to try and solve these data wrangling “code golfing” problems and I was happy to find a one-liner for this:

julia> using DataFrameMacros

julia> df = DataFrame(
           a_CI = ["(1.0, 2.0)", "(2.5, 3.5)", "(3.2, 4.1)"],
           b_CI = ["(0.1, 1.1)", "(0.5, 1.5)", "(1.2, 2.1)"],
       )
3×2 DataFrame
 Row │ a_CI        b_CI
     │ String      String
─────┼────────────────────────
   1 │ (1.0, 2.0)  (0.1, 1.1)
   2 │ (2.5, 3.5)  (0.5, 1.5)
   3 │ (3.2, 4.1)  (1.2, 2.1)

julia> @select df string.({}, ["_low", "_high"]) =
           [parse(Float64, m.match) for m in eachmatch(r"\d+\.?\d+", {All()})]
3×4 DataFrame
 Row │ a_CI_low  a_CI_high  b_CI_low  b_CI_high
     │ Float64   Float64    Float64   Float64
─────┼──────────────────────────────────────────
   1 │      1.0        2.0       0.1        1.1
   2 │      2.5        3.5       0.5        1.5
   3 │      3.2        4.1       1.2        2.1

The string.({}, ["_low", "_high"]) part expands to [["a_CI_low", "a_CI_high"], ["b_CI_low", "b_CI_high"]] so in DataFrames minilanguage it’s like having [:a_CI, :b_CI] .=> the_function .=> [["a_CI_low", "a_CI_high"], ["b_CI_low", "b_CI_high"]].

7 Likes

Here’s how you could use basic DataFrame, borrowing your function to extract numbers from strings.

p(s)=[parse(Float64, m.match) for m in eachmatch(r"\d+\.?\d+", s)]

transform(df, All().=>ByRow(p).=>x->[x*"_l",x*"_h"])

onelinear

transform(df, All().=>ByRow(s->[parse(Float64, m.match) for m in eachmatch(r"\d+\.?\d+", s)]).=>x->[x*"_l",x*"_h"])

if you only want the transformed columns

select(df, All().=>ByRow(s->[parse(Float64, m.match) for m in
 eachmatch(r"\d+\.?\d+", s)]).=>x->[x*"_l",x*"_h"])

parsing in a different way

select(df, All().=>ByRow(s->parse.(Float64, split(s[2:end-1],','))) .=>x->[x*"_l",x*"_h"])
select(df, All().=>ByRow(s->parse.(Float64,split(s[2:end-1],',')))
.=>n->n .*["_l","_h"])
1 Like