~TidierData.jl~ - Apply Transform to Many Columns

I have a dataset with a bunch of columns with string data that have the format “(1.0, 2.0)”. All of these columns end with the postfix “CI”. I’d like to apply a function that splits the numbers into the low and high value and parses them as floats. Then I’d like to replace the original columns with $(original_name)_low and $(original_name)_low.

I know how I would do this with DataFrames.jl DSL, but I’d really like to achieve this with TiderData.jl because it makes my code more accessible for my R peers.

Any chance it’s possible without hard coding all the column names?

2 Likes

here is one possible solution where you turn the string into a tuple or an array and then use @unnest_wider.
@kdpsingh may be able to offer others that are better too

using TidierData, TidierStrings
# this function assumes theyre always (lo, hi)
function str_to_array(s::String)
        s_clean = strip(s, ['(', ')'])
        return parse.(Float64, split(s_clean, ","))
end

df = DataFrame(
    a_CI = ["(1.0, 2.0)", "(2.5, 3.5)", "(3.2, 4.1)"],
    b_CI = ["(0.1, 1.1)", "(0.5, 1.5)", "(1.2, 2.1)"],
)

@chain df begin
  @mutate(across(ends_with("CI"), (x -> str_to_array.(x))))
  @unnest_wider(ends_with("function"))
  @rename_with(x-> str_replace(x, "function1" ,"low"))
  @rename_with(x-> str_replace(x, "function2" ,"high"))
end
3×6 DataFrame
 Row │ a_CI        b_CI        a_CI_low  a_CI_high  b_CI_low  b_CI_high 
     │ String      String      Float64   Float64    Float64   Float64   
─────┼──────────────────────────────────────────────────────────────────
   1 │ (1.0, 2.0)  (0.1, 1.1)       1.0        2.0       0.1        1.1
   2 │ (2.5, 3.5)  (0.5, 1.5)       2.5        3.5       0.5        1.5
   3 │ (3.2, 4.1)  (1.2, 2.1)       3.2        4.1       1.2        2.1

this line could be used instead of two @rename_withs like above. I could possibly add a str_replace_many to TidierStrings for the future as well.

@chain df begin
  @mutate(across(ends_with("CI"), (x -> str_to_array.(x))))
  @unnest_wider(ends_with("function"))
  @rename_with(x-> replace(x, "function1" => "low", "function2" => "high"))
end
2 Likes

Now that you have a solution with TidierData I hope it’s not annoying to post a solution with (disclaimer: my package) DataFrameMacros, it’s a lot of fun to try and solve these data wrangling “code golfing” problems and I was happy to find a one-liner for this:

julia> using DataFrameMacros

julia> df = DataFrame(
           a_CI = ["(1.0, 2.0)", "(2.5, 3.5)", "(3.2, 4.1)"],
           b_CI = ["(0.1, 1.1)", "(0.5, 1.5)", "(1.2, 2.1)"],
       )
3×2 DataFrame
 Row │ a_CI        b_CI
     │ String      String
─────┼────────────────────────
   1 │ (1.0, 2.0)  (0.1, 1.1)
   2 │ (2.5, 3.5)  (0.5, 1.5)
   3 │ (3.2, 4.1)  (1.2, 2.1)

julia> @select df string.({}, ["_low", "_high"]) =
           [parse(Float64, m.match) for m in eachmatch(r"\d+\.?\d+", {All()})]
3×4 DataFrame
 Row │ a_CI_low  a_CI_high  b_CI_low  b_CI_high
     │ Float64   Float64    Float64   Float64
─────┼──────────────────────────────────────────
   1 │      1.0        2.0       0.1        1.1
   2 │      2.5        3.5       0.5        1.5
   3 │      3.2        4.1       1.2        2.1

The string.({}, ["_low", "_high"]) part expands to [["a_CI_low", "a_CI_high"], ["b_CI_low", "b_CI_high"]] so in DataFrames minilanguage it’s like having [:a_CI, :b_CI] .=> the_function .=> [["a_CI_low", "a_CI_high"], ["b_CI_low", "b_CI_high"]].

7 Likes

Here’s how you could use basic DataFrame, borrowing your function to extract numbers from strings.

p(s)=[parse(Float64, m.match) for m in eachmatch(r"\d+\.?\d+", s)]

transform(df, All().=>ByRow(p).=>x->[x*"_l",x*"_h"])

onelinear

transform(df, All().=>ByRow(s->[parse(Float64, m.match) for m in eachmatch(r"\d+\.?\d+", s)]).=>x->[x*"_l",x*"_h"])

if you only want the transformed columns

select(df, All().=>ByRow(s->[parse(Float64, m.match) for m in
 eachmatch(r"\d+\.?\d+", s)]).=>x->[x*"_l",x*"_h"])

parsing in a different way

select(df, All().=>ByRow(s->parse.(Float64, split(s[2:end-1],','))) .=>x->[x*"_l",x*"_h"])
select(df, All().=>ByRow(s->parse.(Float64,split(s[2:end-1],',')))
.=>n->n .*["_l","_h"])
1 Like

Thanks for the question and for all the different ways people have shared to do this.

Here’s one way to do this using TidierData without the need to hardcode column names.

using TidierData, TidierStrings

# define the DataFrame
 df = DataFrame(
                  a_CI = ["(1.0, 2.0)", "(2.5, 3.5)", "(3.2, 4.1)"],
                  b_CI = ["(0.1, 1.1)", "(0.5, 1.5)", "(1.2, 2.1)"],
              )

# fixes a bug that will be addressed in the next version of TidierData
push!(TidierData.not_escaped[], :Meta)

# the code
@chain df begin
    @summarize(across(everything(), x -> eval.(Meta.parse.(x))))
    @summarize(across(everything(), (function low(x) minimum.(x) end,
                                     function high(x) maximum.(x) end)))
    @rename_with(x -> str_remove_all(x, "_function"), everything())
end

This gives the following output:

3×4 DataFrame
 Row │ a_CI_low  b_CI_low  a_CI_high  b_CI_high 
     │ Float64   Float64   Float64    Float64   
─────┼──────────────────────────────────────────
   1 │      1.0       0.1        2.0        1.1
   2 │      2.5       0.5        3.5        1.5
   3 │      3.2       1.2        4.1        2.1

If you’re less picky with the column names, this code can be simplified further.

The order of the columns is a bit clunky but am planning to fix that eventually (i.e., functions take precedence over columns in across()).

2 Likes