~TidierData.jl~ - Apply Transform to Many Columns

mrufsvold · September 30, 2024, 4:09pm

I have a dataset with a bunch of columns with string data that have the format “(1.0, 2.0)”. All of these columns end with the postfix “CI”. I’d like to apply a function that splits the numbers into the low and high value and parses them as floats. Then I’d like to replace the original columns with $(original_name)_low and $(original_name)_low.

I know how I would do this with DataFrames.jl DSL, but I’d really like to achieve this with TiderData.jl because it makes my code more accessible for my R peers.

Any chance it’s possible without hard coding all the column names?

drizk1 · October 1, 2024, 8:47pm

here is one possible solution where you turn the string into a tuple or an array and then use @unnest_wider.
@kdpsingh may be able to offer others that are better too

using TidierData, TidierStrings
# this function assumes theyre always (lo, hi)
function str_to_array(s::String)
        s_clean = strip(s, ['(', ')'])
        return parse.(Float64, split(s_clean, ","))
end

df = DataFrame(
    a_CI = ["(1.0, 2.0)", "(2.5, 3.5)", "(3.2, 4.1)"],
    b_CI = ["(0.1, 1.1)", "(0.5, 1.5)", "(1.2, 2.1)"],
)

@chain df begin
  @mutate(across(ends_with("CI"), (x -> str_to_array.(x))))
  @unnest_wider(ends_with("function"))
  @rename_with(x-> str_replace(x, "function1" ,"low"))
  @rename_with(x-> str_replace(x, "function2" ,"high"))
end
3×6 DataFrame
 Row │ a_CI        b_CI        a_CI_low  a_CI_high  b_CI_low  b_CI_high 
     │ String      String      Float64   Float64    Float64   Float64   
─────┼──────────────────────────────────────────────────────────────────
   1 │ (1.0, 2.0)  (0.1, 1.1)       1.0        2.0       0.1        1.1
   2 │ (2.5, 3.5)  (0.5, 1.5)       2.5        3.5       0.5        1.5
   3 │ (3.2, 4.1)  (1.2, 2.1)       3.2        4.1       1.2        2.1

this line could be used instead of two @rename_withs like above. I could possibly add a str_replace_many to TidierStrings for the future as well.

@chain df begin
  @mutate(across(ends_with("CI"), (x -> str_to_array.(x))))
  @unnest_wider(ends_with("function"))
  @rename_with(x-> replace(x, "function1" => "low", "function2" => "high"))
end

jules · October 2, 2024, 8:23am

Now that you have a solution with TidierData I hope it’s not annoying to post a solution with (disclaimer: my package) DataFrameMacros, it’s a lot of fun to try and solve these data wrangling “code golfing” problems and I was happy to find a one-liner for this:

julia> using DataFrameMacros

julia> df = DataFrame(
           a_CI = ["(1.0, 2.0)", "(2.5, 3.5)", "(3.2, 4.1)"],
           b_CI = ["(0.1, 1.1)", "(0.5, 1.5)", "(1.2, 2.1)"],
       )
3×2 DataFrame
 Row │ a_CI        b_CI
     │ String      String
─────┼────────────────────────
   1 │ (1.0, 2.0)  (0.1, 1.1)
   2 │ (2.5, 3.5)  (0.5, 1.5)
   3 │ (3.2, 4.1)  (1.2, 2.1)

julia> @select df string.({}, ["_low", "_high"]) =
           [parse(Float64, m.match) for m in eachmatch(r"\d+\.?\d+", {All()})]
3×4 DataFrame
 Row │ a_CI_low  a_CI_high  b_CI_low  b_CI_high
     │ Float64   Float64    Float64   Float64
─────┼──────────────────────────────────────────
   1 │      1.0        2.0       0.1        1.1
   2 │      2.5        3.5       0.5        1.5
   3 │      3.2        4.1       1.2        2.1

The string.({}, ["_low", "_high"]) part expands to [["a_CI_low", "a_CI_high"], ["b_CI_low", "b_CI_high"]] so in DataFrames minilanguage it’s like having [:a_CI, :b_CI] .=> the_function .=> [["a_CI_low", "a_CI_high"], ["b_CI_low", "b_CI_high"]].

rocco_sprmnt21 · October 2, 2024, 7:41pm

Here’s how you could use basic DataFrame, borrowing your function to extract numbers from strings.

p(s)=[parse(Float64, m.match) for m in eachmatch(r"\d+\.?\d+", s)]

transform(df, All().=>ByRow(p).=>x->[x*"_l",x*"_h"])

onelinear

transform(df, All().=>ByRow(s->[parse(Float64, m.match) for m in eachmatch(r"\d+\.?\d+", s)]).=>x->[x*"_l",x*"_h"])

if you only want the transformed columns

select(df, All().=>ByRow(s->[parse(Float64, m.match) for m in
 eachmatch(r"\d+\.?\d+", s)]).=>x->[x*"_l",x*"_h"])

parsing in a different way

select(df, All().=>ByRow(s->parse.(Float64, split(s[2:end-1],','))) .=>x->[x*"_l",x*"_h"])

select(df, All().=>ByRow(s->parse.(Float64,split(s[2:end-1],',')))
.=>n->n .*["_l","_h"])

kdpsingh · October 12, 2024, 4:12am

Thanks for the question and for all the different ways people have shared to do this.

Here’s one way to do this using TidierData without the need to hardcode column names.

using TidierData, TidierStrings

# define the DataFrame
 df = DataFrame(
                  a_CI = ["(1.0, 2.0)", "(2.5, 3.5)", "(3.2, 4.1)"],
                  b_CI = ["(0.1, 1.1)", "(0.5, 1.5)", "(1.2, 2.1)"],
              )

# fixes a bug that will be addressed in the next version of TidierData
push!(TidierData.not_escaped[], :Meta)

# the code
@chain df begin
    @summarize(across(everything(), x -> eval.(Meta.parse.(x))))
    @summarize(across(everything(), (function low(x) minimum.(x) end,
                                     function high(x) maximum.(x) end)))
    @rename_with(x -> str_remove_all(x, "_function"), everything())
end

This gives the following output:

3×4 DataFrame
 Row │ a_CI_low  b_CI_low  a_CI_high  b_CI_high 
     │ Float64   Float64   Float64    Float64   
─────┼──────────────────────────────────────────
   1 │      1.0       0.1        2.0        1.1
   2 │      2.5       0.5        3.5        1.5
   3 │      3.2       1.2        4.1        2.1

If you’re less picky with the column names, this code can be simplified further.

The order of the columns is a bit clunky but am planning to fix that eventually (i.e., functions take precedence over columns in across()).

Topic		Replies	Views
DataFrame transform with many output columns General Usage dataframes	1	275	March 28, 2022
String column dataframe: replace for another string built from a substring from each row Data	1	1292	January 20, 2021
TidierData.jl ― Apply transform across columns while retaining access to other variables in the row General Usage tidier	7	182	June 27, 2025
Split Column in Dataframe New to Julia dataframes	5	2192	February 4, 2022
Expand row to multiple rows Data	11	1293	December 15, 2020

~TidierData.jl~ - Apply Transform to Many Columns

Related topics