Create DataFrame from data in DataFrameRow

Nathan_Boyer · November 8, 2024, 9:38pm

I’m stuck on this data transformation.

I have a DataFrameRow that I looked up from another table.

julia> using DataFramesMeta
julia> dfrow = DataFrame(
           "Carbon Min. (%)" => 0.1,
           "Carbon Max. (%)" => 0.4,
           "Phosphorus Min. (%)" => 0,
           "Phosphorus Max. (%)" => 0.015,
       )[1,:]
DataFrameRow
 Row │ Carbon Min. (%)  Carbon Max. (%)  Phosphorus Min. (%)  Phosphorus Max. (%)
     │ Float64          Float64          Int64                Float64
─────┼────────────────────────────────────────────────────────────────────────────
   1 │             0.1              0.4                    0                0.015

I want to create a new DataFrame with the data in the form below.

julia> df = DataFrame(
           "Element" => ["Carbon", "Phosphorus"],
           "Min. (%)" => [0.1, 0.0],
           "Max. (%)" => [0.4, 0.015],
       )
2×3 DataFrame
 Row │ Element    Min. (%)  Max. (%)
     │ String      Float64   Float64
─────┼────────────────────────────────
   1 │ Carbon           0.1     0.4
   2 │ Phosphorus       0.0     0.015

I extracted the list of elements for the first DataFrame column, but I am struggling with the syntax to move the relevant values from dfrow into df.

julia> elements = unique(first.(split.(names(dfrow)))
2-element Vector{SubString{String}}:
 "Carbon"
 "Phosphorus"

I should also mention that there are missing % values in the data.

pdeffebach · November 8, 2024, 10:13pm

Two relevant issues in DataFrames about this.

I don’t know how I would do this without some sort of loop and vcat

rocco_sprmnt21 · November 8, 2024, 10:59pm

a bit of a tortuous path, but it seems to get the desired shape while remaining within the context of the DataFrames package

sdf=stack(DataFrame(dfrow),names(dfrow))

sdf1=select(sdf,:value, :variable => ByRow(r->split(r,limit=2))=>[:Element,:mM,])

unstack(sdf1,:mM,:value)

TZ.Neumann · November 8, 2024, 11:51pm

With @pdeffebach inputs, I have got this working.

elements = unique(first.(split.(names(row), ' '))) 
empty_vector = []

for elem in elements
    df = DataFrame(
        "Element" => String[],
        "Min. (%)" => Union{Float64,Missing}[],
        "Max. (%)" => Union{Float64,Missing}[]
    )
    min_key, max_key = "$elem Min. (%)", "$elem Max. (%)"
    push!(df, (elem, get(dfrow, min_key, missing), get(dfrow, max_key, missing)))
    push!(empty_vector, df)
end
df_new = vcat(empty_vector...)

alfaromartino · November 9, 2024, 3:57am

I think the easiest solution is using stack and unstack.

using DataFrames
dfrow = DataFrame(
           "Carbon Min. (%)" => 0.1,
           "Carbon Max. (%)" => 0.4,
           "Phosphorus Min. (%)" => 0,
           "Phosphorus Max. (%)" => 0.015,
       )

# permute dimensions
dff = stack(dfrow, :)

# extract element and measures from columns
dff.element = [split(x, " ")[1] for x in dff.variable]
dff.measure = [split(x, " ")[2] for x in dff.variable]

# reshape
df = unstack(dff, :element, :measure, :value)

with output

2×3 DataFrame
 Row │ element     Min.      Max.     
     │ SubStrin…   Float64?  Float64? 
─────┼────────────────────────────────
   1 │ Carbon           0.1     0.4
   2 │ Phosphorus       0.0     0.015

EDIT: same as @rocco_sprmnt21 's solution

rocco_sprmnt21 · November 9, 2024, 6:24pm

unstack also takes care of handling missed values

julia>  dfrow = DataFrame(
                   "Carbon Min. (%)" => 0.1,
                   "Carbon Max. (%)" => 0.4,
                   #="Phosphorus Min. (%)" => 0,=#
                   "Phosphorus Max. (%)" => 0.015,
               )[1,:]
DataFrameRow
 Row │ Carbon Min. (%)  Carbon Max. (%)  Phosphorus Max. (%) 
     │ Float64          Float64          Float64
─────┼───────────────────────────────────────────────────────
   1 │             0.1              0.4                0.015

julia> sdf=stack(DataFrame(dfrow))
3×2 DataFrame
 Row │ variable             value   
     │ String               Float64
─────┼──────────────────────────────
   1 │ Carbon Min. (%)        0.1
   2 │ Carbon Max. (%)        0.4
   3 │ Phosphorus Max. (%)    0.015

julia> sdf1=select(sdf,:value, :variable => ByRow(r->split(r,limit=2))=>[:Element,:mM,])
3×3 DataFrame
 Row │ value    Element     mM        
     │ Float64  SubStrin…   SubStrin…
─────┼────────────────────────────────
   1 │   0.1    Carbon      Min. (%)
   2 │   0.4    Carbon      Max. (%)
   3 │   0.015  Phosphorus  Max. (%)

julia> unstack(sdf1,:mM,:value)
2×3 DataFrame
 Row │ Element     Min. (%)   Max. (%) 
     │ SubStrin…   Float64?   Float64?
─────┼─────────────────────────────────
   1 │ Carbon            0.1     0.4
   2 │ Phosphorus  missing       0.015

julia> unstack(sdf1,:mM,:value,fill="unknown")
2×3 DataFrame
 Row │ Element     Min. (%)  Max. (%) 
     │ SubStrin…   Any       Any
─────┼────────────────────────────────
   1 │ Carbon      0.1       0.4
   2 │ Phosphorus  unknown   0.015

Dan · November 10, 2024, 1:55am

Nice solution. When trying it myself, also used stack like you did in:

There might be a possible feature addition to stack which could be useful in this case. The idea is to use:

julia> stack(DataFrame(dfrow),r"([a-zA-Z]+) (.+)")

which selects columns using a RegEx. But there is additional information in this RegEx which are capture groups. We can make stack use extracted capture groups and make them into columns in the output, essentially getting with the above RegEx the extra processing done with the select. Note that:

julia> match.(r"([a-zA-Z]+) (.+)", names(dfrow))
4-element Vector{RegexMatch{String}}:
 RegexMatch("Carbon Min. (%)", 1="Carbon", 2="Min. (%)")
 RegexMatch("Carbon Max. (%)", 1="Carbon", 2="Max. (%)")
 RegexMatch("Phosphorus Min. (%)", 1="Phosphorus", 2="Min. (%)")
 RegexMatch("Phosphorus Max. (%)", 1="Phosphorus", 2="Max. (%)")

This can often be useful when parsing numbered columns, which come as year1984 or some other weird format.

Is this a useful feature?

Nathan_Boyer · November 11, 2024, 4:44pm

Thanks all! I was having trouble figuring out the stack functions from the documentation, but these examples helped.

@pdeffebach you seem to have linked the same issue twice.

Topic		Replies	Views
Split dataframe row into multiple rows Data dataframes	8	1795	May 1, 2022
Excel's pivot table equivalent with DataFramesMeta General Usage excel , dataframesmeta	2	80	December 4, 2024
Unstack using row numbers Data question , dataframes	5	743	August 16, 2021
Transform DataFrame into a single row by adding row contents as new columns New to Julia dataframes	3	321	May 1, 2021
Split Column in Dataframe New to Julia dataframes	5	2192	February 4, 2022

Create DataFrame from data in DataFrameRow

Related topics