Create DataFrame from data in DataFrameRow

I’m stuck on this data transformation.

I have a DataFrameRow that I looked up from another table.

julia> using DataFramesMeta
julia> dfrow = DataFrame(
           "Carbon Min. (%)" => 0.1,
           "Carbon Max. (%)" => 0.4,
           "Phosphorus Min. (%)" => 0,
           "Phosphorus Max. (%)" => 0.015,
       )[1,:]
DataFrameRow
 Row │ Carbon Min. (%)  Carbon Max. (%)  Phosphorus Min. (%)  Phosphorus Max. (%)
     │ Float64          Float64          Int64                Float64
─────┼────────────────────────────────────────────────────────────────────────────
   1 │             0.1              0.4                    0                0.015

I want to create a new DataFrame with the data in the form below.

julia> df = DataFrame(
           "Element" => ["Carbon", "Phosphorus"],
           "Min. (%)" => [0.1, 0.0],
           "Max. (%)" => [0.4, 0.015],
       )
2×3 DataFrame
 Row │ Element    Min. (%)  Max. (%)
     │ String      Float64   Float64
─────┼────────────────────────────────
   1 │ Carbon           0.1     0.4
   2 │ Phosphorus       0.0     0.015

I extracted the list of elements for the first DataFrame column, but I am struggling with the syntax to move the relevant values from dfrow into df.

julia> elements = unique(first.(split.(names(dfrow)))
2-element Vector{SubString{String}}:
 "Carbon"
 "Phosphorus"

I should also mention that there are missing % values in the data.

Two relevant issues in DataFrames about this.

I don’t know how I would do this without some sort of loop and vcat

a bit of a tortuous path, but it seems to get the desired shape while remaining within the context of the DataFrames package

sdf=stack(DataFrame(dfrow),names(dfrow))

sdf1=select(sdf,:value, :variable => ByRow(r->split(r,limit=2))=>[:Element,:mM,])

unstack(sdf1,:mM,:value)

With @pdeffebach inputs, I have got this working.

elements = unique(first.(split.(names(row), ' '))) 
empty_vector = []

for elem in elements
    df = DataFrame(
        "Element" => String[],
        "Min. (%)" => Union{Float64,Missing}[],
        "Max. (%)" => Union{Float64,Missing}[]
    )
    min_key, max_key = "$elem Min. (%)", "$elem Max. (%)"
    push!(df, (elem, get(dfrow, min_key, missing), get(dfrow, max_key, missing)))
    push!(empty_vector, df)
end
df_new = vcat(empty_vector...)

I think the easiest solution is using stack and unstack.

using DataFrames
dfrow = DataFrame(
           "Carbon Min. (%)" => 0.1,
           "Carbon Max. (%)" => 0.4,
           "Phosphorus Min. (%)" => 0,
           "Phosphorus Max. (%)" => 0.015,
       )

# permute dimensions
dff = stack(dfrow, :)

# extract element and measures from columns
dff.element = [split(x, " ")[1] for x in dff.variable]
dff.measure = [split(x, " ")[2] for x in dff.variable]

# reshape
df = unstack(dff, :element, :measure, :value)

with output

2×3 DataFrame
 Row │ element     Min.      Max.     
     │ SubStrin…   Float64?  Float64? 
─────┼────────────────────────────────
   1 │ Carbon           0.1     0.4
   2 │ Phosphorus       0.0     0.015

EDIT: same as @rocco_sprmnt21 's solution

1 Like

unstack also takes care of handling missed values

julia>  dfrow = DataFrame(
                   "Carbon Min. (%)" => 0.1,
                   "Carbon Max. (%)" => 0.4,
                   #="Phosphorus Min. (%)" => 0,=#
                   "Phosphorus Max. (%)" => 0.015,
               )[1,:]
DataFrameRow
 Row │ Carbon Min. (%)  Carbon Max. (%)  Phosphorus Max. (%) 
     │ Float64          Float64          Float64
─────┼───────────────────────────────────────────────────────
   1 │             0.1              0.4                0.015

julia> sdf=stack(DataFrame(dfrow))
3×2 DataFrame
 Row │ variable             value   
     │ String               Float64
─────┼──────────────────────────────
   1 │ Carbon Min. (%)        0.1
   2 │ Carbon Max. (%)        0.4
   3 │ Phosphorus Max. (%)    0.015

julia> sdf1=select(sdf,:value, :variable => ByRow(r->split(r,limit=2))=>[:Element,:mM,])
3×3 DataFrame
 Row │ value    Element     mM        
     │ Float64  SubStrin…   SubStrin…
─────┼────────────────────────────────
   1 │   0.1    Carbon      Min. (%)
   2 │   0.4    Carbon      Max. (%)
   3 │   0.015  Phosphorus  Max. (%)

julia> unstack(sdf1,:mM,:value)
2×3 DataFrame
 Row │ Element     Min. (%)   Max. (%) 
     │ SubStrin…   Float64?   Float64?
─────┼─────────────────────────────────
   1 │ Carbon            0.1     0.4
   2 │ Phosphorus  missing       0.015

julia> unstack(sdf1,:mM,:value,fill="unknown")
2×3 DataFrame
 Row │ Element     Min. (%)  Max. (%) 
     │ SubStrin…   Any       Any
─────┼────────────────────────────────
   1 │ Carbon      0.1       0.4
   2 │ Phosphorus  unknown   0.015
2 Likes

Nice solution. When trying it myself, also used stack like you did in:

There might be a possible feature addition to stack which could be useful in this case. The idea is to use:

julia> stack(DataFrame(dfrow),r"([a-zA-Z]+) (.+)")

which selects columns using a RegEx. But there is additional information in this RegEx which are capture groups. We can make stack use extracted capture groups and make them into columns in the output, essentially getting with the above RegEx the extra processing done with the select. Note that:

julia> match.(r"([a-zA-Z]+) (.+)", names(dfrow))
4-element Vector{RegexMatch{String}}:
 RegexMatch("Carbon Min. (%)", 1="Carbon", 2="Min. (%)")
 RegexMatch("Carbon Max. (%)", 1="Carbon", 2="Max. (%)")
 RegexMatch("Phosphorus Min. (%)", 1="Phosphorus", 2="Min. (%)")
 RegexMatch("Phosphorus Max. (%)", 1="Phosphorus", 2="Max. (%)")

This can often be useful when parsing numbered columns, which come as year1984 or some other weird format.

Is this a useful feature?

Thanks all! I was having trouble figuring out the stack functions from the documentation, but these examples helped.

@pdeffebach you seem to have linked the same issue twice.