I’m stuck on this data transformation.
I have a DataFrameRow that I looked up from another table.
julia> using DataFramesMeta
julia> dfrow = DataFrame(
"Carbon Min. (%)" => 0.1,
"Carbon Max. (%)" => 0.4,
"Phosphorus Min. (%)" => 0,
"Phosphorus Max. (%)" => 0.015,
)[1,:]
DataFrameRow
Row │ Carbon Min. (%) Carbon Max. (%) Phosphorus Min. (%) Phosphorus Max. (%)
│ Float64 Float64 Int64 Float64
─────┼────────────────────────────────────────────────────────────────────────────
1 │ 0.1 0.4 0 0.015
I want to create a new DataFrame with the data in the form below.
julia> df = DataFrame(
"Element" => ["Carbon", "Phosphorus"],
"Min. (%)" => [0.1, 0.0],
"Max. (%)" => [0.4, 0.015],
)
2×3 DataFrame
Row │ Element Min. (%) Max. (%)
│ String Float64 Float64
─────┼────────────────────────────────
1 │ Carbon 0.1 0.4
2 │ Phosphorus 0.0 0.015
I extracted the list of elements for the first DataFrame column, but I am struggling with the syntax to move the relevant values from dfrow
into df
.
julia> elements = unique(first.(split.(names(dfrow)))
2-element Vector{SubString{String}}:
"Carbon"
"Phosphorus"
I should also mention that there are missing
% values in the data.
Two relevant issues in DataFrames about this.
I don’t know how I would do this without some sort of loop
and vcat
a bit of a tortuous path, but it seems to get the desired shape while remaining within the context of the DataFrames package
sdf=stack(DataFrame(dfrow),names(dfrow))
sdf1=select(sdf,:value, :variable => ByRow(r->split(r,limit=2))=>[:Element,:mM,])
unstack(sdf1,:mM,:value)
With @pdeffebach inputs, I have got this working.
elements = unique(first.(split.(names(row), ' ')))
empty_vector = []
for elem in elements
df = DataFrame(
"Element" => String[],
"Min. (%)" => Union{Float64,Missing}[],
"Max. (%)" => Union{Float64,Missing}[]
)
min_key, max_key = "$elem Min. (%)", "$elem Max. (%)"
push!(df, (elem, get(dfrow, min_key, missing), get(dfrow, max_key, missing)))
push!(empty_vector, df)
end
df_new = vcat(empty_vector...)
I think the easiest solution is using stack
and unstack
.
using DataFrames
dfrow = DataFrame(
"Carbon Min. (%)" => 0.1,
"Carbon Max. (%)" => 0.4,
"Phosphorus Min. (%)" => 0,
"Phosphorus Max. (%)" => 0.015,
)
# permute dimensions
dff = stack(dfrow, :)
# extract element and measures from columns
dff.element = [split(x, " ")[1] for x in dff.variable]
dff.measure = [split(x, " ")[2] for x in dff.variable]
# reshape
df = unstack(dff, :element, :measure, :value)
with output
2×3 DataFrame
Row │ element Min. Max.
│ SubStrin… Float64? Float64?
─────┼────────────────────────────────
1 │ Carbon 0.1 0.4
2 │ Phosphorus 0.0 0.015
EDIT: same as @rocco_sprmnt21 's solution
1 Like
unstack also takes care of handling missed values
julia> dfrow = DataFrame(
"Carbon Min. (%)" => 0.1,
"Carbon Max. (%)" => 0.4,
#="Phosphorus Min. (%)" => 0,=#
"Phosphorus Max. (%)" => 0.015,
)[1,:]
DataFrameRow
Row │ Carbon Min. (%) Carbon Max. (%) Phosphorus Max. (%)
│ Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────
1 │ 0.1 0.4 0.015
julia> sdf=stack(DataFrame(dfrow))
3×2 DataFrame
Row │ variable value
│ String Float64
─────┼──────────────────────────────
1 │ Carbon Min. (%) 0.1
2 │ Carbon Max. (%) 0.4
3 │ Phosphorus Max. (%) 0.015
julia> sdf1=select(sdf,:value, :variable => ByRow(r->split(r,limit=2))=>[:Element,:mM,])
3×3 DataFrame
Row │ value Element mM
│ Float64 SubStrin… SubStrin…
─────┼────────────────────────────────
1 │ 0.1 Carbon Min. (%)
2 │ 0.4 Carbon Max. (%)
3 │ 0.015 Phosphorus Max. (%)
julia> unstack(sdf1,:mM,:value)
2×3 DataFrame
Row │ Element Min. (%) Max. (%)
│ SubStrin… Float64? Float64?
─────┼─────────────────────────────────
1 │ Carbon 0.1 0.4
2 │ Phosphorus missing 0.015
julia> unstack(sdf1,:mM,:value,fill="unknown")
2×3 DataFrame
Row │ Element Min. (%) Max. (%)
│ SubStrin… Any Any
─────┼────────────────────────────────
1 │ Carbon 0.1 0.4
2 │ Phosphorus unknown 0.015
2 Likes
Dan
November 10, 2024, 1:55am
7
Nice solution. When trying it myself, also used stack
like you did in:
There might be a possible feature addition to stack
which could be useful in this case. The idea is to use:
julia> stack(DataFrame(dfrow),r"([a-zA-Z]+) (.+)")
which selects columns using a RegEx. But there is additional information in this RegEx which are capture groups. We can make stack
use extracted capture groups and make them into columns in the output, essentially getting with the above RegEx the extra processing done with the select
. Note that:
julia> match.(r"([a-zA-Z]+) (.+)", names(dfrow))
4-element Vector{RegexMatch{String}}:
RegexMatch("Carbon Min. (%)", 1="Carbon", 2="Min. (%)")
RegexMatch("Carbon Max. (%)", 1="Carbon", 2="Max. (%)")
RegexMatch("Phosphorus Min. (%)", 1="Phosphorus", 2="Min. (%)")
RegexMatch("Phosphorus Max. (%)", 1="Phosphorus", 2="Max. (%)")
This can often be useful when parsing numbered columns, which come as year1984
or some other weird format.
Is this a useful feature?
Thanks all! I was having trouble figuring out the stack
functions from the documentation, but these examples helped.
@pdeffebach you seem to have linked the same issue twice.