Hello All,
I’m trying to figure out a quick way to insert dummy rows with 0 values if they are missing for a set of index fields
For example, let’s say I have this dataframe:
data = DataFrame(:index => String[], :metric => String[], :values => Int64[])
push!(data, ["a", "x", 1])
push!(data, ["a", "y", 2])
push!(data, ["b", "x", 1])
julia> data
3×3 DataFrame
Row │ index metric values
│ String String Int64
─────┼────────────────────────
1 │ a x 1
2 │ a y 2
3 │ b x 1
For this table, I have metrics x,y for index A, but only metric X for index B. Is there a simple way to populate this table with index b, metric y, value = 0?
Of course, in my real table, there are thousands/millions of rows, so I don’t want to loop through these values one at a time to insert them. In my current program, I pivot the data wide, insert 0 values where they are missing and then depivot it back, but not sure if there is an easy way to do that/better way to do that in Julia DataFrames.
Thanks.
There is a keyword argument cols
in push!
where you can do cols = :union
do add new columns.
However you can’t control what the value will be, so you will need to do coalesce
after your loop.
Sorry I’m not quite sure what you mean by this. How can I use union in this case?
What you want is https://github.com/JuliaData/DataFrames.jl/pull/1864 but it was never finished. If you would find this functionality useful I can re-write this PR anew and add it.
It does sound like what I want. I’d use the function if it existed.
Thanks
For now you can have a look at the internals of the implementation - they should be relatively fast (but I do not guarantee it as it was written a long time ago).
Please read the docstring for push!
with ?push
. Here is an MWE that points you in the right direction.
julia> data = DataFrame(:index => String[], :metric => String[], :values => Int64[])
0×3 DataFrame
julia> datas = [(index = "a", metric = "x"), (index = "a", values = 1)];
julia> for d in datas
push!(data, d; cols = :union)
end
julia> data
2×3 DataFrame
Row │ index metric values
│ String String? Int64?
─────┼──────────────────────────
1 │ a x missing
2 │ a missing 1
My solution, (when columns are categorical unique
should be changed to levels
)
d = DataFrame([(index = i, metric = m)
for i in unique(data.index)
for m in unique(data.metric)])
leftjoin(d, data, on =[:index, :metric])
1 Like