I have a master data set and I like to update some of its values using a transaction data set. There are some key columns for finding the rows that must be updated in both data sets. I know about update!
function but the catch for my problem is that I don’t like to replace the values in the master data set, instead I like to ,for example, add the values of the transaction data set to the values in the master data set.
julia> master=Dataset(x=[1,1,2,3,3],y=[missing,0,2,3,-1]);
julia> transaction=Dataset(x=[1,3],y=[-1,1]);
I like
Dataset(x=[1,1,2,3,3],y=[missing,-1,2,4,0])
my data sets are huge and there are many key/value columns, i am looking for a scalable solution.
1 Like
My first idea would be to try a standard Dictionary. However, depending on your requirements DataStructures.jl offers a larger variety of data structures.
As a side note, your question is stated quite general. More context could help you get better feedback. E.g. if you are looking for something persistent or in-memory.
1 Like
using InMemoryDatasets
using Chain
master=Dataset(x=[1,1,2,3,3],y=[missing,0,2,3,-1]);
transaction=Dataset(x=[1,3],y=[-1,1]);
@chain master begin
compare(transaction,on=:x,eq=(x,y)->x+y)
select!(Not([:obs_id_left,:obs_id_right]))
rename!(first,_)
update!(master, on=:x)
end
I compared master and transaction with eq(x,y)=x+y and then used update! to update the output. it should be simpler than leftjoin to understand but i don’t think is faster than it.
1 Like
The above mentioned reply does not function on missing values.
``
using InMemoryDatasets
master=Dataset(x=[1,1,2,3,3],y=[missing,0,2,3,-1]);
transaction=Dataset(x=[1,3],y=[-1,1]);
val_cols=[:y]
key_cols=[:x]
leftjoin!(master,view(transaction,:,key_cols),on=key_cols,obs_id=true)
function fun(x,y,z)
for i in eachindex(y)
if !ismissing(x[i])
y[i]+=z[x[i]];
end
end
y
end
transforms_list=[c=>x->fun(master[!,:obs_id_right],x,transaction[!,c] for c in val_cols]
modify!(master,transforms_list)
select!(master,Not([:obs_id_left,:obs_id_right]))
3 Likes
Splendid! I rewrote it for DataFrames.jl
using DataFrames
master=DataFrame(x=[1,1,2,3,3],y=[missing,0,2,3,-1]);
transaction=DataFrame(x=[1,3],y=[-1,1]);
val_cols=[:y]
key_cols=[:x]
transaction.obs_id_right = 1:nrow(transaction)
leftjoin!(master,view(transaction,:,[key_cols; :obs_id_right]),on=key_cols);
function fun(x,y,z)
for i in eachindex(y)
if !ismissing(x[i])
y[i]+=z[x[i]];
end
end
y
end
transforms_list=[c=>(x->fun(master[!,:obs_id_right],x,transaction[!,c]))=>c for c in val_cols]
transform!(master,transforms_list)
select!(master,Not([:obs_id_right]))
2 Likes
Thanks for replies. I opened an issue and it’s added to InMemoryDatasets.jl
update(master, transaction, on=:x, op=+)
2 Likes