How to update a data set using another data set?

I have a master data set and I like to update some of its values using a transaction data set. There are some key columns for finding the rows that must be updated in both data sets. I know about update! function but the catch for my problem is that I don’t like to replace the values in the master data set, instead I like to ,for example, add the values of the transaction data set to the values in the master data set.


julia> master=Dataset(x=[1,1,2,3,3],y=[missing,0,2,3,-1]);

julia> transaction=Dataset(x=[1,3],y=[-1,1]);

I like


Dataset(x=[1,1,2,3,3],y=[missing,-1,2,4,0])

my data sets are huge and there are many key/value columns, i am looking for a scalable solution.

1 Like

My first idea would be to try a standard Dictionary. However, depending on your requirements DataStructures.jl offers a larger variety of data structures.

As a side note, your question is stated quite general. More context could help you get better feedback. E.g. if you are looking for something persistent or in-memory.

1 Like
using InMemoryDatasets
using Chain
master=Dataset(x=[1,1,2,3,3],y=[missing,0,2,3,-1]);
transaction=Dataset(x=[1,3],y=[-1,1]);
@chain master begin
           compare(transaction,on=:x,eq=(x,y)->x+y)
           select!(Not([:obs_id_left,:obs_id_right]))
           rename!(first,_)
           update!(master, on=:x)
       end

I compared master and transaction with eq(x,y)=x+y and then used update! to update the output. it should be simpler than leftjoin to understand but i don’t think is faster than it.

1 Like

The above mentioned reply does not function on missing values.
``
using InMemoryDatasets
master=Dataset(x=[1,1,2,3,3],y=[missing,0,2,3,-1]);

transaction=Dataset(x=[1,3],y=[-1,1]);
val_cols=[:y]
key_cols=[:x]
leftjoin!(master,view(transaction,:,key_cols),on=key_cols,obs_id=true)
function fun(x,y,z)
for i in eachindex(y)
if !ismissing(x[i])
y[i]+=z[x[i]];
end
end
y
end
transforms_list=[c=>x->fun(master[!,:obs_id_right],x,transaction[!,c] for c in val_cols]
modify!(master,transforms_list)
select!(master,Not([:obs_id_left,:obs_id_right]))

3 Likes

Splendid! I rewrote it for DataFrames.jl

using DataFrames
master=DataFrame(x=[1,1,2,3,3],y=[missing,0,2,3,-1]);
transaction=DataFrame(x=[1,3],y=[-1,1]);
val_cols=[:y]
key_cols=[:x]
transaction.obs_id_right = 1:nrow(transaction)
leftjoin!(master,view(transaction,:,[key_cols; :obs_id_right]),on=key_cols);
function fun(x,y,z)
       for i in eachindex(y)
           if !ismissing(x[i])
               y[i]+=z[x[i]];
           end
       end
       y
 end
transforms_list=[c=>(x->fun(master[!,:obs_id_right],x,transaction[!,c]))=>c for c in val_cols]
transform!(master,transforms_list)
select!(master,Not([:obs_id_right]))
2 Likes

Thanks for replies. I opened an issue and it’s added to InMemoryDatasets.jl

update(master, transaction, on=:x, op=+)
2 Likes