Updating data with new data, removing redundancies

Suppose I have four corresponding arrays relating to stock market data. Also, suppose I assemble them into a matrix:

dataold = [mktId stockId totalVolume price timeStamp]

I proceed to gather more data of the stated kind, but at a different moment. This new data will be different at least in the timeStamp array, but may also be different elsewhere (most obviously price and totalVolume). Again, I assemble the data into matrix form:

datanew = [mktId stockId totalVolume price timeStamp]

The problem that I am trying to solve in the most efficient way possible is the following:

“How do I combine dataold and datanew to get dataall, such that dataall does not contain redundant data?”

By redundant data I mean that rows (in the stacked matrix [dataold; datanew]) that are duplicates except in timeStamp are removed, so that only the first such row (i.e., the row with the smallest timeStamp) remains.

I am looking for a general strategy (what is most efficient? Do I need to assemble the matrices, or is there a much better way, for example?) in Julia.

Perhaps someone may even venture a snippet of code that does the job. That would be greatly appreciated!

This sounds like a job for leftjoin using DataFrames. Is there a reason you want to use a matrix? A data frame seems like a much more intuitive structure for this kind of thing.


Only because matrices are intuitive for me (given my MATLAB background) and because I am not too familiar with DataFrames.

+1 for the suggestion of DataFrames! Doing something like this without them seems quite painful. With DataFrames, at the very minimum you could create one long DataFrame (like your dataall) and then do groupby(df, [:mktId, :stockId]) to isolate the data for each stock, and then apply your filter.

The docs are very helpful https://juliadata.github.io/DataFrames.jl/stable/
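The suggestion above can be sketched like this (the column names follow the question; the sample values are invented for illustration):

```julia
using DataFrames

# Hypothetical old and new snapshots, using the column names from the question
dataold = DataFrame(mktId = [1, 1], stockId = [10, 20],
                    totalVolume = [100, 200], price = [9.5, 4.2],
                    timeStamp = [1, 1])
datanew = DataFrame(mktId = [1, 1], stockId = [10, 20],
                    totalVolume = [100, 250], price = [9.5, 4.3],
                    timeStamp = [2, 2])

# One long table, the DataFrame analogue of [dataold; datanew]
dataall = vcat(dataold, datanew)

# One group per (market, stock) pair
gdf = groupby(dataall, [:mktId, :stockId])
```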

That is very helpful. I have groupby(df,[:mktId,:stockId]) working in my particular case. I may return with a question about how to apply the filter!


There is also a function unique which can delete rows that are the same for a certain set of columns.


Can I ask you to apply the filter? In my case, the first step appears to be:
groupby(df, [:mktId, :stockId, :totalVolume, :price])


That procedure creates many groups. Within a particular group, the elements mktId, stockId, totalVolume, and price are the same, but the last element (timeStamp) can differ (if multiple timeStamps exist).

After the first step has been applied, I want each group to keep only one timeStamp (the smallest datetime). So each group should have only one row.

combine(groupby(df, [:mktId, :stockId, :totalVolume, :price])) do sdf
    sdf[1, :]
end

So, you are taking only the first row of each group?

Yes, combine will take all of these DataFrameRow objects and make them into one big DataFrame.

But you could also do

unique(df, [:mktId,:stockId,:totalVolume,:price])

which would be the same, I think.
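On a toy table (data invented for illustration) the two approaches can be compared directly:

```julia
using DataFrames

df = DataFrame(mktId = [1, 1, 1], stockId = [10, 10, 20],
               totalVolume = [100, 100, 200], price = [9.5, 9.5, 4.2],
               timeStamp = [1, 2, 1])

keycols = [:mktId, :stockId, :totalVolume, :price]

# combine-based version: keep the first row of each group
via_combine = combine(groupby(df, keycols)) do sdf
    sdf[1, :]
end

# unique-based version: keep the first row for each key combination
via_unique = unique(df, keycols)
```

Both keep the rows with timeStamp 1 here. Note that unique preserves the original row order, while combine returns rows in group order, so sort both before comparing if row order matters.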


Be sure the data is sorted properly, because those methods blindly pick the first row. But yes, the last solution using unique is definitely the way to go in this case, if I understand correctly what you want.
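Putting the advice in this thread together: sort by timeStamp first, so that "first row" really means "smallest timeStamp", then drop the duplicates (a sketch with invented data):

```julia
using DataFrames

dataall = DataFrame(mktId = [1, 1, 1], stockId = [10, 10, 20],
                    totalVolume = [100, 100, 200], price = [9.5, 9.5, 4.2],
                    timeStamp = [5, 3, 7])   # deliberately unsorted

# Sort so the earliest observation of each duplicate set comes first,
# then keep only that first row per key combination
sorted  = sort(dataall, :timeStamp)
dataall = unique(sorted, [:mktId, :stockId, :totalVolume, :price])
```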