Updating data with new data, removing redundancies

Nash · July 10, 2020, 5:36pm

Suppose I have five corresponding arrays relating to stock market data. Also, suppose I assemble them into a matrix:

dataold = [mktId stockId totalVolume price timeStamp]

I proceed to gather more data of the stated kind, but at a different moment. This new data will be different at least in the timeStamp array, but may also be different elsewhere (most obviously price and totalVolume). Again, I assemble the data into matrix form:

datanew = [mktId stockId totalVolume price timeStamp]

The problem that I am trying to solve in the most efficient way possible is the following:

“How do I combine dataold and datanew to get dataall, such that dataall does not contain redundant data?”

By redundant data I mean that rows (in the stacked matrix [dataold; datanew]) that are duplicates except in timeStamp are removed, such that only the first of such rows (i.e., the row with the smallest timeStamp) remains.

I am looking for a general strategy (what is most efficient? Do I need to assemble the matrices, or is there a much better way, for example?) in Julia.

Perhaps someone may even venture a snippet of code that does the job. That would be greatly appreciated!

pdeffebach · July 10, 2020, 5:47pm

This sounds like a job for leftjoin using DataFrames. Is there a reason you want to use a matrix? A data frame seems like a much more intuitive structure for this kind of thing.

Nash · July 10, 2020, 5:52pm

Only because matrices are intuitive for me (given my MATLAB background) and because I am not too familiar with DataFrames.

tbeason · July 10, 2020, 5:54pm

+1 for the suggestion of DataFrames! Doing something like this without it seems quite painful. With DataFrames, at the very minimum you could just create one long DataFrame (like your dataall), then do groupby(df,[:mktId,:stockId]) to isolate the data for each stock, and then apply your filter.

The docs are very helpful: Introduction · DataFrames.jl
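A minimal sketch of that idea, assuming the five arrays are wrapped in DataFrames with the column names from the question. The values below are made up for illustration, and timeStamp is shown as a plain integer rather than a DateTime:

```julia
using DataFrames

# Hypothetical sample data; only the column names come from the thread.
dataold = DataFrame(mktId = [1, 1], stockId = [10, 11],
                    totalVolume = [100, 200], price = [5.0, 7.5],
                    timeStamp = [1, 1])
datanew = DataFrame(mktId = [1, 1], stockId = [10, 11],
                    totalVolume = [100, 250], price = [5.0, 7.6],
                    timeStamp = [2, 2])

# Stack the two tables vertically into one long table.
dataall = vcat(dataold, datanew)

# One group per (mktId, stockId) pair, i.e. one group per stock.
gdf = groupby(dataall, [:mktId, :stockId])
```

Each element of gdf is then a view of the rows belonging to one stock, which can be filtered or summarized separately.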

Nash · July 10, 2020, 6:22pm

That is very helpful. I have groupby(df,[:mktId,:stockId]) working in my particular case. I may return with a question about how to apply the filter!

pdeffebach · July 10, 2020, 6:32pm

There is also a function unique which can delete rows that are the same for a certain set of columns.

Nash · July 10, 2020, 7:49pm

Can I ask you to apply the filter? In my case, the first step appears to be:

groupby(df,[:mktId,:stockId,:totalVolume,:price])

That procedure creates many groups. Within any one of these groups, the elements mktId, stockId, totalVolume and price are the same, but the last element (timeStamp) can differ (if multiple timeStamps exist).

After the first step has been applied, I want each group to only have one timeStamp (the smallest datetime). So, each group should have only one line.

pdeffebach · July 10, 2020, 7:51pm

combine(groupby(df,[:mktId,:stockId,:totalVolume,:price])) do sdf
    sdf[1, :]
end
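Note that sdf[1, :] keeps whichever row happens to be stored first in each group. As a variant (not the code above, just a sketch under the assumption that timeStamp is the only column outside the grouping keys), the smallest timeStamp can be requested explicitly so the result does not depend on row order. The sample values and the name keycols are made up for illustration:

```julia
using DataFrames

# Hypothetical data with the rows deliberately out of time order.
df = DataFrame(mktId = [1, 1, 1], stockId = [10, 10, 11],
               totalVolume = [100, 100, 200], price = [5.0, 5.0, 7.5],
               timeStamp = [2, 1, 1])

keycols = [:mktId, :stockId, :totalVolume, :price]

# Collapse each group to a single row holding its earliest timeStamp.
result = combine(groupby(df, keycols), :timeStamp => minimum => :timeStamp)
```

The result has one row per distinct combination of the key columns, plus the minimum timeStamp for that combination.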

Nash · July 10, 2020, 7:56pm

So, you are taking the first index of each group only?

pdeffebach · July 10, 2020, 8:05pm

Yes, combine will take all of these DataFrameRow objects and make them into one big DataFrame.

But you could also do

unique(df, [:mktId,:stockId,:totalVolume,:price])

which would be the same, I think.

tbeason · July 10, 2020, 8:47pm

Be sure the data is sorted properly (here, by timeStamp) because those methods will blindly pick the first row. But yes, the last solution using unique is definitely the way to go in this case, if I understand correctly what you want.
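Putting the two pieces of advice together, a minimal sketch could sort by timeStamp first and then call unique on the remaining columns. The sample values are made up, and timeStamp is again assumed to be any sortable type (an integer here):

```julia
using DataFrames

# Hypothetical combined data, not yet in time order.
dataall = DataFrame(mktId = [1, 1, 1, 1], stockId = [10, 11, 10, 11],
                    totalVolume = [100, 250, 100, 200],
                    price = [5.0, 7.6, 5.0, 7.5],
                    timeStamp = [2, 2, 1, 1])

# Earliest rows first, so that unique keeps the smallest timeStamp.
sort!(dataall, :timeStamp)

# Keep the first row for each distinct combination of the listed columns.
dedup = unique(dataall, [:mktId, :stockId, :totalVolume, :price])
```

Here the two rows for stock 10 differ only in timeStamp, so only the earlier one survives, while both rows for stock 11 are kept because their totalVolume and price differ.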
