I’m sure there is a better approach that I am not aware of, but I am trying to add a new DataFrame, NewDF, to an existing Arrow table while eliminating any rows in NewDF that already exist in the table. Something like this:
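
    using Arrow, DataFrames

    database = "path/to/file.arrow"  # placeholder: path to the Arrow file
    # Copy the memory-mapped Arrow table into a mutable DataFrame.
    dataDF = copy(DataFrame(Arrow.Table(database)))
    dataDF = vcat(NewDF, dataDF)
    unique!(dataDF)  # drop duplicated rows
    Arrow.write(database, dataDF)
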
The Arrow file has grown too large, and I am having a hard time copying the Arrow table into a DataFrame to make it mutable. (My understanding is that Arrow files are otherwise immutable.)
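One option that might avoid the full copy is `Arrow.append`, which adds a new record batch to an existing file. As far as I understand it, this only works when the file was written in the Arrow IPC stream format (`file = false`), so an existing file written with the default settings would have to be rewritten once. The following is a minimal sketch under that assumption; the temporary path and the tiny tables are made up for illustration, and appending does not remove duplicates by itself.

```julia
using Arrow, DataFrames

# Hypothetical location; in practice this would be the real Arrow file.
path = joinpath(mktempdir(), "data.arrow")

dataDF = DataFrame(x = ["a", "b"], y = [1, 2])
NewDF  = DataFrame(x = ["c"], y = [3])

# Appending requires the stream format, hence `file = false`.
Arrow.write(path, dataDF; file = false)

# Adds NewDF as a new record batch; the existing data is not loaded or copied.
Arrow.append(path, NewDF)

# Read back without copying the columns into Julia vectors.
combined = DataFrame(Arrow.Table(path); copycols = false)
```

The schema of the appended table has to match the one already in the file, so the column names and types of NewDF should be checked first.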
My questions are:
Is there a better way to store and add to large data sets?
(I am probably doing something wrong, but I’m working with 64 GB of RAM, and the Arrow file, which is about 13 GB, often stalls or crashes my computer when I try to copy it to a DataFrame.)
Is there a way to efficiently check if the rows of NewDF already exist in dataDF, given
b: the different sizes of the two DataFrames, and
c: the different order of the rows?

    using DataFrames, Dates

    dataDF = DataFrame(x = ["a", "b", "c"],
                       y = [1, 2, 3],
                       z = [today(), today() + Day(1), today() + Day(2)],
                       a = [4.0, 5.0, 6.0])

    3×4 DataFrame
     Row │ x       y      z           a
         │ String  Int64  Date        Float64
    ─────┼────────────────────────────────────
       1 │ a           1  2023-03-30      4.0
       2 │ b           2  2023-03-31      5.0
       3 │ c           3  2023-04-01      6.0

    NewDF = DataFrame(x = ["c", "d"],
                      y = [3, 8],
                      z = [today() + Day(2), today() + Day(3)],
                      a = [6.0, 7.0])

    2×4 DataFrame
     Row │ x       y      z           a
         │ String  Int64  Date        Float64
    ─────┼────────────────────────────────────
       1 │ c           3  2023-04-01      6.0
       2 │ d           8  2023-04-02      7.0

I would like the resulting DataFrame to be something like:

    4×4 DataFrame
     Row │ x       y      z           a
         │ String  Int64  Date        Float64
    ─────┼────────────────────────────────────
       1 │ a           1  2023-03-30      4.0
       2 │ b           2  2023-03-31      5.0
       3 │ c           3  2023-04-01      6.0
       4 │ d           8  2023-04-02      7.0

I don’t think that isequal works in this situation. The only way I could think of doing it was to create a completely new DataFrame with something like

    function addunique(NewDF, dataDF)
        DF = DataFrame()
        # Keep only the rows of NewDF that do not already appear in dataDF.
        for row in eachrow(NewDF)
            !in(row, eachrow(dataDF)) && push!(DF, row)
        end
        return DF
    end

Is this an efficient approach? dataDF has hundreds of millions of rows, so any suggestions would be greatly appreciated!
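For comparison, the same "rows of NewDF that are not in dataDF" operation can be sketched with `antijoin` from DataFrames.jl, matching on every column. This is only an illustration on the small example tables (with the dates fixed instead of using `today()`); I have not measured how it behaves with hundreds of millions of rows, and it does not remove duplicates inside NewDF itself.

```julia
using DataFrames, Dates

d0 = Date(2023, 3, 30)  # fixed date so the example is reproducible

dataDF = DataFrame(x = ["a", "b", "c"],
                   y = [1, 2, 3],
                   z = [d0, d0 + Day(1), d0 + Day(2)],
                   a = [4.0, 5.0, 6.0])

NewDF = DataFrame(x = ["c", "d"],
                  y = [3, 8],
                  z = [d0 + Day(2), d0 + Day(3)],
                  a = [6.0, 7.0])

# Rows of NewDF that have no match in dataDF when compared on all columns.
# Row order and the different table sizes do not matter for the match.
newrows = antijoin(NewDF, dataDF; on = names(NewDF))

# Only the genuinely new rows are added to the existing data.
result = vcat(dataDF, newrows)
```

Unlike the loop in `addunique`, which scans all of dataDF once for every row of NewDF, a join is done in a single pass over both tables.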
Also, I know that DF[:, :cols] creates a copy of cols, whereas using ! as the row selector in DF[!, :cols] returns the column itself without copying. Is there an equivalent for column selection in DataFrames? It seems that in the function above I would be creating a lot of copies unnecessarily.
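To make the copy versus no-copy distinction concrete, here is a small sketch of the indexing forms as I understand them from the DataFrames.jl documentation. The table `df` and the variable names are made up for illustration.

```julia
using DataFrames

df = DataFrame(x = ["a", "b", "c"], y = [1, 2, 3])

col_copy = df[:, :y]              # a fresh copy of the column
col_same = df[!, :y]              # the stored vector itself, no copy
row_view = df[2, :]               # DataFrameRow: a view of one row
sub      = @view df[2:3, :]       # SubDataFrame: a view of rows 2 and 3
sub_x    = view(df, 2:3, [:x])    # a view restricted to some columns

# A change in the parent is visible through everything except the copy.
df[2, :y] = 20
```

`eachrow(df)` also yields `DataFrameRow` views, so iterating over rows does not copy the row data.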