I'm sure there is probably a better approach I am not aware of, but I am trying to add a new DataFrame, NewDF, to an existing Arrow table while eliminating any rows in NewDF that already exist in the Arrow file.
Something like this:
database = #filepath
dataDF = copy(DataFrame(Arrow.Table(database)))
dataDF = vcat(NewDF, dataDF)
unique!(dataDF)
Arrow.write(database, dataDF)
However, the Arrow file has grown too large, and I am having a hard time copying the Arrow table as a DataFrame to make it mutable. (My understanding is that Arrow files are otherwise immutable.)
My questions are:
- Is there a better way to store and add to large data sets? (I am probably doing something wrong, but I'm working with 64 GB of RAM, and the Arrow file, which is about 13 GB, often stalls or crashes my computer when I try to copy it to a DataFrame.)
- Is there a way to efficiently check whether the rows of NewDF already exist in dataDF, given
  a: dataDF is immutable
  b: the different sizes of the DataFrames
  c: the different order of the rows?
e.g.
dataDF = DataFrame(x = ["a","b","c"], y = [1,2,3] , z = [today(), today()+Day(1), today()+Day(2)] , a =[4.0,5.0,6.0] )
3×4 DataFrame
 Row │ x       y      z           a
     │ String  Int64  Date        Float64
─────┼─────────────────────────────────────
   1 │ a           1  2023-03-30      4.0
   2 │ b           2  2023-03-31      5.0
   3 │ c           3  2023-04-01      6.0
NewDF = DataFrame(x = ["c", "d"], y = [3, 8] , z = [today()+Day(2), today()+Day(3)] , a =[6.0, 7.0] )
2×4 DataFrame
 Row │ x       y      z           a
     │ String  Int64  Date        Float64
─────┼─────────────────────────────────────
   1 │ c           3  2023-04-01      6.0
   2 │ d           8  2023-04-02      7.0
I would like the resulting DataFrame to be something like
4×4 DataFrame
 Row │ x       y      z           a
     │ String  Int64  Date        Float64
─────┼─────────────────────────────────────
   1 │ a           1  2023-03-30      4.0
   2 │ b           2  2023-03-31      5.0
   3 │ c           3  2023-04-01      6.0
   4 │ d           8  2023-04-02      7.0
I don't think that isequal helps in this situation. The only way I could think of doing it was creating a completely new DataFrame with something like
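function addunique(NewDF, dataDF)
    DF = DataFrame()
    # a DataFrame is not iterable by itself, so iterate over its rows explicitly
    for row in eachrow(NewDF)
        # keep the row only if no identical row exists in dataDF (linear scan per row)
        !in(row, eachrow(dataDF)) && push!(DF, row)
    end
    DF
end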
Is this an efficient approach? dataDF has hundreds of millions of rows, so any suggestions would be greatly appreciated!
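For comparison, here is a minimal sketch of the same idea using antijoin from DataFrames.jl with every column as a join key, so the lookup is done by the join machinery rather than by scanning dataDF once per row. It uses the small example tables from above with a fixed date (d0 is a name I made up in place of today()), and it assumes there are no missing, NaN, or -0.0 values in the key columns. I have not tried it on a DataFrame backed by read-only Arrow columns, so whether it scales to the real file is an open question.

```julia
using DataFrames, Dates

d0 = Date(2023, 3, 30)  # fixed stand-in for today(), so the example is reproducible

dataDF = DataFrame(x = ["a", "b", "c"], y = [1, 2, 3],
                   z = [d0, d0 + Day(1), d0 + Day(2)], a = [4.0, 5.0, 6.0])
NewDF = DataFrame(x = ["c", "d"], y = [3, 8],
                  z = [d0 + Day(2), d0 + Day(3)], a = [6.0, 7.0])

# rows of NewDF that have no exact match in dataDF, comparing all columns;
# neither input is mutated and row order in dataDF does not matter
newrows = antijoin(NewDF, dataDF; on = names(NewDF))

# only the genuinely new rows need to be added to the existing data
combined = vcat(dataDF, newrows)
```

With this, only newrows would have to be written out, instead of deduplicating the whole combined table with unique!.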
Also, I know that DF[:, :cols] creates a copy of cols, whereas using ! as the row selector in DF[!, :cols] returns the stored column without copying it. Is there an equivalent for column selection in DataFrames? It seems that in the function above I would be creating a lot of copies unnecessarily.
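To make the copy question concrete, here is a small sketch of the indexing behaviour as I understand it from the DataFrames.jl indexing rules. The tiny df and the variable names are only for illustration.

```julia
using DataFrames

df = DataFrame(x = ["a", "b", "c"], y = [1, 2, 3])

col_nocopy = df[!, :x]   # the stored column itself, no copy
col_copy   = df[:, :x]   # a freshly allocated copy of the column

# a view of a subset of rows, backed by df rather than copied from it
rows_view = view(df, 2:3, :)

# selecting a single row with a colon for the columns gives a DataFrameRow,
# which is also a view into df
first_row = df[1, :]

# writing through the row view changes the parent DataFrame
rows_view[1, :y] = 20
```

If that is right, each row produced by eachrow is a lightweight view as well, so the copies in my function would come from push! rather than from the row selection.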