Hi! Suppose I have the following GroupedDataFrame:
using Arrow, DataFrames, Dates
GDF1 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,10), Date(2023,4,10)], Time = [3.85, 4.13]), :ID)
With Arrow.append I can save each SubDataFrame of a GroupedDataFrame as a separate partition of an Arrow.Table with something like:
File = "filepath"
for i in GDF1
    Arrow.append(File, i)
end
I was wondering whether there is a way to append to each partition after it has been created. For example, suppose I wanted to append
GDF2 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,12), Date(2023,4,12)], Time = [3.87, 4.14]), :ID)
to the created Arrow.Table. If I try
File = "filepath"
for i in GDF2
    Arrow.append(File, i)
end
I end up with an Arrow.Table that has 4 partitions and looks like this:
View = DataFrame(Arrow.Table(File))
4×3 DataFrame
 Row │ ID      Date        Time
     │ String  Date        Float64
─────┼─────────────────────────────
   1 │ Eng1    2023-04-10     3.85
   2 │ Eng2    2023-04-10     4.13
   3 │ Eng1    2023-04-12     3.87
   4 │ Eng2    2023-04-12     4.14
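The partition count can be checked by iterating the file with Arrow.Stream, which yields one Arrow.Table per record batch. Below is a minimal sketch of that check, not a definitive recipe. The helpers `count_partitions` and `append_groups` are hypothetical names, a temporary file stands in for "filepath", and the first batch is written explicitly in stream format because Arrow.append requires existing data to be in the IPC stream format.

```julia
using Arrow, DataFrames, Dates

# Hypothetical helper: count the record batches (partitions) in a stream-format file.
function count_partitions(path)
    n = 0
    for _ in Arrow.Stream(path)
        n += 1
    end
    return n
end

# Hypothetical helper: write each group of `gdf` as its own record batch.
function append_groups(path, gdf)
    for sub in gdf
        if isfile(path) && filesize(path) > 0
            Arrow.append(path, sub)
        else
            # First batch: write in stream format so that later appends are allowed.
            open(path, "w") do io
                Arrow.write(io, sub; file=false)
            end
        end
    end
    return path
end

GDF1 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,10), Date(2023,4,10)], Time = [3.85, 4.13]), :ID)
GDF2 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,12), Date(2023,4,12)], Time = [3.87, 4.14]), :ID)

path = joinpath(mktempdir(), "demo.arrow")  # assumed location, stands in for "filepath"
append_groups(path, GDF1)
append_groups(path, GDF2)

# Each append call adds a new record batch; existing batches are left unchanged.
n = count_partitions(path)
println("partitions: ", n)
```

Every call adds a batch at the end of the file, which is why the second loop produces new partitions instead of extending the first two.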
I would like the resulting Arrow.Table to retain the initial 2 partitions, with the new data appended to each partition, i.e. the result I would have obtained had I done
DF1 = DataFrame(GDF1)
DF2 = DataFrame(GDF2)
DF = vcat(DF1,DF2)
GDF = groupby(DF, :ID)
File = "filepath"
for i in GDF
    Arrow.append(File, i)
end
The resulting DataFrame of the Arrow.Table would look like
4×3 DataFrame
 Row │ ID      Date        Time
     │ String  Date        Float64
─────┼─────────────────────────────
   1 │ Eng1    2023-04-10     3.85
   2 │ Eng1    2023-04-12     3.87
   3 │ Eng2    2023-04-10     4.13
   4 │ Eng2    2023-04-12     4.14
after the inclusion of each new DataFrame. I would like to accomplish this by appending to each existing partition because new data is added daily to a very large Arrow.Table. For that reason I cannot rebuild the file with the previous method, and calling sort on the entire DataFrame would be very slow or would exceed the memory capacity of my machine.
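Based on bkamins' explanation of Arrow.Stream, I could try iterating over the partitions one at a time and calling a function such as DoesSomeThingOnThisPartition on each of them, where DoesSomeThingOnThisPartition would vcat any new data generated for the existing :ID partition.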
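A rough sketch of what such a loop might look like, under several assumptions: the signature of DoesSomeThingOnThisPartition is made up for illustration, `write_batch` and `rewrite_partitions` are hypothetical helpers, every partition is non-empty and holds a single :ID, and the file names are placeholders in a temporary directory.

```julia
using Arrow, DataFrames, Dates

# Assumed signature: takes one existing partition plus the new rows and
# returns the partition with the rows for the same :ID stacked underneath.
function DoesSomeThingOnThisPartition(part::DataFrame, newdata::DataFrame)
    id = first(part.ID)  # assumes a non-empty partition holding a single ID
    return vcat(part, filter(:ID => ==(id), newdata))
end

# Hypothetical helper: write `tbl` as the first batch of a stream-format file,
# or append it as a further batch if the file already has content.
function write_batch(path, tbl)
    if isfile(path) && filesize(path) > 0
        Arrow.append(path, tbl)
    else
        open(path, "w") do io
            Arrow.write(io, tbl; file=false)
        end
    end
    return path
end

# Reads the input one partition at a time and writes the merged partition
# to a separate output file; the input file is not modified.
function rewrite_partitions(input, output, newdata)
    for tbl in Arrow.Stream(input)
        part = DataFrame(tbl; copycols=true)
        write_batch(output, DoesSomeThingOnThisPartition(part, newdata))
    end
    return output
end

dir = mktempdir()
input = joinpath(dir, "input.arrow")    # placeholder names
output = joinpath(dir, "output.arrow")

DF1 = DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,10), Date(2023,4,10)], Time = [3.85, 4.13])
DF2 = DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,12), Date(2023,4,12)], Time = [3.87, 4.14])

for sub in groupby(DF1, :ID)
    write_batch(input, sub)
end

rewrite_partitions(input, output, DF2)
result = DataFrame(Arrow.Table(output))
println(result)
```

This sketch ignores rows in the new data whose :ID has no existing partition; those would need to be written as additional batches at the end.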
However, I think this would require creating a new Arrow.Table in a new file rather than appending to the existing table. That would be tedious and storage intensive, because it would have to be done daily and the file is quite large. (Alternatively, I could delete and recreate the input and output files each time, but this would still be inefficient, because even the many :ID partitions that were not updated on a given day would have to be copied again.) So I am wondering whether there is a smarter way to accomplish this. Any insights would be greatly appreciated. Thank you so much!