Append a `DataFrame` to a partition of an existing `ArrowTable` without creating a new `ArrowTable`?

Hi! Suppose I have the following GroupedDataFrame:

using Arrow, DataFrames, Dates

GDF1 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,10), Date(2023,4,10)], Time = [3.85, 4.13]), :ID)

With Arrow.append I can save each subdataframe of a GroupedDataFrame as a separate partition of an ArrowTable with something like

File = "filepath"
for i in GDF1
    Arrow.append(File, i)
end

I was just wondering: is there a way to append to each partition after it is created? For example, if I wanted to append

GDF2 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,12), Date(2023,4,12)], Time = [3.87, 4.14]), :ID)

to the created Arrow.Table. If I try

File = "filepath"
for i in GDF2
    Arrow.append(File, i)
end

I end up with an Arrow.Table that has 4 partitions and looks like

View = DataFrame(Arrow.Table(File))
4×3 DataFrame
 Row │ ID      Date        Time
     │ String  Date        Float64
─────┼─────────────────────────────
   1 │ Eng1    2023-04-10     3.85
   2 │ Eng2    2023-04-10     4.13
   3 │ Eng1    2023-04-12     3.87
   4 │ Eng2    2023-04-12     4.14
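
One way to confirm how the file is laid out is to iterate it with `Arrow.Stream`, which yields one `Arrow.Table` per record batch (partition). The following is a minimal sketch, assuming the same toy data as above and a temporary file in place of "filepath"; the names `file`, `batch_sizes` and `batch_ids` are illustrative only.

```julia
using Arrow, DataFrames, Dates

# Temporary path used for illustration instead of "filepath".
file = joinpath(mktempdir(), "engines.arrow")

GDF1 = groupby(DataFrame(ID = ["Eng1", "Eng2"],
                         Date = [Date(2023, 4, 10), Date(2023, 4, 10)],
                         Time = [3.85, 4.13]), :ID)
GDF2 = groupby(DataFrame(ID = ["Eng1", "Eng2"],
                         Date = [Date(2023, 4, 12), Date(2023, 4, 12)],
                         Time = [3.87, 4.14]), :ID)

# Each Arrow.append call adds one record batch to the end of the stream.
for gdf in (GDF1, GDF2), sdf in gdf
    Arrow.append(file, sdf)
end

# Arrow.Stream yields one Arrow.Table per record batch, in file order.
batch_sizes = [length(batch.ID) for batch in Arrow.Stream(file)]
batch_ids   = [first(batch.ID) for batch in Arrow.Stream(file)]
```

Here `batch_ids` should list the IDs in the order the batches were appended, which is the interleaved order shown in the table above rather than one contiguous block per ID.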

I would like the resulting Arrow.Table to retain the initial 2 partitions, with the new data appended to each partition, i.e. the result I would have obtained had I done

DF1 = DataFrame(GDF1)
DF2 = DataFrame(GDF2)
DF = vcat(DF1,DF2)
GDF = groupby(DF, :ID)
File = "filepath"

for i in GDF
    Arrow.append(File, i)
end

The resulting DataFrame of the Arrow.Table would look like

4×3 DataFrame
 Row │ ID      Date        Time
     │ String  Date        Float64
─────┼─────────────────────────────
   1 │ Eng1    2023-04-10     3.85
   2 │ Eng1    2023-04-12     3.87
   3 │ Eng2    2023-04-10     4.13
   4 │ Eng2    2023-04-12     4.14

after the inclusion of each new DataFrame. I would like to accomplish this by appending to each existing partition, because new data is added daily to a very large Arrow.Table. Thus I am unable to use the previous method, and calling sort on the entire DataFrame would be very slow or exceed the memory capacity of my machine.

Based on bkamins' explanation of `Arrow.Stream` I can try something like

where DoesSomeThingOnThisPartition would be to vcat any new Data generated for the existing :ID partition.

However I think this would require that I create a new Arrow.Table in a new file rather than appending to the existing Table. I think this would be very tedious and quite storage intensive because it would have to be done daily and the file is quite large.

(I could alternatively delete and re-create the input and output files, but this would still be inefficient, because even the many :ID partitions that were not updated on a given day would have to be copied again.) So I am just wondering whether there is a smarter way to accomplish this. Any insights would be greatly appreciated. Thank you so much!

While waiting for a more specific solution, you could adapt this scheme to your case:


function appendgroup(grp)
    for (i,g) in enumerate(grp)
        k=last(keys(grp)[i])
        Arrow.append("partition_$k"*".arrow", g)
    end
end

If you want to put it all together


arrow_files=["partition_$k"*".arrow" for k in last.(keys(grp))]
Arrow.write("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))
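
Applied to the Eng1/Eng2 data from the question, the scheme above might look like the following. This is only a sketch under stated assumptions: the temporary directory, the helper `partition_path` and the function name `append_groups` are made up for illustration, and `pairs(gdf)` is used instead of `enumerate` to get each group key together with its SubDataFrame.

```julia
using Arrow, DataFrames, Dates, Tables

dir = mktempdir()                                            # illustrative location
partition_path(id) = joinpath(dir, "partition_$(id).arrow")  # hypothetical helper

# Append every group to the stream-format file that belongs to its ID.
function append_groups(gdf)
    for (key, sdf) in pairs(gdf)
        Arrow.append(partition_path(key.ID), sdf)
    end
end

day1 = DataFrame(ID = ["Eng1", "Eng2"],
                 Date = [Date(2023, 4, 10), Date(2023, 4, 10)],
                 Time = [3.85, 4.13])
day2 = DataFrame(ID = ["Eng1", "Eng2"],
                 Date = [Date(2023, 4, 12), Date(2023, 4, 12)],
                 Time = [3.87, 4.14])

append_groups(groupby(day1, :ID))
append_groups(groupby(day2, :ID))

# Each per-ID file now holds all rows for that ID, in arrival order.
eng1 = DataFrame(Arrow.Table(partition_path("Eng1")))

# Optional: build one combined file, one partition per ID.
files = partition_path.(["Eng1", "Eng2"])
combined_path = joinpath(dir, "concatenated.arrow")
Arrow.write(combined_path, Tables.partitioner(Arrow.Table, files))
combined = DataFrame(Arrow.Table(combined_path))
```

With this layout the daily update only touches the files of the IDs that received new rows; the combined file is rewritten only when a single-file view is actually needed.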

PS
I tried to append a file to a file created with Arrow.write, but failed despite following the instructions below

Append any Tables.jl-compatible tbl to an existing arrow formatted file or IO. The existing arrow data must be in IPC stream format. Note that appending to the "feather formatted file" is not allowed, as this file format doesn't support appending. That means files written like Arrow.write(filename::String, tbl) cannot be appended to; instead, you should write like Arrow.write(filename::String, tbl; file=false).
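
The difference described in that docstring can be tried on two small files. This is a minimal sketch with made-up data and temporary paths; it records only whether the second `Arrow.append` call throws and makes no claim about the exact error type.

```julia
using Arrow, DataFrames

dir = mktempdir()
df1 = DataFrame(ID = ["Eng1"], Time = [3.85])
df2 = DataFrame(ID = ["Eng1"], Time = [3.87])

# Stream format: written with file = false, so it can be appended to.
stream_path = joinpath(dir, "stream.arrow")
Arrow.write(stream_path, df1; file = false)
Arrow.append(stream_path, df2)
stream_rows = nrow(DataFrame(Arrow.Table(stream_path)))

# File ("feather") format: the default for Arrow.write(filename, tbl).
feather_path = joinpath(dir, "feather.arrow")
Arrow.write(feather_path, df1)
append_failed = try
    Arrow.append(feather_path, df2)
    false
catch
    true
end
```

If the file is created by `Arrow.append` itself, as in `appendgroup` above, it is written in stream format from the start, so later appends work.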


Hey, thanks so much for pointing this out! So with the above function, am I writing each SubDataFrame of a GroupedDataFrame to a separate Arrow.Table and then concatenating the individual tables? Because then I would still have to somehow match the new data with the existing Arrow.Table by combining the entries and deleting the redundancies, and I don't know if that is possible without bringing the Arrow.Table into memory.

In general is it better to have a bunch of small Arrow.Tables as opposed to one massive one?

What I would like to end up with is something like (clunkifying bkamins' code above)

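# Stream the existing file one partition (record batch) at a time.
MassivePartitionedTable_Input = Arrow.Stream("inputFile.arrow")
vehicle = unique(DataFrame(GDF2).ID)  # IDs that have new data
for eachPartition in MassivePartitionedTable_Input
    df_eachPartition = DataFrame(eachPartition)
    id = df_eachPartition.ID[1]
    if id in vehicle
        # Existing rows first, then the new rows for the same ID.
        newPartition = vcat(df_eachPartition, GDF2[(id,)])
        Arrow.append("outputFile.arrow", newPartition)
        setdiff!(vehicle, [id])
    else
        Arrow.append("outputFile.arrow", df_eachPartition)
    end
end
# IDs still left in `vehicle` had no existing partition; write them as new ones.
for id in vehicle
    Arrow.append("outputFile.arrow", GDF2[(id,)])
end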

but ideally "inputFile" could be the "outputFile", so that I could append to the existing table as opposed to constantly creating a new one partition by partition.

That's strange. I am using Arrow v2.5.0 with DataFrames v1.5.0 and got the following to work

Arrow.write("filepath",DF1; file = false)
Arrow.append("filepath",DF2)

Where DF1 and DF2 are DataFrames of matching schema. But with a GroupedDataFrame I had to use

for g in GDF1
    Arrow.append("filepath", g)
end

for g in GDF2
    Arrow.append("filepath", g)
end

to append, because I guess Arrow.Table squishes the entries of each SubDataFrame into an array, so the schemas of GDF1 and GDF2 are mismatched.

I'm not sure I understood what the basic need is (nor do I have experience using the Arrow package; I'm just trying to learn something while we discuss the case), but I'll try to explain better, with an example, what I meant in the brief answer above.

julia> function appendgroup(grp)
           for (i,g) in enumerate(grp)
               k=join(values(keys(grp)[i]),'_')
               Arrow.append("partition_$k"*".arrow", g)
           end
       end
appendgroup (generic function with 1 method)

julia> data1 = [(ID=rand('a':'d'), val=rand(-100:0)) for _ in 1:40];

julia> df1=DataFrame(data1);

julia> grp1=groupby(df1,:ID);

julia> appendgroup(grp1)

julia> DataFrame(Arrow.Table("partition_d.arrow"))
9×2 DataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ d       -88
   2 │ d       -31
   3 │ d        -8
   4 │ d       -91
   5 │ d       -56
   6 │ d       -24
   7 │ d       -93
   8 │ d       -19
   9 │ d        -1

julia> data2 = [(ID=rand('b':'e'), val=rand(1:100)) for _ in 1:20];

julia> df2=DataFrame(data2);

julia> grp2=groupby(df2,:ID);

julia> appendgroup(grp2)

julia> DataFrame(Arrow.Table("partition_a.arrow"))
8×2 DataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ a       -63
   2 │ a       -98
   3 │ a        -3
   4 │ a        -6
   5 │ a       -40
   6 │ a       -84
   7 │ a       -49
   8 │ a       -57

julia> DataFrame(Arrow.Table("partition_d.arrow"))
11×2 DataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ d       -88
   2 │ d       -31
   3 │ d        -8
   4 │ d       -91
   5 │ d       -56
   6 │ d       -24
   7 │ d       -93
   8 │ d       -19
   9 │ d        -1
  10 │ d        71
  11 │ d        42

julia> DataFrame(Arrow.Table("partition_e.arrow"))
8×2 DataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ e        95
   2 │ e        18
   3 │ e        12
   4 │ e        12
   5 │ e        51
   6 │ e         5
   7 │ e        18
   8 │ e        57

julia> ids=Set([values.(keys(grp1))...,values.(keys(grp2))...])    
Set{Tuple{Char}} with 5 elements:
  ('a',)
  ('e',)
  ('c',)
  ('d',)
  ('b',)

julia> arrow_files=["partition_$k"*".arrow" for k in join.(ids,'_')]
5-element Vector{String}:
 "partition_a.arrow"
 "partition_e.arrow"
 "partition_c.arrow"
 "partition_d.arrow"
 "partition_b.arrow"

julia> Arrow.write("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))
"concatenated.arrow"

I found some references here to the functions used. In particular, regarding the last one, I quote a comment from the author:

Note that this "functional" form of Tables.partitioner applies the mapping function lazily as each partition is processed, to avoid having to load all the arrow tables at once in memory.
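
The lazy behaviour described in that comment can be observed without any Arrow files. The sketch below uses a hypothetical `fake_load` function as a stand-in for `Arrow.Table` and made-up file names; it records which names have been "opened" before and after the partitions are iterated.

```julia
using Tables

loaded = String[]   # records which "files" have been opened so far

# Hypothetical stand-in for Arrow.Table: returns a tiny column table.
function fake_load(name)
    push!(loaded, name)
    return (ID = [name], val = [length(name)])
end

parts = Tables.partitioner(fake_load, ["a.arrow", "b.arrow", "c.arrow"])
loaded_before = copy(loaded)   # nothing has been loaded at this point

for tbl in Tables.partitions(parts)
    # each table is created only when the loop reaches it
end
loaded_after = copy(loaded)
```

A consumer such as `Arrow.write` walks the partitions in the same way, which is why only one source table at a time needs to be materialised by the mapping function.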

concatenated_arrow
julia> DataFrame(Arrow.Table("concatenated.arrow"))
60×2 DataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ a       -63
   2 │ a       -98
   3 │ a        -3
   4 │ a        -6
   5 │ a       -40
   6 │ a       -84
   7 │ a       -49
   8 │ a       -57
   9 │ e        95
  10 │ e        18
  11 │ e        12
  12 │ e        12
  13 │ e        51
  14 │ e         5
  15 │ e        18
  16 │ e        57
  17 │ c       -82
  18 │ c       -86
  ⋮  │  ⋮      ⋮
  43 │ b       -39
  44 │ b       -92
  45 │ b        -1
  46 │ b       -95
  47 │ b       -87
  48 │ b       -75
  49 │ b       -99
  50 │ b       -39
  51 │ b       -31
  52 │ b       -32
  53 │ b       -96
  54 │ b        95
  55 │ b        76
  56 │ b        48
  57 │ b        64
  58 │ b        13
  59 │ b        87
  60 │ b        69
    24 rows omitted

Thanks so much for taking the time to clarify the issue despite my lack of clarity in stating the use case. It's really helpful in sorting out my thinking. Borrowing from your example above, what I mean is: suppose I already have concatenated.arrow saved as an Arrow.Table with the given partitions a, b, c, d, and e (i.e. appendgroup() has already been run on grp1). Now I would like to add additional data directly to partitions a, c and e, as opposed to appending the data to the end of concatenated.arrow and then sorting the entire Arrow.Table. I am just wondering what the best workaround for that would be if I did not want to constantly create and save a new Arrow.Table as per bkamins' example above.


data1 = [(ID=rand('a':'d'), val=rand(-100:0)) for _ in 1:40];
df1=DataFrame(data1);
grp1=groupby(df1,:ID);
appendgroup(grp1)

ids=[values.(keys(grp1))...]

arrow_files=["partition_$k"*".arrow" for k in join.(ids,'_')]
Arrow.append("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))

While waiting for someone who knows more about how these packages work, I'm submitting another thought to you.
Assume it is possible to start from a situation like the previous one, where you have an arrow file obtained from the following expression
Arrow.append("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))
(If what you have is not like that, you could create it once and for all: load the file you have into memory, form the groups (grps), save them with appendgroup(grps), and then use the expression above.)
From this starting situation, the append you ask for can be done like this


data2 = [(ID=rand(['a','c','e']), val=rand(1:100)) for _ in 1:15]
grp2=groupby(DataFrame(data2),:ID)
appendgroup(grp2)


ids=[values.(keys(grp2))...]

arrow_files=["partition_$k"*".arrow" for k in join.(ids,'_')]
Arrow.append("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))

History Log
julia> using Arrow, DataFrames

julia> using Tables

julia> function appendgroup(grp)
           for (i,g) in enumerate(grp)
               k=join(values(keys(grp)[i]),'_')
               Arrow.append("partition_$k"*".arrow", g)
           end
       end
appendgroup (generic function with 1 method)

julia> data1 = [(ID=rand('a':'d'), val=rand(-100:0)) for _ in 1:40];

julia> df1=DataFrame(data1);

julia> grp1=groupby(df1,:ID);

julia> appendgroup(grp1)

julia> ids=[values.(keys(grp1))...]
4-element Vector{Tuple{Char}}:
 ('d',)
 ('b',)
 ('a',)
 ('c',)

julia> arrow_files=["partition_$k"*".arrow" for k in join.(ids,'_')]
4-element Vector{String}:
 "partition_d.arrow"
 "partition_b.arrow"
 "partition_a.arrow"
 "partition_c.arrow"

julia> Arrow.append("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))
"concatenated.arrow"

julia> data2 = [(ID=rand(['a','c','e']), val=rand(1:100)) for _ in 1:15]
15-element Vector{NamedTuple{(:ID, :val), Tuple{Char, Int64}}}:
 (ID = 'e', val = 38)
 (ID = 'c', val = 66)
 (ID = 'e', val = 8)
 (ID = 'a', val = 87)
 (ID = 'e', val = 71)
 (ID = 'a', val = 93)
 (ID = 'e', val = 3)
 (ID = 'a', val = 68)
 (ID = 'c', val = 3)
 (ID = 'e', val = 89)
 (ID = 'e', val = 41)
 (ID = 'c', val = 89)
 (ID = 'e', val = 48)
 (ID = 'a', val = 18)
 (ID = 'a', val = 25)

julia> grp2=groupby(DataFrame(data2),:ID)
GroupedDataFrame with 3 groups based on key: ID
First Group (7 rows): ID = 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ e        38
   2 │ e         8
   3 │ e        71
   4 │ e         3
   5 │ e        89
   6 │ e        41
   7 │ e        48
⋮
Last Group (5 rows): ID = 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ a        87
   2 │ a        93
   3 │ a        68
   4 │ a        18
   5 │ a        25

julia> appendgroup(grp2)

julia> ids=[values.(keys(grp2))...]
3-element Vector{Tuple{Char}}:
 ('e',)
 ('c',)
 ('a',)

julia> arrow_files=["partition_$k"*".arrow" for k in join.(ids,'_')]
3-element Vector{String}:
 "partition_e.arrow"
 "partition_c.arrow"
 "partition_a.arrow"

julia> Arrow.append("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))
"concatenated.arrow"

julia> DataFrame(Arrow.Table("concatenated.arrow"))
69×2 DataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ d       -69
   2 │ d       -91
   3 │ d       -82
   4 │ d       -33
   5 │ d       -22
   6 │ d       -36
   7 │ d       -61
   8 │ d       -23
   9 │ d       -93
  10 │ d        -1
  11 │ d       -45
  12 │ b        -8
  13 │ b       -52
  14 │ b        -6
  15 │ b       -71
  16 │ b       -83
  17 │ b       -23
  18 │ b       -96
  ⋮  │  ⋮      ⋮
  52 │ c       -98
  53 │ c       -23
  54 │ c       -51
  55 │ c       -22
  56 │ c       -43
  57 │ c       -25
  58 │ c        66
  59 │ c         3
  60 │ c        89
  61 │ a       -96
  62 │ a       -83
  63 │ a       -34
  64 │ a       -41
  65 │ a        87
  66 │ a        93
  67 │ a        68
  68 │ a        18
  69 │ a        25
    33 rows omitted

julia> groupby(DataFrame(Arrow.Table("concatenated.arrow")),:ID)[1]
11×2 SubDataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ d       -69
   2 │ d       -91
   3 │ d       -82
   4 │ d       -33
   5 │ d       -22
   6 │ d       -36
   7 │ d       -61
   8 │ d       -23
   9 │ d       -93
  10 │ d        -1
  11 │ d       -45

julia> groupby(DataFrame(Arrow.Table("concatenated.arrow")),:ID)[2]
15×2 SubDataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ b        -8
   2 │ b       -52
   3 │ b        -6
   4 │ b       -71
   5 │ b       -83
   6 │ b       -23
   7 │ b       -96
   8 │ b       -86
   9 │ b       -76
  10 │ b       -79
  11 │ b       -13
  12 │ b       -91
  13 │ b       -84
  14 │ b       -22
  15 │ b       -21

julia> groupby(DataFrame(Arrow.Table("concatenated.arrow")),:ID)[3]
13×2 SubDataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ a       -96
   2 │ a       -83
   3 │ a       -34
   4 │ a       -41
   5 │ a       -96
   6 │ a       -83
   7 │ a       -34
   8 │ a       -41
   9 │ a        87
  10 │ a        93
  11 │ a        68
  12 │ a        18
  13 │ a        25

julia> groupby(DataFrame(Arrow.Table("concatenated.arrow")),:ID)[4]
23×2 SubDataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ c       -85
   2 │ c        -5
   3 │ c       -34
   4 │ c       -51
   5 │ c       -98
   6 │ c       -23
   7 │ c       -51
   8 │ c       -22
   9 │ c       -43
  10 │ c       -25
  11 │ c       -85
  12 │ c        -5
  13 │ c       -34
  14 │ c       -51
  15 │ c       -98
  16 │ c       -23
  17 │ c       -51
  18 │ c       -22
  19 │ c       -43
  20 │ c       -25
  21 │ c        66
  22 │ c         3
  23 │ c        89

julia> groupby(DataFrame(Arrow.Table("concatenated.arrow")),:ID)[5]
7×2 SubDataFrame
 Row │ ID    val
     │ Char  Int64
─────┼─────────────
   1 │ e        38
   2 │ e         8
   3 │ e        71
   4 │ e         3
   5 │ e        89
   6 │ e        41
   7 │ e        48


Thanks so much! I always learn a great deal from your posts. Again, I could be missing something obvious, but I think the above solution creates an Arrow.Table similar to the following.

for SubDataFrame = GroupedDataFrame1 
     Arrow.append("concatenated.arrow", SubDataFrame)
end

for SubDataFrame = GroupedDataFrame2
     Arrow.append("concatenated.arrow", SubDataFrame)
end

This saves each GroupedDataFrame as an Arrow.Table with each SubDataFrame as a separate partition, without having to create separate Arrow.Tables.

However, this still won't have the effect of appending each SubDataFrame in GroupedDataFrame2 to the corresponding SubDataFrame in GroupedDataFrame1 that shares the same key value, which is what I am trying to accomplish without having to rewrite the entire concatenated.arrow file.

Also, I was trying to figure out a simpler way to create the ids vector. I could be wrong, but it seems it can be created without the [] and the splat operator. I think both values.(keys(gdf)) and [values.(keys(gdf))...] return a Vector{Tuple{String}} where

values.(keys(gdf)) == [values.(keys(gdf))...]
true

and

isequal(values.(keys(gdf)), [values.(keys(gdf))...])
true

whereas DataFrame(keys(gdf)).ID would return a Vector{String} so I think we could also do

ids = DataFrame(keys(gdf)).ID
arrow_files=["partition_$k"*".arrow" for k in ids]
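
To make the difference between the tuple keys and the plain ID values concrete, here is a small sketch on made-up data. It uses a comprehension over `keys(gdf)` (reading the `ID` field of each group key) as one more way to get the plain values; the variable names are illustrative only.

```julia
using DataFrames

gdf = groupby(DataFrame(ID = ["Eng1", "Eng2", "Eng1"],
                        Time = [3.85, 4.13, 3.87]), :ID)

tuple_ids = values.(keys(gdf))             # one 1-tuple per group
plain_ids = [key.ID for key in keys(gdf)]  # the ID values themselves

# Interpolating a tuple keeps its parentheses in the file name ...
bad = "partition_$(first(tuple_ids)).arrow"

# ... so the tuple has to be joined first, or the plain values used directly.
joined = ["partition_$(k).arrow" for k in join.(tuple_ids, '_')]
good   = ["partition_$(k).arrow" for k in plain_ids]
```

With a single grouping column both routes give the same file names; `join` only becomes necessary when the groups are keyed on several columns.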