Append a `DataFrame` to a partition of an existing `ArrowTable` without creating a new `ArrowTable`?

phantom · April 14, 2023, 8:51am

Hi! Suppose I have the following GroupedDataFrame:

GDF1 = groupby(DataFrame(ID = [ "Eng1", "Eng2"] , Date = [Date(2023,4,10),Date(2023,4,10)], Time = [3.85, 4.13]), :ID)

With Arrow.append I can save each subdataframe of a GroupedDataFrame as a separate partition of an ArrowTable with something like

File = "filepath"
for i in GDF1
    Arrow.append(File, i)
end
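For reference, here is a minimal self-contained sketch of the loop above, assuming Arrow.jl and DataFrames.jl are installed. The temporary path and the `isfile` guard are my own assumptions (the guard just avoids relying on `Arrow.append` creating a missing file). Each call adds one record batch to the stream, which is why the partition count grows with every append.

```julia
using Arrow, DataFrames, Dates

df = DataFrame(ID = ["Eng1", "Eng2"],
               Date = [Date(2023, 4, 10), Date(2023, 4, 10)],
               Time = [3.85, 4.13])
gdf = groupby(df, :ID)

# Hypothetical location; any writable path works.
path = joinpath(mktempdir(), "engines.arrow")

for sdf in gdf
    if isfile(path)
        Arrow.append(path, DataFrame(sdf))               # adds one more record batch
    else
        Arrow.write(path, DataFrame(sdf); file = false)  # start the file in IPC stream format
    end
end

# Arrow.Stream yields one table per record batch, so this counts the partitions.
nbatches = count(_ -> true, Arrow.Stream(path))
```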

I was wondering whether there is a way to append to each partition after it has been created. For example, suppose I wanted to append

GDF2 = groupby(DataFrame(ID = [ "Eng1", "Eng2"] , Date = [Date(2023,4,12),Date(2023,4,12)], Time = [3.87, 4.14]), :ID)

to the existing Arrow.Table. If I try

File = "filepath"
for i in GDF2
    Arrow.append(File, i)
end

I end up with an Arrow.Table that has 4 partitions and looks like

View = DataFrame(Arrow.Table(File))
4×3 DataFrame
 Row │ ID      Date        Time    
     │ String  Date        Float64 
─────┼─────────────────────────────
   1 │ Eng1    2023-04-10     3.85
   2 │ Eng2    2023-04-10     4.13
   3 │ Eng1    2023-04-12     3.87
   4 │ Eng2    2023-04-12     4.14

I would like the resulting Arrow.Table to retain the initial 2 partitions, with the new data appended to each partition, i.e. the result I would have obtained had I done

DF1 = DataFrame(GDF1)
DF2 = DataFrame(GDF2)
DF = vcat(DF1,DF2)
GDF = groupby(DF, :ID)
File = "filepath"

for i in GDF
    Arrow.append(File, i)
end

The resulting DataFrame of the Arrow.Table would look like

4×3 DataFrame
 Row │ ID      Date        Time    
     │ String  Date        Float64 
─────┼─────────────────────────────
   1 │ Eng1    2023-04-10     3.85
   2 │ Eng1    2023-04-12     3.87
   3 │ Eng2    2023-04-10     4.13
   4 │ Eng2    2023-04-12     4.14

after the inclusion of each new DataFrame. I would like to accomplish this by appending to each existing partition because new data is added daily to a very large Arrow.Table. I therefore cannot use the previous method, and calling sort on the entire DataFrame would be very slow or exceed the memory capacity of my machine.

Based on bkamins' explanation of `Arrow.Stream` I can try something like

Arrow stream usage clarification

MassivePartionedTable_Input = Arrow.Stream("inputFile.arrow")
for eachPartition in MassivePartionedTable_Input
  df_eachPartition  = DataFrame(eachPartition) 
  largerTableOutput  = DoesSomeThingOnThisPartition( df_eachPartition )
  Arrow.append("outputFile.arrow", largerTableOutput)
end

where DoesSomeThingOnThisPartition would vcat any new data generated for the existing :ID partition.

However I think this would require that I create a new Arrow.Table in a new file rather than appending to the existing Table. I think this would be very tedious and quite storage intensive because it would have to be done daily and the file is quite large.

(I could alternately delete and create Input and output files but this would still be inefficient because even the many :ID partitions that had not updated on a given day would have to be re-copied.) So just wondering if there was a smarter way to accomplish this? Any insights would be greatly appreciated. Thank you so much!
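One possible workaround, as far as I understand the format: an Arrow IPC stream is an append-only sequence of record batches, so an existing batch cannot be grown in place. A sketch of the alternative is to keep appending and do the grouping at read time, one batch at a time, so the whole table never has to sit in memory. The helper name `rows_for_id` and the tiny demo data are hypothetical.

```julia
using Arrow, DataFrames

# Hypothetical helper: collect every row for one ID by scanning the batches one at a time.
function rows_for_id(path, id)
    parts = DataFrame[]
    for batch in Arrow.Stream(path)
        df = DataFrame(batch)
        sel = df[df.ID .== id, :]          # copies only the matching rows
        isempty(sel) || push!(parts, sel)
    end
    return isempty(parts) ? DataFrame() : reduce(vcat, parts)
end

# Small demo file: three batches, two of them belonging to "Eng1".
demo_path = joinpath(mktempdir(), "demo.arrow")
Arrow.write(demo_path, DataFrame(ID = ["Eng1"], Time = [3.85]); file = false)
Arrow.append(demo_path, DataFrame(ID = ["Eng2"], Time = [4.13]))
Arrow.append(demo_path, DataFrame(ID = ["Eng1"], Time = [3.87]))

eng1 = rows_for_id(demo_path, "Eng1")
```

This trades a full scan on every read for cheap daily appends, so it only helps if reads per ID are rare compared to writes.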

rocco_sprmnt21 · April 14, 2023, 9:57pm

While waiting for a more specific solution, you could adapt this scheme to your case:


function appendgroup(grp)
    for (i,g) in enumerate(grp)
        k=last(keys(grp)[i])
        Arrow.append("partition_$k"*".arrow", g)
    end
end
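A slightly more defensive variant of the same idea, as a sketch only: it iterates `pairs(grp)` to get each key together with its group, and guards the first write with `isfile`. The names `partition_path` and `append_groups!`, the single `:ID` grouping column, and the directory layout are assumptions for illustration.

```julia
using Arrow, DataFrames

# Hypothetical layout: one stream-format file per ID inside `dir`.
partition_path(dir, id) = joinpath(dir, "partition_$(id).arrow")

function append_groups!(dir, gdf)
    for (key, sdf) in pairs(gdf)
        path = partition_path(dir, key.ID)       # assumes the grouping column is :ID
        if isfile(path)
            Arrow.append(path, DataFrame(sdf))   # only this ID's file is touched
        else
            Arrow.write(path, DataFrame(sdf); file = false)
        end
    end
    return dir
end

dir = mktempdir()
day1 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Time = [3.85, 4.13]), :ID)
day2 = groupby(DataFrame(ID = ["Eng1"], Time = [3.87]), :ID)
append_groups!(dir, day1)
append_groups!(dir, day2)

eng1_times = DataFrame(Arrow.Table(partition_path(dir, "Eng1"))).Time
eng2_times = DataFrame(Arrow.Table(partition_path(dir, "Eng2"))).Time
```

With this layout, IDs that received no new data on a given day are not rewritten at all.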

If you want to put it all together


arrow_files=["partition_$k"*".arrow" for k in last.(keys(grp))]
Arrow.write("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))

PS
I tried to append to a file created with Arrow.write, but it failed even though I followed the instructions below:

Append any Tables.jl-compatible tbl to an existing arrow formatted file or IO. The existing arrow data must be in IPC stream format. Note that appending to the “feather formatted file” is not allowed, as this file format doesn’t support appending. That means files written like Arrow.write(filename::String, tbl) cannot be appended to; instead, you should write like Arrow.write(filename::String, tbl; file=false).
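To make the quoted note concrete, here is a small sketch (paths and data are made up) that writes the same table in both formats and then tries to append to each. The `try`/`catch` only records whether the second append failed; I am not relying on a particular exception type.

```julia
using Arrow, DataFrames

dir = mktempdir()
df = DataFrame(ID = ["Eng1"], Time = [3.85])

# Stream format: appending is allowed.
stream_path = joinpath(dir, "stream.arrow")
Arrow.write(stream_path, df; file = false)
Arrow.append(stream_path, df)
stream_rows = nrow(DataFrame(Arrow.Table(stream_path)))

# Default file ("feather") format: per the docstring, appending is not allowed.
file_path = joinpath(dir, "file.arrow")
Arrow.write(file_path, df)
append_failed = try
    Arrow.append(file_path, df)
    false
catch
    true
end
```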

phantom · April 15, 2023, 3:51am

Hey, thanks so much for pointing this out! So with the above function, am I writing each SubDataFrame of a GroupedDataFrame to a separate Arrow.Table and then concatenating the individual tables? If so, I would still have to somehow match the new data with the existing Arrow.Table by combining the entries and deleting the redundancies, and I don’t know if that is possible without bringing the Arrow.Table into memory.

In general is it better to have a bunch of small Arrow.Tables as opposed to one massive one?

What I would like to end up with is something like (clunkifying bkamins’ code above)

MassivePartionedTable_Input = Arrow.Stream("inputFile.arrow")
vehicle = unique(DataFrame(GDF2).ID) 
for eachPartition in MassivePartionedTable_Input
    df_eachPartition  = DataFrame(eachPartition) 
    id = df_eachPartition.ID[1] 
    if id in(vehicle)
        newPartition = vcat(GDF2[(id,)], df_eachPartition)
        Arrow.append("outputFile.arrow", newPartition)
        setdiff!(vehicle, [id]) 
    else 
        Arrow.append("outputFile.arrow", df_eachPartition)
    end
end

but ideally “inputFile” could be the “outputFile” so I could append to the existing table as opposed to constantly creating a new one via individual partition.

That’s strange. I am using Arrow v2.5.0 with DataFrames v1.5.0 and got the following to work

Arrow.write("filepath",DF1; file = false)
Arrow.append("filepath",DF2)

Where DF1 and DF2 are DataFrames of matching schema. But with a GroupedDataFrame I had to use

for g in GDF1
    Arrow.append("filepath", g)
end

for g in GDF2
    Arrow.append("filepath", g)
end

to append, because I guess Arrow.Table squishes the entries of each SubDataFrame into an array, so the schemas of GDF1 and GDF2 are mismatched.

rocco_sprmnt21 · April 15, 2023, 7:55am

I’m not sure I understood what the basic need is (nor do I have experience using the Arrow package; I’m just trying to learn something while we discuss the case), but I’ll try to explain what I meant in the brief answer above with an example.

julia> function appendgroup(grp)
           for (i,g) in enumerate(grp)
               k=join(values(keys(grp)[i]),'_')
               Arrow.append("partition_$k"*".arrow", g)
           end
       end
appendgroup (generic function with 1 method)

julia> data1 = [(ID=rand('a':'d'), val=rand(-100:0)) for _ in 1:40];

julia> df1=DataFrame(data1);

julia> grp1=groupby(df1,:ID);

julia> appendgroup(grp1)

julia> DataFrame(Arrow.Table("partition_d.arrow"))
9×2 DataFrame
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ d       -88
   2 │ d       -31
   3 │ d        -8
   4 │ d       -91
   5 │ d       -56
   6 │ d       -24
   7 │ d       -93
   8 │ d       -19
   9 │ d        -1

julia> data2 = [(ID=rand('b':'e'), val=rand(1:100)) for _ in 1:20];

julia> df2=DataFrame(data2);

julia> grp2=groupby(df2,:ID);

julia> appendgroup(grp2)

julia> DataFrame(Arrow.Table("partition_a.arrow"))
8×2 DataFrame
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ a       -63
   2 │ a       -98
   3 │ a        -3
   4 │ a        -6
   5 │ a       -40
   6 │ a       -84
   7 │ a       -49
   8 │ a       -57

julia> DataFrame(Arrow.Table("partition_d.arrow"))
11×2 DataFrame
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ d       -88
   2 │ d       -31
   3 │ d        -8
   4 │ d       -91
   5 │ d       -56
   6 │ d       -24
   7 │ d       -93
   8 │ d       -19
   9 │ d        -1
  10 │ d        71
  11 │ d        42

julia> DataFrame(Arrow.Table("partition_e.arrow"))
8×2 DataFrame
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ e        95
   2 │ e        18
   3 │ e        12
   4 │ e        12
   5 │ e        51
   6 │ e         5
   7 │ e        18
   8 │ e        57

julia> ids=Set([values.(keys(grp1))...,values.(keys(grp2))...])    
Set{Tuple{Char}} with 5 elements:
  ('a',)
  ('e',)
  ('c',)
  ('d',)
  ('b',)

julia> arrow_files=["partition_$k"*".arrow" for k in join.(ids,'_')]
5-element Vector{String}:
 "partition_a.arrow"
 "partition_e.arrow"
 "partition_c.arrow"
 "partition_d.arrow"
 "partition_b.arrow"

julia> Arrow.write("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))
"concatenated.arrow"

I found some references here to the functions used. In particular, regarding the last one, here is a comment from the author:

Note that this “functional” form of Tables.partitioner applies the mapping function lazily as each partition is processed, to avoid having to load all the arrow tables at once in memory.
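The laziness mentioned in that comment can be checked with Tables.jl alone. In this sketch the mapping function returns a tiny NamedTuple-of-vectors table as a stand-in for `Arrow.Table(file)`, and the `calls` counter is only there to show when the function actually runs.

```julia
using Tables

calls = Ref(0)   # counts how many times the mapping function has run

# Stand-in for x -> Arrow.Table(x): builds a small column table per input.
parts = Tables.partitioner(1:3) do i
    calls[] += 1
    (ID = fill(i, 2), val = [10i, 10i + 1])
end

calls_before = calls[]   # the partitioner has been built, but nothing is loaded yet

# Consuming the partitions triggers the mapping function one partition at a time.
total_rows = sum(length(p.val) for p in Tables.partitions(parts))
calls_after = calls[]
```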

concatenated_arrow

julia> DataFrame(Arrow.Table("concatenated.arrow"))
60×2 DataFrame
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ a       -63
   2 │ a       -98
   3 │ a        -3
   4 │ a        -6
   5 │ a       -40
   6 │ a       -84
   7 │ a       -49
   8 │ a       -57
   9 │ e        95
  10 │ e        18
  11 │ e        12
  12 │ e        12
  13 │ e        51
  14 │ e         5
  15 │ e        18
  16 │ e        57
  17 │ c       -82
  18 │ c       -86
  ⋮  │  ⋮      ⋮
  43 │ b       -39
  44 │ b       -92
  45 │ b        -1
  46 │ b       -95
  47 │ b       -87
  48 │ b       -75
  49 │ b       -99
  50 │ b       -39
  51 │ b       -31
  52 │ b       -32
  53 │ b       -96
  54 │ b        95
  55 │ b        76
  56 │ b        48
  57 │ b        64
  58 │ b        13
  59 │ b        87
  60 │ b        69
    24 rows omitted

phantom · April 18, 2023, 5:46am

Thanks so much for taking the time to clarify the issue despite my lack of clarity in stating the use case. It’s really helpful in sorting out my thinking. Borrowing from your example above, what I mean is this: suppose I already have concatenated.arrow saved as an Arrow.Table with the given partitions a, b, c, d, and e (i.e. appendgroup() has already been run on grp1). Now I would like to add additional data directly to partitions a, c, and e, as opposed to appending the data to the end of concatenated.arrow and then sorting the entire Arrow.Table. What would be the best workaround if I did not want to constantly create and save a new Arrow.Table as per bkamins’ example above?

rocco_sprmnt21 · April 19, 2023, 8:44am


data1 = [(ID=rand('a':'d'), val=rand(-100:0)) for _ in 1:40];
df1=DataFrame(data1);
grp1=groupby(df1,:ID);
appendgroup(grp1)

ids=[values.(keys(grp1))...]

arrow_files=["partition_$k"*".arrow" for k in join.(ids,'_')]
Arrow.append("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))

While waiting for someone who knows more about how these packages work, here is another thought.
Assume it is possible to start from a situation like the previous one, where you have an arrow file obtained from the following expression:
Arrow.append("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))
(If what you have is not like that, maybe you could create it once and for all: load the file you have into memory, make the groups (grps), save them with appendgroup(grps), and then use the expression.)
From this starting situation, the append you ask for can be done like this:


data2 = [(ID=rand(['a','c','e']), val=rand(1:100)) for _ in 1:15]
grp2=groupby(DataFrame(data2),:ID)
appendgroup(grp2)


ids=[values.(keys(grp2))...]

arrow_files=["partition_$k"*".arrow" for k in join.(ids,'_')]
Arrow.append("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))

History Log

julia> using Arrow, DataFrames

julia> using Tables

julia> function appendgroup(grp)
           for (i,g) in enumerate(grp)
               k=join(values(keys(grp)[i]),'_')
               Arrow.append("partition_$k"*".arrow", g)
           end
       end
appendgroup (generic function with 1 method)

julia> data1 = [(ID=rand('a':'d'), val=rand(-100:0)) for _ in 1:40];

julia> df1=DataFrame(data1);

julia> grp1=groupby(df1,:ID);

julia> appendgroup(grp1)

julia> ids=[values.(keys(grp1))...]
4-element Vector{Tuple{Char}}:
 ('d',)
 ('b',)
 ('a',)
 ('c',)

julia> arrow_files=["partition_$k"*".arrow" for k in join.(ids,'_')]
4-element Vector{String}:
 "partition_d.arrow"
 "partition_b.arrow"
 "partition_a.arrow"
 "partition_c.arrow"

julia> Arrow.append("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))
"concatenated.arrow"

julia> data2 = [(ID=rand(['a','c','e']), val=rand(1:100)) for _ in 1:15]
15-element Vector{NamedTuple{(:ID, :val), Tuple{Char, Int64}}}:
 (ID = 'e', val = 38)
 (ID = 'c', val = 66)
 (ID = 'e', val = 8)
 (ID = 'a', val = 87)
 (ID = 'e', val = 71)
 (ID = 'a', val = 93)
 (ID = 'e', val = 3)
 (ID = 'a', val = 68)
 (ID = 'c', val = 3)
 (ID = 'e', val = 89)
 (ID = 'e', val = 41)
 (ID = 'c', val = 89)
 (ID = 'e', val = 48)
 (ID = 'a', val = 18)
 (ID = 'a', val = 25)

julia> grp2=groupby(DataFrame(data2),:ID)
GroupedDataFrame with 3 groups based on key: ID
First Group (7 rows): ID = 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ e        38
   2 │ e         8
   3 │ e        71
   4 │ e         3
   5 │ e        89
   6 │ e        41
   7 │ e        48
⋮
Last Group (5 rows): ID = 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ a        87
   2 │ a        93
   3 │ a        68
   4 │ a        18
   5 │ a        25

julia> appendgroup(grp2)

julia> ids=[values.(keys(grp2))...]
3-element Vector{Tuple{Char}}:
 ('e',)
 ('c',)
 ('a',)

julia> arrow_files=["partition_$k"*".arrow" for k in join.(ids,'_')]
3-element Vector{String}:
 "partition_e.arrow"
 "partition_c.arrow"
 "partition_a.arrow"

julia> Arrow.append("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files))
"concatenated.arrow"

julia> DataFrame(Arrow.Table("concatenated.arrow"))
69×2 DataFrame
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ d       -69
   2 │ d       -91
   3 │ d       -82
   4 │ d       -33
   5 │ d       -22
   6 │ d       -36
   7 │ d       -61
   8 │ d       -23
   9 │ d       -93
  10 │ d        -1
  11 │ d       -45
  12 │ b        -8
  13 │ b       -52
  14 │ b        -6
  15 │ b       -71
  16 │ b       -83
  17 │ b       -23
  18 │ b       -96
  ⋮  │  ⋮      ⋮
  52 │ c       -98
  53 │ c       -23
  54 │ c       -51
  55 │ c       -22
  56 │ c       -43
  57 │ c       -25
  58 │ c        66
  59 │ c         3
  60 │ c        89
  61 │ a       -96
  62 │ a       -83
  63 │ a       -34
  64 │ a       -41
  65 │ a        87
  66 │ a        93
  67 │ a        68
  68 │ a        18
  69 │ a        25
    33 rows omitted
julia> groupby(DataFrame(Arrow.Table("concatenated.arrow")),:ID)[1]
11×2 SubDataFrame
 Row │ ID    val   
     │ Char  Int64 
─────┼─────────────
   1 │ d       -69
   2 │ d       -91
   3 │ d       -82
   4 │ d       -33
   5 │ d       -22
   6 │ d       -36
   7 │ d       -61
   8 │ d       -23
   9 │ d       -93
  10 │ d        -1
  11 │ d       -45

julia> groupby(DataFrame(Arrow.Table("concatenated.arrow")),:ID)[2]
15×2 SubDataFrame
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ b        -8
   2 │ b       -52
   3 │ b        -6
   4 │ b       -71
   5 │ b       -83
   6 │ b       -23
   7 │ b       -96
   8 │ b       -86
   9 │ b       -76
  10 │ b       -79
  11 │ b       -13
  12 │ b       -91
  13 │ b       -84
  14 │ b       -22
  15 │ b       -21

julia> groupby(DataFrame(Arrow.Table("concatenated.arrow")),:ID)[3]
13×2 SubDataFrame
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ a       -96
   2 │ a       -83
   3 │ a       -34
   4 │ a       -41
   5 │ a       -96
   6 │ a       -83
   7 │ a       -34
   8 │ a       -41
   9 │ a        87
  10 │ a        93
  11 │ a        68
  12 │ a        18
  13 │ a        25
julia> groupby(DataFrame(Arrow.Table("concatenated.arrow")),:ID)[4]
23×2 SubDataFrame
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ c       -85
   2 │ c        -5
   3 │ c       -34
   4 │ c       -51
   5 │ c       -98
   6 │ c       -23
   7 │ c       -51
   8 │ c       -22
   9 │ c       -43
  10 │ c       -25
  11 │ c       -85
  12 │ c        -5
  13 │ c       -34
  14 │ c       -51
  15 │ c       -98
  16 │ c       -23
  17 │ c       -51
  18 │ c       -22
  19 │ c       -43
  20 │ c       -25
  21 │ c        66
  22 │ c         3
  23 │ c        89

julia> groupby(DataFrame(Arrow.Table("concatenated.arrow")),:ID)[5]
7×2 SubDataFrame
 Row │ ID    val   
     │ Char  Int64
─────┼─────────────
   1 │ e        38
   2 │ e         8
   3 │ e        71
   4 │ e         3
   5 │ e        89
   6 │ e        41
   7 │ e        48

phantom · April 21, 2023, 1:24am

Thanks so much! I always learn a great deal from your posts. Again, I could be missing something obvious, but I think the above solution creates an Arrow.Table similar to the following.

for SubDataFrame = GroupedDataFrame1 
     Arrow.append("concatenated.arrow", SubDataFrame)
end

for SubDataFrame = GroupedDataFrame2
     Arrow.append("concatenated.arrow", SubDataFrame)
end

This saves each GroupedDataFrame as an Arrow.Table with each SubDataFrame as a separate partition, without having to create separate Arrow.Tables.

However, this still won’t have the effect of appending each SubDataFrame in GroupedDataFrame2 to the corresponding SubDataFrame in GroupedDataFrame1 that shares the same key value, which is what I am trying to accomplish without having to rewrite the entire concatenated.arrow file.

phantom · April 21, 2023, 1:28am

Also, I was trying to figure out a simpler way to create the ids vector. I could be wrong, but it seems it can be created without the [] and the splat operator. I think both values.(keys(gdf)) and [values.(keys(gdf))...] return a Vector{Tuple{String}}, where

values.(keys(gdf)) == [values.(keys(gdf))...]
true

and

isequal(values.(keys(gdf)), [values.(keys(gdf))...])
true

whereas DataFrame(keys(gdf)).ID would return a Vector{String} so I think we could also do

ids = DataFrame(keys(gdf)).ID
arrow_files=["partition_$k"*".arrow" for k in ids]
