Is it possible to join DataFrame with Arrow Table ensuring unique rows without bringing Arrow Table into RAM?

I’m sure there is probably a better approach I am not aware of but I am trying to add a new DataFrame, NewDF to an existing Arrow table while eliminating any rows in NewDF that already exist in the Arrow file.

Something like this

database = "filepath" # path to the Arrow file
dataDF = copy(DataFrame(Arrow.Table(database)))
dataDF = vcat(NewDF, dataDF)
unique!(dataDF)
Arrow.write(database, dataDF)

However, the Arrow file has grown too large and I am having a hard time copying the Arrow table into a DataFrame to make it mutable. (My understanding is that Arrow files are otherwise immutable.)

My questions are:

  1. Is there a better way to store and add to large data sets?
    (I’m probably doing something wrong, but I’m working with 64 GB of RAM, and the Arrow file, which is about 13 GB, often stalls or crashes my computer when I try to copy it to a DataFrame.)

  2. Is there a way to efficiently check if the rows of NewDF already exist in dataDF given
    a: dataDF is immutable
    b: the different sizes of the DataFrames
    c: the different order of the rows?
    e.g.

dataDF = DataFrame(x = ["a","b","c"], y = [1,2,3] , z = [today(), today()+Day(1), today()+Day(2)] , a =[4.0,5.0,6.0] )
3×4 DataFrame
 Row │ x       y      z           a
     │ String  Int64  Date        Float64
─────┼────────────────────────────────────
   1 │ a           1  2023-03-30      4.0
   2 │ b           2  2023-03-31      5.0
   3 │ c           3  2023-04-01      6.0

NewDF = DataFrame(x = ["c", "d"], y = [3, 8] , z = [today()+Day(2), today()+Day(3)] , a =[6.0, 7.0] )
2×4 DataFrame
 Row │ x       y      z           a
     │ String  Int64  Date        Float64
─────┼────────────────────────────────────
   1 │ c           3  2023-04-01      6.0
   2 │ d           8  2023-04-02      7.0

I would like the resulting DataFrame to be something like

4×4 DataFrame
 Row │ x       y      z           a
     │ String  Int64  Date        Float64
─────┼────────────────────────────────────
   1 │ a           1  2023-03-30      4.0
   2 │ b           2  2023-03-31      5.0
   3 │ c           3  2023-04-01      6.0
   4 │ d           8  2023-04-02      7.0

I don’t think isequal helps in this situation. The only way I could think of doing it was creating a completely new DataFrame with something like

function addunique(NewDF, dataDF)
    DF = DataFrame()
    for row in eachrow(NewDF)
        !in(row, eachrow(dataDF)) && push!(DF, row)
    end
    DF
end

Is this an efficient approach? dataDF has hundreds of millions of rows so any suggestions would be greatly appreciated!
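For what it’s worth, a row-by-row `in` test over `eachrow(dataDF)` scans the whole table for every row of `NewDF`. A minimal sketch of a hashed alternative (the `addunique_set` name and the tuple-per-row keys are my own choices, not from the thread): build a `Set` of row tuples once, then each membership test is roughly O(1).

```julia
using DataFrames

# Hypothetical helper: collect each existing row as a Tuple into a Set,
# then keep only the rows of NewDF whose tuple is not already present.
function addunique_set(NewDF::DataFrame, dataDF::DataFrame)
    seen = Set(Tuple(row) for row in eachrow(dataDF))
    mask = [!(Tuple(row) in seen) for row in eachrow(NewDF)]
    NewDF[mask, :]
end

dataDF = DataFrame(x = ["a", "b", "c"], y = [1, 2, 3])
NewDF  = DataFrame(x = ["c", "d"], y = [3, 8])
addunique_set(NewDF, dataDF)  # keeps only the ("d", 8) row
```

The trade-off is memory: the `Set` holds one tuple per row of `dataDF`, which may itself be too much for hundreds of millions of rows.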

Also, I know that DF[:, cols] creates a copy of the selected columns whereas using ! as the row selector in DF[!, cols] returns them without copying. Is there an equivalent way to select rows as views in DataFrames? It seems in the function above I would be creating a lot of copies unnecessarily.

Dear @phantom,

I’m also pretty new to Julia, however I know that there are some options for joining DataFrames together. One is mentioned on stackoverflow (link), which should work as follows in your case:

outerjoin(dataDF, NewDF, on = intersect(names(dataDF), names(NewDF)))

I am not sure if this will also work for large datasets. It should take dataDF as the main DataFrame and not change it; the different order of rows and the different sizes of the DataFrames should not matter.
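On the small example above (columns trimmed to keep it short), a quick check suggests joining on every shared column does collapse fully identical rows:

```julia
using DataFrames

dataDF = DataFrame(x = ["a", "b", "c"], y = [1, 2, 3])
NewDF  = DataFrame(x = ["c", "d"], y = [3, 8])

# Join on all shared columns, so a row must match in every column to merge.
joined = outerjoin(dataDF, NewDF, on = intersect(names(dataDF), names(NewDF)))
# nrow(joined) == 4; the shared ("c", 3) row appears only once
```

One caveat: with `on` covering all columns there are no non-key columns left, and if either table already contains internal duplicates, a join will multiply them, so a `unique!` beforehand may still be needed.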


@JorizovdZ Thanks so much! This was really helpful. I had no idea outerjoin could be used like that! With the outerjoin method do I encounter the same hurdle of bringing the Arrow table into RAM? It seems I would have to do so in order to save the joined DataFrame? I think I might be able to use something like Arrow.append but this creates the problem of adding potentially non-unique entries.

Also I came across this guide to transformations of DataFrames just in case you might find it helpful with other DataFrame work.
A Complete Guide to Efficient Transformations of data frames | BogumiΕ‚ KamiΕ„ski | JuliaCon 2022 - YouTube

So combining @JorizovdZ and @rocco_sprmnt21’s solution here and @bkamins solution here thus far I think the best way to combine a new DataFrame with an existing Arrow file ensuring unique values without bringing the Arrow file into memory would be something like

database = "filepath"
Arrow.write(database, dataDF, file = false) # when saving dataDF as an Arrow file, the keyword file has to be set to false
dataDF = DataFrame(Arrow.Table(database))
NewDF = antijoin(NewDF, dataDF, on = intersect(names(NewDF), names(dataDF)))
Arrow.append(database, NewDF)
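Put together on a toy file (the `tempname()` path is just for illustration), the round trip might look like this. Note that `Arrow.append` only works on the Arrow stream format, which is why `file = false` matters:

```julia
using Arrow, DataFrames

path = tempname()
dataDF = DataFrame(x = ["a", "b", "c"], y = [1, 2, 3])
Arrow.write(path, dataDF, file = false)  # stream format, so it can be appended to later

NewDF = DataFrame(x = ["c", "d"], y = [3, 8])
existing = DataFrame(Arrow.Table(path))  # columns wrap the memory-mapped Arrow data
tokeep = antijoin(NewDF, existing, on = intersect(names(NewDF), names(existing)))
Arrow.append(path, tokeep)  # appends a new record batch with only the unseen rows

final = DataFrame(Arrow.Table(path))  # a, b, c, d — the duplicate "c" row was dropped
```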