Filtering/joining multiple data frames with respect to selected keys

johnbb · July 3, 2023, 7:56am

I have an array/vector of data frames where each data frame has columns site and time and some others with common names. How can I filter the data frames such that each data frame has exactly the same rows for site and time? Are there any simple solutions?

nilshg · July 3, 2023, 8:10am

I think you’re looking for innerjoin

help?> innerjoin
search: innerjoin

  innerjoin(df1, df2; on, makeunique=false, validate=(false, false),
            renamecols=(identity => identity), matchmissing=:error,
            order=:undefined)
  innerjoin(df1, df2, dfs...; on, makeunique=false,
            validate=(false, false), matchmissing=:error,
            order=:undefined)


  Perform an inner join of two or more data frame objects and return a DataFrame containing the result. An inner join includes rows with keys that match in all passed data frames.

  In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df1. This behavior may change in future releases.

johnbb · July 3, 2023, 8:55am

Yes, but then I end up with a single data frame, I presume. Ideally I would like to stick to an array of data frames.

bkamins · July 3, 2023, 9:11am

I think you want to apply semijoin to each of the data frames. where the right data frame would contain innerjoin of key columns of all data frames.

Dan · July 3, 2023, 9:43am

In case I understand the problem correctly, the following might be a solution:

julia> using DataFrames, Random

julia> dfvec = [DataFrame(site = sample(1:9,7; replace=false),
                          time = rand(1:1000,7),
                          other = [randstring(5) for _ in 1:7]) for _ in 1:5]
5-element Vector{DataFrame}:
 7×3 DataFrame
 Row │ site   time   other  
     │ Int64  Int64  String 
─────┼──────────────────────
   1 │     1    979  N3b4d
   2 │     4    484  rdzmh
   3 │     8    572  PppG5
   4 │     2    412  njGc2
...

julia> intersect([df.site for df in dfvec]...)
4-element Vector{Int64}:
 1
 4
 8
 3

julia> [leftjoin(DataFrame(site=intersect([df.site for df in dfvec]...)), df; on=:site) for df in dfvec]
5-element Vector{DataFrame}:
 4×3 DataFrame
 Row │ site   time    other   
     │ Int64  Int64?  String? 
─────┼────────────────────────
   1 │     1     979  N3b4d
   2 │     4     484  rdzmh
   3 │     8     572  PppG5
   4 │     3     345  5wHyL
...

There might be more efficient solutions.

johnbb · July 3, 2023, 10:19am

Thanks for the suggestions. Here is how I implemented @bkamins’ appealing proposal assuming data is the array of data frames

st = [d[:, [:site, :time]] for d in data]
st = innerjoin(st...; on = names(st[1]))
for i in eachindex(data)
    data[i] = semijoin(data[i], st; on = names(st))
end

rocco_sprmnt21 · July 4, 2023, 11:43am

dfvec = [DataFrame(site = rand(1:5,20),
                          time = rand(1:5,20),
                          other = [randstring(5) for _ in 1:20],
                          dfi=id) for id in 1:3]


grp=subset(groupby(vcat(dfvec...),[:site,:time]),:dfi=>g->length(unique(g))==length(dfvec), ungroup=true)
dfg=groupby(grp, :dfi)
# the following expression is not strictly necessary, as you can access individual subdataframes via dfg[index]


[DataFrame(g) for g in dfg]

rocco_sprmnt21 · July 5, 2023, 1:29pm

a 6X (?? not sure of the measure) faster alternative on a vector of the following type

julia> dfvec = [DataFrame(site = rand(1:5*10^2,10^5),
                                 time = rand(1:5*10^2,10^5),
                                 other = [randstring(5) for _ in 1:10^5],
                                 ) for id in 1:10]
10-element Vector{DataFrame}:
 100000×3 DataFrame
...

function func21(dfvec)
    sort!.(dfvec)
    comm=intersect([zip(df.site,df.time) for df in dfvec]...)
    [let tw=collect(zip(dfvec[i].site,dfvec[i].time)); @view dfvec[i][reduce(vcat,searchsorted(tw, c) for c in comm),:] end for i in eachindex(dfvec)]
end

PS
I wonder if searchsorted also worked with zips, without the need to do a collect, would it be faster !?

johnbb · July 5, 2023, 2:03pm

Thanks for looking into this and proposing alternative solutions. I must admit I have a preference for @bkamins approach, though.

Below are some details and a benchmark on my original data which is a bit more complicated than first described (unequal number of columns and four key-columns).

julia> map(size, data)
4-element Vector{Tuple{Int64, Int64}}:
 (1281165, 10)
 (1267162, 10)
 (1251500, 210)
 (1265222, 66)

julia> @time begin
              st = [d[:, [:site, :time, :time_ref, :lt]] for d in data]
              st = innerjoin(st...; on = names(st[1]))
              for i in eachindex(data)
                  data[i] = semijoin(data[i], st; on = names(st))
              end
              end
  7.062781 seconds (67.73 k allocations: 4.678 GiB, 5.83% gc time, 0.59% compilation time)

rocco_sprmnt21 · July 6, 2023, 4:52pm

Totally okay with using the DataFrames functions.
I just tried to figure out some algorithm that would approach the performance of DataFrames.
But, as experienced on other occasions, it is not easy to compete with DataFrames even in specific situations where the particularities of the case can potentially be exploited (something that a generalist package cannot always do).
A side comment to this is that while Julia’s for loops are efficient, proper use of expert-written functions is preferable.