I have an array/vector of data frames where each data frame has columns site and time and some others with common names. How can I filter the data frames such that each data frame has exactly the same rows for site and time? Are there any simple solutions?
I think you're looking for innerjoin
help?> innerjoin
search: innerjoin
innerjoin(df1, df2; on, makeunique=false, validate=(false, false),
renamecols=(identity => identity), matchmissing=:error,
order=:undefined)
innerjoin(df1, df2, dfs...; on, makeunique=false,
validate=(false, false), matchmissing=:error,
order=:undefined)
Perform an inner join of two or more data frame objects and return a DataFrame containing the result. An inner join includes rows with keys that match in all passed data frames.
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df1. This behavior may change in future releases.
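To make the help text concrete, here is a minimal illustration with two toy frames (not the original data), joining on both key columns:

```julia
using DataFrames

df1 = DataFrame(site = [1, 2, 3], time = [10, 20, 30], a = ["x", "y", "z"])
df2 = DataFrame(site = [2, 3, 4], time = [20, 30, 40], b = [1.0, 2.0, 3.0])

# Keep only rows whose (site, time) key appears in both frames
joined = innerjoin(df1, df2; on = [:site, :time])
```

Here `joined` has two rows (keys (2, 20) and (3, 30)) and carries the non-key columns of both inputs.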
Yes, but then I end up with a single data frame, I presume. Ideally I would like to stick to an array of data frames.
I think you want to apply semijoin
to each of the data frames, where the right data frame would contain the innerjoin
of the key columns of all data frames.
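To illustrate what semijoin does on its own (toy data): it filters the left frame to the rows whose keys occur in the right frame, keeping only the left frame's columns.

```julia
using DataFrames

df    = DataFrame(site = [1, 2, 3], time = [10, 20, 30], other = ["a", "b", "c"])
keydf = DataFrame(site = [1, 3], time = [10, 30])

# Rows of df whose (site, time) pair occurs in keydf; columns come from df only
filtered = semijoin(df, keydf; on = [:site, :time])
```

So with a `keydf` built as the innerjoin of all frames' key columns, semijoin restricts each frame to the shared keys without adding any columns.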
In case I understand the problem correctly, the following might be a solution:
julia> using DataFrames, Random, StatsBase   # sample is from StatsBase
julia> dfvec = [DataFrame(site = sample(1:9,7; replace=false),
time = rand(1:1000,7),
other = [randstring(5) for _ in 1:7]) for _ in 1:5]
5-element Vector{DataFrame}:
 7×3 DataFrame
 Row │ site   time   other
     │ Int64  Int64  String
─────┼──────────────────────
   1 │     1    979  N3b4d
   2 │     4    484  rdzmh
   3 │     8    572  PppG5
   4 │     2    412  njGc2
...
julia> intersect([df.site for df in dfvec]...)
4-element Vector{Int64}:
1
4
8
3
julia> [leftjoin(DataFrame(site=intersect([df.site for df in dfvec]...)), df; on=:site) for df in dfvec]
5-element Vector{DataFrame}:
 4×3 DataFrame
 Row │ site   time    other
     │ Int64  Int64?  String?
─────┼────────────────────────
   1 │     1     979  N3b4d
   2 │     4     484  rdzmh
   3 │     8     572  PppG5
   4 │     3     345  5wHyL
...
There might be more efficient solutions.
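One detail: the snippet above intersects on site only. A variant that intersects (site, time) pairs, in case both keys need to match, might look like this (a sketch; the construction of dfvec is illustrative):

```julia
using DataFrames, Random

dfvec = [DataFrame(site = rand(1:4, 12), time = rand(1:3, 12),
                   other = [randstring(5) for _ in 1:12]) for _ in 1:3]

# (site, time) pairs present in every frame
common = intersect([collect(zip(df.site, df.time)) for df in dfvec]...)
keydf  = DataFrame(site = first.(common), time = last.(common))

# Restrict each frame to the common keys (same pattern as above)
filtered = [leftjoin(keydf, df; on = [:site, :time]) for df in dfvec]
```

After this, every element of `filtered` contains exactly the common (site, time) pairs.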
Thanks for the suggestions. Here is how I implemented @bkamins' appealing proposal, assuming data is the array of data frames:
st = [d[:, [:site, :time]] for d in data]
st = innerjoin(st...; on = names(st[1]))
for i in eachindex(data)
data[i] = semijoin(data[i], st; on = names(st))
end
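For anyone trying this out, here is a self-contained sanity check on hypothetical data (the construction of data is purely illustrative; allequal needs Julia ≥ 1.8):

```julia
using DataFrames, Random

# Hypothetical stand-in for the original data: unique (site, time) keys per frame
allkeys = [(s, t) for s in 1:4 for t in 1:4]
data = [begin
            ks = shuffle(allkeys)[1:10]
            DataFrame(site = first.(ks), time = last.(ks),
                      other = [randstring(5) for _ in 1:10])
        end for _ in 1:3]

# The filtering as posted
st = [d[:, [:site, :time]] for d in data]
st = innerjoin(st...; on = names(st[1]))
for i in eachindex(data)
    data[i] = semijoin(data[i], st; on = names(st))
end

# Every frame now contains exactly the same set of (site, time) pairs
@assert allequal(Set(zip(d.site, d.time)) for d in data)
```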
using DataFrames, Random

dfvec = [DataFrame(site = rand(1:5, 20),
                   time = rand(1:5, 20),
                   other = [randstring(5) for _ in 1:20],
                   dfi = id) for id in 1:3]
grp = subset(groupby(vcat(dfvec...), [:site, :time]),
             :dfi => g -> length(unique(g)) == length(dfvec), ungroup=true)
dfg = groupby(grp, :dfi)
# the following expression is not strictly necessary, as you can access individual subdataframes via dfg[index]
[DataFrame(g) for g in dfg]
A roughly 6× faster alternative (I'm not sure of the exact factor) on a vector of the following type:
julia> dfvec = [DataFrame(site = rand(1:5*10^2,10^5),
time = rand(1:5*10^2,10^5),
other = [randstring(5) for _ in 1:10^5],
) for id in 1:10]
10-element Vector{DataFrame}:
 100000×3 DataFrame
...
function func21(dfvec)
    sort!.(dfvec)   # sort each frame by its columns so rows are ordered by (site, time, ...)
    comm = intersect([zip(df.site, df.time) for df in dfvec]...)
    # for each frame, locate the rows of the common keys via binary search
    [let tw = collect(zip(dfvec[i].site, dfvec[i].time))
         @view dfvec[i][reduce(vcat, searchsorted(tw, c) for c in comm), :]
     end for i in eachindex(dfvec)]
end
PS: I wonder, if searchsorted also worked with zips, without the need to do a collect, would it be faster?
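As far as I know, the answer is no: searchsorted is defined for AbstractVector (it needs random-access indexing), while zip returns a lazy iterator that does not support getindex, so the collect is required. A small demonstration:

```julia
# A lexicographically sorted vector of (site, time) tuples
v = [(1, 10), (1, 20), (2, 5), (3, 1)]

r = searchsorted(v, (2, 5))   # works on a vector; returns the index range 3:3

# zip produces an iterator, not an AbstractVector, so this would throw a MethodError:
# searchsorted(zip([1, 1, 2, 3], [10, 20, 5, 1]), (2, 5))
```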
Thanks for looking into this and proposing alternative solutions. I must admit I have a preference for @bkamins' approach, though.
Below are some details and a benchmark on my original data which is a bit more complicated than first described (unequal number of columns and four key-columns).
julia> map(size, data)
4-element Vector{Tuple{Int64, Int64}}:
(1281165, 10)
(1267162, 10)
(1251500, 210)
(1265222, 66)
julia> @time begin
st = [d[:, [:site, :time, :time_ref, :lt]] for d in data]
st = innerjoin(st...; on = names(st[1]))
for i in eachindex(data)
data[i] = semijoin(data[i], st; on = names(st))
end
end
7.062781 seconds (67.73 k allocations: 4.678 GiB, 5.83% gc time, 0.59% compilation time)
Totally okay with using the DataFrames functions.
I just tried to figure out some algorithm that would approach the performance of DataFrames.
But, as experienced on other occasions, it is not easy to compete with DataFrames even in specific situations where the particularities of the case can potentially be exploited (something that a generalist package cannot always do).
A side comment to this is that while Julia's for loops are efficient, proper use of expert-written functions is preferable.