Is it possible to join two dataframes so that the result is a view of the original two?

In Julia, is it possible to join two dataframes so that the joined dataframe is essentially a “view” of the original two?
For example,

  julia> using DataFrames

  julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
  3×2 DataFrame
   Row │ ID     Name
       │ Int64  String
  ─────┼──────────────────
     1 │     1  John Doe
     2 │     2  Jane Doe
     3 │     3  Joe Blogs
  julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
  3×2 DataFrame
   Row │ ID     Job
       │ Int64  String
  ─────┼───────────────
     1 │     1  Lawyer
     2 │     2  Doctor
     3 │     4  Farmer
  
  julia> df_join = leftjoin(name, job, on = :ID)
  3×3 DataFrame
   Row │ ID     Name       Job
       │ Int64  String     String?
  ─────┼───────────────────────────
     1 │     1  John Doe   Lawyer
     2 │     2  Jane Doe   Doctor
     3 │     3  Joe Blogs  missing

such that mutating df_join also mutates name and job:

julia> df_join.Name[1] = "Foo Bar";
julia> df_join.ID[1] = 999;
julia> name.ID[1]
999
julia> job.ID[1]
999
julia> name.Name[1]
"Foo Bar"

name, job, and df_join each have their own Vector{Int64} for the ID column, so they won’t share mutations. Even if leftjoin were able to make name and df_join share the ID vector, job still couldn’t share it, because you provided a separate vector with different values in the first place.
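To see the lack of sharing concretely, here is a minimal sketch that reuses the tables from the question. It assumes the DataFrames package is installed; the checks only rely on leftjoin allocating new columns, not on any particular row order.

```julia
using DataFrames  # third-party package, assumed to be installed

name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
df_join = leftjoin(name, job, on = :ID)

# The joined table owns its own ID vector; it is not the same object as name.ID.
shares_storage = df_join.ID === name.ID   # false

# Mutating the joined table leaves both source tables untouched.
df_join.ID[1] = 999
name.ID   # still [1, 2, 3]
job.ID    # still [1, 2, 4]
```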

You need to change the element types to pull off what you want. For example, you could build name’s ID column with Ref.([1, 2, 3]) and very carefully reuse those same elements in every subsequent Vector{Ref{Int64}} (concretely, Ref.([1, 2, 3]) gives a Vector{Base.RefValue{Int64}}). The vectors themselves still won’t be shared, so you don’t mutate those; you mutate the Ref element that they have in common.
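A rough sketch of the Ref idea with plain vectors standing in for the ID columns. The names name_id, job_id, and join_id are made up for illustration, and the "join" is assembled by hand here; nothing in this sketch claims that leftjoin would preserve the Ref objects for you.

```julia
# Hypothetical stand-ins for the ID columns of name, job, and a joined table.
ids = Ref.([1, 2, 3])                 # one Ref object per ID

name_id = ids                         # name's ID column
job_id  = [ids[1], ids[2], Ref(4)]    # reuses the same Ref objects for IDs 1 and 2
join_id = [ids[1], ids[2], ids[3]]    # a separate vector holding the same Ref objects

# Mutate the Ref element, not the vector slot.
join_id[1][] = 999

name_id[1][]   # 999
job_id[1][]    # 999
```

Note that writing join_id[1] = Ref(999) instead would only replace the slot in join_id and break the link to the other two vectors.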

I don’t recommend doing this everywhere, because linking elements like this gets really messy. You could change a value so that it conflicts with another row, and that’s much easier to do by accident when you’re mutating something in several tables at once. If you’re just mutating Ref instances, separate instances can end up holding the same value, and they print identically, so you can’t tell from the output whether one Ref(999) is the same object as another Ref(999). I guess you could check a cache and replace the vector’s element with an existing Ref for that value instead of mutating the reference every time, but that’s complicated to pull off and track too, and irreversibly decreasing the number of unique elements like that may not even be desirable. Making many Ref elements also hurts performance in general because of the lack of data locality in memory.
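As a small illustration of the aliasing problem, here are two independently created Ref instances that hold the same value. The variable names are hypothetical.

```julia
a = Ref(999)
b = Ref(999)

same_value  = a[] == b[]   # true: both hold 999
same_object = a === b      # false: they are distinct objects

# Mutating one does not affect the other, even though both printed the same before.
b[] = 1000
a[]   # still 999
```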
