Can I index a DataFrame using a String key?

Hi all!

So I’m writing some code where I need to access some properties of some machine registers. I want to store both the value each register is initialized to and its binary weight. Using a DataFrame and the count_ones function, I could produce this:

8×3 DataFrame
 Row │ Register  Value      Weight 
     │ Any       Any        Any    
─────┼─────────────────────────────
   1 │ r0        68         2
   2 │ r1        0          0
   3 │ r2        136        2
   4 │ r3        51         4
   5 │ r4        170        4
   6 │ r5        0          0
   7 │ r7        -1         1
   8 │ r8        134217983  9

However, I’d now like to access the data I just populated, hopefully with something like

weight = df["r2", :Weight]

but IIUC, DataFrames don’t allow that. Is there a way I can do this? I know Dicts can be indexed using a String, but in that case I wouldn’t be able to store multiple data (columns) for each register, right? Is there a way I can access a table by indexing it with Strings?

Thanks!

# Get a Boolean vector that's true where the string matches
r2_indices = df[:, :Register] .== "r2"
# Use that to index the DataFrame
df[r2_indices, :Weight]

I’m sure there is a cleaner syntax for this with select(), but this should get the job done!
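For reference, here is a minimal self-contained sketch of the mask approach. The table is a small stand-in built from the first four registers in the question (with typed columns rather than `Any`), and the use of `only` to unwrap the single match is my own addition, not something the posts above prescribe.

```julia
using DataFrames

# Stand-in table: the first four registers from the question.
df = DataFrame(Register = ["r0", "r1", "r2", "r3"],
               Value    = [68, 0, 136, 51])
df.Weight = count_ones.(df.Value)

# Boolean mask: true on the rows whose Register equals "r2".
mask = df.Register .== "r2"

# Indexing with the mask returns a Vector, even when only one row matches.
weights = df[mask, :Weight]

# `only` unwraps a one-element collection and throws if the key
# is missing or appears more than once.
weight = only(weights)
```

Note that `df[mask, :Weight]` on its own gives a one-element vector, so some unwrapping step is needed if a scalar is wanted.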

r2_df = groupby(df, :Register)["r2"]
select(r2_df, :Weight)

Got it!

If you want a single lookup, then a Boolean mask is the standard approach. If you want multiple lookups, then use the groupby approach. One small comment: you need to write:

groupby(df, :Register)[("r2",)]
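A short sketch of the corrected groupby lookup, again on a stand-in table with the first four registers from the question. The variable names and the final `only` call are illustrative assumptions.

```julia
using DataFrames

# Stand-in table: the first four registers from the question.
df = DataFrame(Register = ["r0", "r1", "r2", "r3"],
               Value    = [68, 0, 136, 51],
               Weight   = [2, 0, 2, 4])

gdf = groupby(df, :Register)

# A group is looked up by a Tuple of key values, one per grouping
# column, hence the one-element tuple ("r2",).
r2_rows = gdf[("r2",)]          # SubDataFrame holding the matching rows

# Pull the scalar out of the single-row group.
weight = only(r2_rows.Weight)
```

The key is a tuple because a GroupedDataFrame may be grouped on several columns at once; with a single grouping column the tuple simply has one element.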

Thank you for the catch! You’ve been cleaning up my messes all morning. :rofl:

Thanks :smiley:! But I don’t understand, @bkamins: do you mean that groupby has a heavier initial overhead, so if I need only a single lookup the Boolean vector approach is faster? Otherwise I don’t see why the vector would be “simpler”, if that’s what you meant.

This is what I meant:

  1. if you use groupby, more work is done initially to allow for O(1) queries later (and there can be many).
  2. if you use a bitmask, each query costs O(n), but a single such query is cheaper than doing groupby.
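The two points above can be sketched side by side. `lookup_mask` and `lookup_grouped` are hypothetical helper names introduced only for this illustration, and the table is a stand-in with the first four registers from the question.

```julia
using DataFrames

df = DataFrame(Register = ["r0", "r1", "r2", "r3"],
               Value    = [68, 0, 136, 51],
               Weight   = [2, 0, 2, 4])

# Hypothetical helper: one-off lookup that scans the whole
# Register column on every call (O(n) per query).
lookup_mask(df, reg) = only(df[df.Register .== reg, :Weight])

# Hypothetical helper: reuses a GroupedDataFrame built once up front.
lookup_grouped(gdf, reg) = only(gdf[(reg,)].Weight)

gdf = groupby(df, :Register)    # pay the grouping cost once

queries = ["r3", "r0", "r2", "r3"]
from_mask    = [lookup_mask(df, q) for q in queries]
from_grouped = [lookup_grouped(gdf, q) for q in queries]
```

Both vectors hold the same weights; the difference is only in where the work happens. Which one wins in practice depends on the table size and the number of queries, so it is worth timing on real data.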

Thanks! As I thought then.

As a side note, I found out a Dict with a NamedTuple also works:

julia> reg_dict = Dict();

julia> reg_dict["r0"] = (value = 0x44, weight = count_ones(0x44))
(value = 0x44, weight = 2)

julia> reg_dict["r1"] = (value = 0, weight = count_ones(0))
(value = 0, weight = 0)

julia> reg_dict["r2"] = (value = 0x88, weight = count_ones(0x88))
(value = 0x88, weight = 2)

julia> reg_dict["r3"] = (value = 0x33, weight = count_ones(0x33))
(value = 0x33, weight = 4)

julia> reg_dict
Dict{Any, Any} with 4 entries:
  "r1" => (value = 0, weight = 0)
  "r2" => (value = 0x88, weight = 2)
  "r0" => (value = 0x44, weight = 2)
  "r3" => (value = 0x33, weight = 4)

julia> reg_dict["r3"].value
0x33

julia> reg_dict["r3"].weight
4

But it gets exported to CSV with the tuple converted to a String, and moreover I suspect the performance is close, if not equal, to that of the grouped DataFrame, thus making the latter preferable. Is this correct?
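On the CSV point, one possible workaround (a sketch, not a performance claim) is to keep the Dict for lookups and flatten it into a DataFrame only when exporting. The concrete key and value types on the Dict, and the `rows` and `reg_df` names, are assumptions added for this example.

```julia
using DataFrames

# Typed Dict instead of Dict{Any, Any}: String keys, NamedTuple values.
reg_dict = Dict{String, NamedTuple{(:value, :weight), Tuple{Int, Int}}}()
for (name, value) in ("r0" => 0x44, "r1" => 0, "r2" => 0x88, "r3" => 0x33)
    reg_dict[name] = (value = Int(value), weight = count_ones(value))
end

# Flatten to one row per register so each field becomes its own column.
rows = [(register = name, value = info.value, weight = info.weight)
        for (name, info) in reg_dict]

# Dict iteration order is unspecified, so sort for a stable layout.
reg_df = sort!(DataFrame(rows), :register)
```

A table shaped like `reg_df` has separate `register`, `value`, and `weight` columns, so writing it out (for instance with `CSV.write` from CSV.jl) should give one field per column instead of a stringified tuple.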