For example, I’m reading medical data from an Excel table stored in a DataFrame named df. Each row corresponds to a patient. It would seem logical that I should be able to access a cell (say age) simply by typing df[patient_id, age].
For the moment, the solution I’ve come up with is to do
Which is incredibly complicated for this simple operation.
Maybe it’s just me who doesn’t know the right way to do it, in which case I’ll be grateful to anyone who does!
I have wanted a simpler way to do this in the past too. The short versions are somewhat hard to discover on your own.
==(value) is a shorthand for x -> x == value.
. syntax retrieves the column you are interested in from the filtered data frame.
(df.age is equivalent to df[!, :age].)
only checks that there is only one value returned and returns it.
This is needed because you might have duplicate data or have passed a function like <(value) that would return multiple rows. I like to pipe |> to the only function, but you can also just wrap everything inside it only(filter(...)).
alfaromartino’s solution is shorter, but I dislike having to write df twice.
(You will still probably want to pass the output to only.)
You may also prefer the DataFramesMeta.jl syntax.
Here the macro @rsubset will automatically transform the written code into the subset call shown above. It eliminates the need to write => and ByRow explicitly.
julia> using DataFramesMeta
julia> @rsubset(df, :patient_id == value).age |> only
28
That would allocate a vector to make the comparison.
I would consider (for nice, readable syntax):
patient_row(df, id) = findfirst(r -> r.patient_id == id, eachrow(df))
# or `only(findall())` if one wants to ensure that the id is unique
# produes a DataFrameRow if `cols` is Vector
get_patient(df, id, cols) = df[patient_row(df, id), cols]
(assuming each patient corresponds to exactly one row)
I think that OP wasn’t asking about the most efficient way, but the simplest approach. If you want to keep your code simple, my recommendation is the following:
df = DataFrame(id = 1:5, weight= 56:60, age = 26:30)
df[df.id .== 1, :age] # for returning a value/vector
df[df.id .== 1, [:age]] # for returning a df
df[in(1:2).(df.id), :age] # for selecting multiple ids
df[(df.id .== 1) .&& (df.weight .==56), :age] # for multiple conditions
Indeed row-lookup is an important use case. And as it was already commented here it is not easy to design a good API. The biggest issue is that the condition you might want to use could return exactly one row, or multiple rows (where 0 rows is a special case of multiple).
I want to add something more user friendly in 1.7 release. But we need to discuss how you would want to achieve this. The discussion is in:
I assume that your use case is for situation that you expect an exactly one match, so essentially, what you want is a shorter way of writing only(filter(:id => ==(3), df).age)) (or some alternative syntax already available that was mentioned above). Is this correct?
Can you please comment in the linked issue, so that we can move forward with the decisions?
An important use case I often need: if a table has an index (key column), this defines a mapping between it and any other column, which can then be applied to other tables to add columns according to those mappings. A simple and efficient API for that would be nice (if it does not exist already).