Why is it so complicated to access a row in a DataFrame?

Hello,
This question has already been asked several times (here for example How to set a index in Julia's DataFrame? - Stack Overflow) but I still don’t understand why DataFrame.jl doesn’t have an index_col parameter like there is in Pandas (pandas.read_csv — pandas 2.0.3 documentation). Is it because of the internal design of the package?

For example, I’m reading medical data from an Excel table stored in a DataFrame named df. Each row corresponds to a patient. It would seem logical that I should be able to access a cell (say age) simply by typing df[patient_id, age].

For the moment, the solution I’ve come up with is to do

patient_row = filter(row -> row.patient_id == patient_id, df)
patient_rowt[!, ["age"]]

Which is incredibly complicated for this simple operation.
Maybe it’s just me who doesn’t know the right way to do it, in which case I’ll be grateful to anyone who does!

Thanks !
fdekerm

1 Like

You can simply right
df[df.patient_id .== 1234, :age]

2 Likes

I have wanted a simpler way to do this in the past too. The short versions are somewhat hard to discover on your own.

  • ==(value) is a shorthand for x -> x == value.
  • . syntax retrieves the column you are interested in from the filtered data frame.
    (df.age is equivalent to df[!, :age].)
  • only checks that there is only one value returned and returns it.
    This is needed because you might have duplicate data or have passed a function like <(value) that would return multiple rows. I like to pipe |> to the only function, but you can also just wrap everything inside it only(filter(...)).
julia> df = DataFrame(patient_id = 1:5, age = 26:30)
5×2 DataFrame
 Row │ patient_id  age
     │ Int64       Int64
─────┼───────────────────
   1 │          1     26
   2 │          2     27
   3 │          3     28
   4 │          4     29
   5 │          5     30

julia> value = 3
3

julia> filter(:patient_id => ==(value), df).age |> only
28

julia> subset(df, :patient_id => ByRow(==(value))).age |> only
28

alfaromartino’s solution is shorter, but I dislike having to write df twice.
(You will still probably want to pass the output to only.)

You may also prefer the DataFramesMeta.jl syntax.
Here the macro @rsubset will automatically transform the written code into the subset call shown above. It eliminates the need to write => and ByRow explicitly.

julia> using DataFramesMeta

julia> @rsubset(df, :patient_id == value).age |> only
28
2 Likes

That would allocate a vector to make the comparison.

I would consider (for nice, readable syntax):

patient_row(df, id) = findfirst(r -> r.patient_id == id, eachrow(df))
# or `only(findall())` if one wants to ensure that the id is unique

# produes a DataFrameRow if `cols` is Vector
get_patient(df, id, cols) = df[patient_row(df, id), cols] 

(assuming each patient corresponds to exactly one row)

3 posts were split to a new topic: Performance of eachrow(::DataFrame)

I think that OP wasn’t asking about the most efficient way, but the simplest approach. If you want to keep your code simple, my recommendation is the following:

df = DataFrame(id = 1:5, weight= 56:60, age = 26:30)


df[df.id .== 1, :age] # for returning a value/vector

df[df.id .== 1, [:age]]  # for returning a df

df[in(1:2).(df.id), :age] # for selecting multiple ids

df[(df.id .== 1) .&& (df.weight .==56), :age] # for multiple conditions
3 Likes

This will always return a vector:

julia> df = DataFrame(id = 1:5, weight= 56:60, age = 26:30);

julia> df[df.id .== 3, :age]
1-element Vector{Int64}:
 28

To return a value, you would need to do one of the following:

julia> df[df.id .== 3, :age] |> only
28

julia> df[findfirst(df.id .== 3), :age]
28
2 Likes

Indeed row-lookup is an important use case. And as it was already commented here it is not easy to design a good API. The biggest issue is that the condition you might want to use could return exactly one row, or multiple rows (where 0 rows is a special case of multiple).

I want to add something more user friendly in 1.7 release. But we need to discuss how you would want to achieve this. The discussion is in:

I assume that your use case is for situation that you expect an exactly one match, so essentially, what you want is a shorter way of writing only(filter(:id => ==(3), df).age)) (or some alternative syntax already available that was mentioned above). Is this correct?

Can you please comment in the linked issue, so that we can move forward with the decisions?

5 Likes

An important use case I often need: if a table has an index (key column), this defines a mapping between it and any other column, which can then be applied to other tables to add columns according to those mappings. A simple and efficient API for that would be nice (if it does not exist already).

What you describe is handled by joins (if I understand your need correctly).

Joins can handle this but quite slowly in my experience. Instead, I normally use:

lookup = Dict(df1.key_col .=> df1.value_col)
df2.values = [lookup[key] for key in df2.some_col]

But I guess one could argue that I should not be using a DataFrame, but a TypedTables.DictTable.

Thank you all for your answers. The solution df[df.patient_id .== 1234, :age] |> only seems to be the simplest !

Ok, no problem, I’ll go and comment on the git discussion.

In fact, I’m in a fairly simple situation where each line is unique (an id doesn’t appear twice).


d=Dict(zip(df.id,copy.(eachrow(df[:,2:end]))))

d1=Dict(zip(df.id,Tables.namedtupleiterator(df[:,2:end])))

d[3].age

d1[3].age



d1=Dict(zip(df.id,Tables.namedtupleiterator(df[:,Not(:id)])))
d1[3].age