This question has already been asked several times (here for example How to set a index in Julia's DataFrame? - Stack Overflow), but I still don’t understand why DataFrames.jl doesn’t have an
index_col parameter like Pandas does (pandas.read_csv — pandas 2.0.3 documentation). Is it because of the internal design of the package?
For example, I’m reading medical data from an Excel table into a DataFrame named
df. Each row corresponds to a patient. It would seem logical that I should be able to access a cell (say age) simply by indexing with the patient’s id.
For the moment, the solution I’ve come up with is to do
patient_row = filter(row -> row.patient_id == patient_id, df)
which feels incredibly complicated for such a simple operation.
Maybe it’s just me who doesn’t know the right way to do it, in which case I’ll be grateful to anyone who does!
You can simply write
df[df.patient_id .== 1234, :age]
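A minimal runnable sketch of this pattern (the table below is made up for illustration; it is not the poster’s data):

```julia
using DataFrames

# toy data standing in for the patient table
df = DataFrame(patient_id = [1234, 5678], age = [42, 57])

ages = df[df.patient_id .== 1234, :age]   # 1-element Vector{Int64}
age = only(ages)                          # 42; errors unless exactly one match
```

`only` (from Base) throws if the condition matched zero or several rows, which is usually what you want when looking up a unique id.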
I have wanted a simpler way to do this in the past too. The short versions are somewhat hard to discover on your own.
==(value) is shorthand for x -> x == value.
The . syntax retrieves the column you are interested in from the filtered data frame; df.age is equivalent to df[!, :age].
only checks that exactly one value was returned and returns it. This is needed because you might have duplicate data or have passed a function like <(value) that would return multiple rows.
I like to pipe |> to the only function, but you can also just wrap everything inside it:
julia> df = DataFrame(patient_id = 1:5, age = 26:30)
5×2 DataFrame
 Row │ patient_id  age
     │ Int64       Int64
─────┼───────────────────
   1 │          1     26
   2 │          2     27
   3 │          3     28
   4 │          4     29
   5 │          5     30
julia> value = 3
3

julia> filter(:patient_id => ==(value), df).age |> only
28

julia> subset(df, :patient_id => ByRow(==(value))).age |> only
28
alfaromartino’s solution is shorter, but I dislike having to write
(You will still probably want to pass the output to only.)
You may also prefer the DataFramesMeta.jl syntax.
Here the macro
@rsubset will automatically transform the written code into the
subset call shown above. It eliminates the need to write ByRow.
julia> using DataFramesMeta

julia> @rsubset(df, :patient_id == value).age |> only
28
That would allocate a vector to make the comparison.
I would consider (for nice, readable syntax):
patient_row(df, id) = findfirst(r -> r.patient_id == id, eachrow(df))
# or `only(findall(...))` if one wants to ensure that the id is unique

# produces a DataFrameRow if `cols` is a Vector
get_patient(df, id, cols) = df[patient_row(df, id), cols]
(assuming each patient corresponds to exactly one row)
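A quick usage sketch of these helpers, run against the df from earlier in the thread (patient_id = 1:5, age = 26:30); this example is not part of the original post:

```julia
using DataFrames

df = DataFrame(patient_id = 1:5, age = 26:30)

# find the row index of a patient id, then index with it
patient_row(df, id) = findfirst(r -> r.patient_id == id, eachrow(df))
get_patient(df, id, cols) = df[patient_row(df, id), cols]

age = get_patient(df, 3, :age)    # a scalar (28 with this data)
row = get_patient(df, 3, [:age])  # a DataFrameRow with just the `age` column
```

Passing a single column symbol yields a scalar, while passing a vector of columns yields a DataFrameRow, mirroring the usual DataFrames.jl indexing rules.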
I think that OP wasn’t asking about the most efficient way, but the simplest approach. If you want to keep your code simple, my recommendation is the following:
df = DataFrame(id = 1:5, weight = 56:60, age = 26:30)

df[df.id .== 1, :age]                            # for returning a value/vector
df[df.id .== 1, [:age]]                          # for returning a df
df[in(1:2).(df.id), :age]                        # for selecting multiple ids
df[(df.id .== 1) .&& (df.weight .== 56), :age]   # for multiple conditions
This will always return a vector:
julia> df = DataFrame(id = 1:5, weight = 56:60, age = 26:30);

julia> df[df.id .== 3, :age]
1-element Vector{Int64}:
 28

To return a value, you would need to do one of the following:

julia> df[df.id .== 3, :age] |> only
28

julia> df[findfirst(df.id .== 3), :age]
28
Indeed, row lookup is an important use case. And as was already commented here, it is not easy to design a good API. The biggest issue is that the condition you might want to use could return exactly one row or multiple rows (where 0 rows is a special case of multiple).
I want to add something more user friendly in 1.7 release. But we need to discuss how you would want to achieve this. The discussion is in:
I assume that your use case is a situation where you expect exactly one match, so essentially what you want is a shorter way of writing
only(filter(:id => ==(3), df).age) (or some alternative syntax already available that was mentioned above). Is this correct?
Can you please comment in the linked issue, so that we can move forward with the decisions?
An important use case I often need: if a table has an index (key column), this defines a mapping between it and any other column, which can then be applied to other tables to add columns according to those mappings. A simple and efficient API for that would be nice (if it does not exist already).
What you describe is handled by joins (if I understand your need correctly).
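For example, with DataFrames.jl's leftjoin (the table and column names below are hypothetical, chosen to match the snippet in the next post):

```julia
using DataFrames

df1 = DataFrame(key_col = [1, 2, 3], value_col = ["a", "b", "c"])  # the mapping table
df2 = DataFrame(some_col = [2, 3, 1])                              # the table to enrich

# pull value_col into df2 by matching some_col against key_col
df3 = leftjoin(df2, df1; on = :some_col => :key_col)
```

Keys present in df2 but missing from df1 come out as missing in the joined column, whereas the Dict-based lookup below would throw a KeyError.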
Joins can handle this but quite slowly in my experience. Instead, I normally use:
lookup = Dict(df1.key_col .=> df1.value_col)
df2.values = [lookup[key] for key in df2.some_col]
But I guess one could argue that I should not be using a DataFrame, but a TypedTables.DictTable.
Thank you all for your answers. The solution
df[df.patient_id .== 1234, :age] |> only seems to be the simplest!
Ok, no problem, I’ll go and comment on the git discussion.
In fact, I’m in a fairly simple situation where each row is unique (an id doesn’t appear twice).