This question has already been asked several times (here for example How to set a index in Julia's DataFrame? - Stack Overflow), but I still don’t understand why DataFrames.jl doesn’t have an
index_col parameter like Pandas does (pandas.read_csv — pandas 2.0.3 documentation). Is it because of the internal design of the package?
For example, I’m reading medical data from an Excel table into a DataFrame named
df. Each row corresponds to a patient. It would seem logical that I should be able to access a cell (say age) simply by indexing with the patient’s id.
For the moment, the solution I’ve come up with is to do
patient_row = filter(row -> row.patient_id == patient_id, df)
which feels incredibly complicated for such a simple operation.
Maybe it’s just me who doesn’t know the right way to do it, in which case I’ll be grateful to anyone who does!
You can simply write
df[df.patient_id .== 1234, :age]
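A minimal runnable sketch of this pattern (the table below is made up for illustration; it is not the poster’s data):

```julia
using DataFrames

# toy data standing in for the patient table
df = DataFrame(patient_id = [1234, 5678], age = [42, 57])

ages = df[df.patient_id .== 1234, :age]   # 1-element Vector{Int64}
age = only(ages)                          # 42; errors unless exactly one match
```

`only` (from Base) throws if the condition matched zero or several rows, which is usually what you want when looking up a unique id.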
I have wanted a simpler way to do this in the past too. The short versions are somewhat hard to discover on your own.
==(value) is shorthand for x -> x == value.
The . syntax retrieves the column you are interested in from the filtered data frame; df.age is equivalent to df[!, :age].
only checks that exactly one value was returned and returns it. This is needed because you might have duplicate data or have passed a function like <(value) that would return multiple rows.
I like to pipe |> to the only function, but you can also just wrap everything inside it:
julia> df = DataFrame(patient_id = 1:5, age = 26:30)
5×2 DataFrame
 Row │ patient_id  age
     │ Int64       Int64
─────┼───────────────────
   1 │          1     26
   2 │          2     27
   3 │          3     28
   4 │          4     29
   5 │          5     30
julia> value = 3
3

julia> filter(:patient_id => ==(value), df).age |> only
28

julia> subset(df, :patient_id => ByRow(==(value))).age |> only
28
alfaromartino’s solution is shorter, but I dislike having to write
(You will still probably want to pass the output to only.)
You may also prefer the DataFramesMeta.jl syntax.
Here the macro
@rsubset will automatically transform the written code into the
subset call shown above. It eliminates the need to write ByRow.
julia> using DataFramesMeta

julia> @rsubset(df, :patient_id == value).age |> only
28
That would allocate a vector to make the comparison.
I would consider (for nice, readable syntax):
patient_row(df, id) = findfirst(r -> r.patient_id == id, eachrow(df))
# or `only(findall(...))` if one wants to ensure that the id is unique

# produces a DataFrameRow if `cols` is a Vector
get_patient(df, id, cols) = df[patient_row(df, id), cols]
(assuming each patient corresponds to exactly one row)
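A quick usage sketch of these helpers, run against the df from earlier in the thread (patient_id = 1:5, age = 26:30); this example is not part of the original post:

```julia
using DataFrames

df = DataFrame(patient_id = 1:5, age = 26:30)

# find the row index of a patient id, then index with it
patient_row(df, id) = findfirst(r -> r.patient_id == id, eachrow(df))
get_patient(df, id, cols) = df[patient_row(df, id), cols]

age = get_patient(df, 3, :age)    # a scalar (28 with this data)
row = get_patient(df, 3, [:age])  # a DataFrameRow with just the `age` column
```

Passing a single column symbol yields a scalar, while passing a vector of columns yields a DataFrameRow, mirroring the usual DataFrames.jl indexing rules.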
I think that OP wasn’t asking about the most efficient way, but the simplest approach. If you want to keep your code simple, my recommendation is the following:
df = DataFrame(id = 1:5, weight = 56:60, age = 26:30)

df[df.id .== 1, :age]                            # for returning a value/vector
df[df.id .== 1, [:age]]                          # for returning a df
df[in(1:2).(df.id), :age]                        # for selecting multiple ids
df[(df.id .== 1) .&& (df.weight .== 56), :age]   # for multiple conditions
This will always return a vector:
julia> df = DataFrame(id = 1:5, weight = 56:60, age = 26:30);

julia> df[df.id .== 3, :age]
1-element Vector{Int64}:
 28

To return a value, you would need to do one of the following:

julia> df[df.id .== 3, :age] |> only
28

julia> df[findfirst(df.id .== 3), :age]
28
Indeed, row lookup is an important use case. And as was already commented here, it is not easy to design a good API. The biggest issue is that the condition you might want to use could return exactly one row or multiple rows (where 0 rows is a special case of multiple).
I want to add something more user friendly in 1.7 release. But we need to discuss how you would want to achieve this. The discussion is in:
I assume that your use case is a situation where you expect exactly one match, so essentially what you want is a shorter way of writing
only(filter(:id => ==(3), df).age) (or some alternative syntax already available that was mentioned above). Is this correct?
Can you please comment in the linked issue, so that we can move forward with the decisions?
An important use case I often need: if a table has an index (key column), this defines a mapping between it and any other column, which can then be applied to other tables to add columns according to those mappings. A simple and efficient API for that would be nice (if it does not exist already).
What you describe is handled by joins (if I understand your need correctly).
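For example, with DataFrames.jl's leftjoin (the table and column names below are hypothetical, chosen to match the snippet in the next post):

```julia
using DataFrames

df1 = DataFrame(key_col = [1, 2, 3], value_col = ["a", "b", "c"])  # the mapping table
df2 = DataFrame(some_col = [2, 3, 1])                              # the table to enrich

# pull value_col into df2 by matching some_col against key_col
df3 = leftjoin(df2, df1; on = :some_col => :key_col)
```

Keys present in df2 but missing from df1 come out as missing in the joined column, whereas the Dict-based lookup below would throw a KeyError.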
Joins can handle this but quite slowly in my experience. Instead, I normally use:
lookup = Dict(df1.key_col .=> df1.value_col)
df2.values = [lookup[key] for key in df2.some_col]
But I guess one could argue that I should not be using a DataFrame, but a TypedTables.DictTable.
Thank you all for your answers. The solution
df[df.patient_id .== 1234, :age] |> only seems to be the simplest!
Ok, no problem, I’ll go and comment on the git discussion.
In fact, I’m in a fairly simple situation where each row is unique (an id doesn’t appear twice).