A cool feature of Pandas is the ability to set an index on a table.
Thereafter you just specify table name and index value you want to use.
Combined with Julia tools like @rtransform a DataFrame cell could be specified with just Index_Value:ColumnName
. Something like:
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
@chain df begin
@setindex :name
@rtransform :Age_relative_to_Sally = :age - Sally:age
end
The use case you propose is currently supported by groupby
, so you would write:
julia> @chain df begin
groupby(:name)
@rtransform(_, :Age_relative_to_Sally = (:age - only(_[("Sally",)].age)))
end
3Γ4 DataFrame
Row β name age children Age_relative_to_Sally
β String Float64 Int64 Float64
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β John 23.0 3 -19.0
2 β Sally 42.0 5 0.0
3 β Kirk 59.0 2 17.0
Indeed, it is slightly verbose, so maybe @pdeffebach would consider adding support for something shorter than only(_[("Sally",)].age))
or _[("Sally",)].age)
(if unique value is not expected and user wants to work on a vector). Though IMO it is not extremely verbose.
The problem with index is that it is hard to propagate it (note e.g. the constant need to reindex
in Pandas; note that for this reason many other ecosystems - like in R both tidyverse and data.table do not support row index). Also it is in general such expressions as you have originally written are not clear if index is guaranteed to be unique or not.
Having said that, I recognize that many users like row indices - if someone proposed a consistent set of rules that would add significant value over using groupby
then we could consider adding it (it would be best if such a proposal were opened as an issue in DataFrames.jl).
Side note: you can relatively easily do your operation without setting an index (i.e. without groupby
), e.g.:
julia> @chain df begin
@rtransform(_, :Age_relative_to_Sally = (:age - only(_.age[_.name .== "Sally"])))
end
3Γ4 DataFrame
Row β name age children Age_relative_to_Sally
β String Float64 Int64 Float64
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β John 23.0 3 -19.0
2 β Sally 42.0 5 0.0
3 β Kirk 59.0 2 17.0
6 Likes
Also maybe interested users can comment in Make row lookup easier Β· Issue #3051 Β· JuliaData/DataFrames.jl Β· GitHub as there I proposed a simpler syntax for this.
does something like this come close to your expectations?
df.Age_relative_to_Sally = df.age .- df[df.name .== "Sally",:].age[1]
That works but youβre re-stating the DataFrame name 4 times.
The nice thing about @rtransform is you just specify the column names. e.g :col3 = :col1 + :col2.
I just thought it would be lovely if this could extend to row names (by specified index). eg RowIndex:ColName.
BogumiΕ, thanks for the link. Iβll put an example in the discussion. May be of use.
The problem here, and why itβs unlikely to be implemented in DataFramesMeta, is that any off-the-shelf solution is likely to be quite slow. Even if we implemented convenient syntax for df[df.name .== "Sally", :]
, that would be very slow, since every time you call that expression, it has to find the name equal to "Sally"
and it isnβt cached.
Iβm hesitant to introduce something in DataFramesMeta which is going to always be slow.
I believe there is a package for which x .== "A string"
is fast due to some internal caching. Maybe someone can work out a solution which integrates that package with DataFrames a bit more. It would be an interesting experiment.
IndexedTables.jl
can be useful for this usecase. Even more generally, it supports multiple indexing levels.
Is there a vector type that has fast lookup we can spit out into itβs own package?
Yes, one of the reasons we havenβt implemented anything special in DataFrames is that itβs possible to get an efficient index column using findfirst
on an AcceleratedArray
column (e.g. df.age[findfirst(==("Sally"), df.name)])
. We could make this easier to discover and use though, by documenting it and providing a shorter syntax for this. Indeed whatβs tricky for users in the context of @rtransform
is that one has to access the whole vector for the lookup, which defeats the convenience of this macro. If the name is a literal, the lookup could ideally happen only once for all rows, making it much more efficient.
this is a version with in the name of the dataframe repeated only twice.
Although two more are needed to define the index.
the=Dict(zip(df.name,Tables.rowtable(df)))
df.Age_relative_to_Sally = df.age .- the["Sally"].age
df.Age_relative_to_Sally = df.age .- the["Sally"][:age]
Thatβs pretty nice. but the βtheβ object loses the other DataFrame functionality.
Here is another example.
I have a table of FX Forward rates Iβm using to bootstrap discount factors. Columns are (Currency Tenor Rate). Tenor values are: ON TN SPOT 1W 2W 1M 3M 6M 1Y 2Y.
The bootstrap is done separately per currency so first I do @groupby :Currency
Iβd like to then do @setindex :Tenor
And, using IndexValue:Column
notation, the formulas would be
:Fwd = SPOT:Rate + :Rate
SPOT:Fwd = SPOT:Rate
TN:Fwd = SPOT:Rate - TN:Rate
ON:Fwd = TN:Rate - ON:Rate
TodayRate = ON:Fwd
:Discount = :Fwd / TodayRate * USD_Discount(:Tenor)
Since this doesnβt exist Iβve converted the columns to named arrays (with names = Tenor).
The formulas become
Fwd = Rate[:SPOT] + Rate
Fwd[:SPOT] = Rate[:SPOT]
Fwd[:TN] = Rate[:SPOT] - Rate[:TN]
Fwd[:ON] = Rate[:TN] - Rate[:ON]
TodayRate = Fwd[:ON]
Discount = Fwd / TodayRate * USD_Discount(:Tenor)
Nammed arrays are good for this since you can do both dictionary style lookups and vector operations.
Iβd just prefer to do it all in DataFrames, without loosing the readability of the formulas.
1 Like
The test structure that came to my mind earlier was the following, based on namedtuples.
Then I thought that a dictionary would be preferable to do research.
the=(;zip(Symbol.(df.name),Tables.rowtable(df))...)
df.Age_relative_to_Sally = df.age .- the.Sally.age
df.Age_relative_to_Sally = df.age .- the[:Sally].age
df.Age_relative_to_Sally = df.age .- the[:Sally][:age]
The first construction seems to have the characteristics you say.
But I have no idea how many contraindications it may have.
julia> dfthe=DataFrame(the)
3Γ3 DataFrame
Row β name age children
β String Float64 Int64
ββββββΌβββββββββββββββββββββββββββ
1 β John 23.0 3
2 β Sally 42.0 5
3 β Kirk 59.0 2
julia> dfthe.age_rel_Sally=dfthe.age .- the.Sally.age
3-element Vector{Float64}:
-19.0
0.0
17.0
julia> dfthe
3Γ4 DataFrame
Row β name age children age_rel_Sally
β String Float64 Int64 Float64
ββββββΌββββββββββββββββββββββββββββββββββββββββββ
1 β John 23.0 3 -19.0
2 β Sally 42.0 5 0.0
3 β Kirk 59.0 2 17.0
Note that although the data frames are defined on different structures, they are the same
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
dfthe=DataFrame((;zip(Symbol.(df.name),Tables.rowtable(df))...))
julia> dfthe==df
true
Thatβs pretty good. Using that could write.
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
@chain df begin
@aside x = (;zip(Symbol.(_.name),Tables.rowtable(_))...)
@rtransform :Age_relative_to_Sally = :age - x.Sally.age
end
But to access a cell from a newly created df column youβd have to re-create the named tuple structure.
Canβt there be a single data structure that allows cell operations (via index) and column operations. ?
If trying to boostrap rate curves, check out Yields.jl and let me know if you have any feedback!
Hi Milan
If I understand you correctly, this is exactly what I was hoping for.
So AcceleratedArray provides efficient index lookups as you show in the example with findfirst().
I was considering the case where the index column; has unique values, is specified in advance and the lookup value is specified as a string literal e.g. βSallyβ.
So within @rtransform
the system could check for any row lookups first.
Something like Sally:age
would be identified and the cell value (42) cached.
This would then be applied within each row level calculation.
If something like this were possible that would be amazing 
I have a dream of a space where you can specify a data-structure once (column names, indexes, joins, etc) Then within a block you just need column, row or cell names. You donβt need to repeat the dataset name or any of the structure information. MDX supports this but isnβt a general purpose language.
Within a general purpose language, the @chain
block with @rtransform @rsubset
etc are the closest Iβve seen.