Any plan for functionality like Pandas loc?

A cool feature of Pandas is the ability to set an index on a table.
Thereafter you just specify table name and index value you want to use.

Combined with Julia tools like @rtransform a DataFrame cell could be specified with just Index_Value:ColumnName. Something like:

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])


@chain df begin

    @setindex        :name


    @rtransform     :Age_relative_to_Sally   =   :age  -  Sally:age

end

The use case you propose is currently supported by groupby, so you would write:

julia> @chain df begin
           groupby(:name)
           @rtransform(_, :Age_relative_to_Sally = (:age - only(_[("Sally",)].age)))
       end
3Γ—4 DataFrame
 Row β”‚ name    age      children  Age_relative_to_Sally
     β”‚ String  Float64  Int64     Float64
─────┼──────────────────────────────────────────────────
   1 β”‚ John       23.0         3                  -19.0
   2 β”‚ Sally      42.0         5                    0.0
   3 β”‚ Kirk       59.0         2                   17.0

Indeed, it is slightly verbose, so maybe @pdeffebach would consider adding support for something shorter than only(_[("Sally",)].age)) or _[("Sally",)].age) (if unique value is not expected and user wants to work on a vector). Though IMO it is not extremely verbose.

The problem with index is that it is hard to propagate it (note e.g. the constant need to reindex in Pandas; note that for this reason many other ecosystems - like in R both tidyverse and data.table do not support row index). Also it is in general such expressions as you have originally written are not clear if index is guaranteed to be unique or not.

Having said that, I recognize that many users like row indices - if someone proposed a consistent set of rules that would add significant value over using groupby then we could consider adding it (it would be best if such a proposal were opened as an issue in DataFrames.jl).

Side note: you can relatively easily do your operation without setting an index (i.e. without groupby), e.g.:

julia> @chain df begin
           @rtransform(_, :Age_relative_to_Sally = (:age - only(_.age[_.name .== "Sally"])))
       end
3Γ—4 DataFrame
 Row β”‚ name    age      children  Age_relative_to_Sally
     β”‚ String  Float64  Int64     Float64
─────┼──────────────────────────────────────────────────
   1 β”‚ John       23.0         3                  -19.0
   2 β”‚ Sally      42.0         5                    0.0
   3 β”‚ Kirk       59.0         2                   17.0
6 Likes

Also maybe interested users can comment in Make GroupedDataFrame callable Β· Issue #3051 Β· JuliaData/DataFrames.jl Β· GitHub as there I proposed a simpler syntax for this.

does something like this come close to your expectations?

df.Age_relative_to_Sally   =   df.age  .-  df[df.name .== "Sally",:].age[1]

That works but you’re re-stating the DataFrame name 4 times.
The nice thing about @rtransform is you just specify the column names. e.g :col3 = :col1 + :col2.
I just thought it would be lovely if this could extend to row names (by specified index). eg RowIndex:ColName.

BogumiΕ‚, thanks for the link. I’ll put an example in the discussion. May be of use.

The problem here, and why it’s unlikely to be implemented in DataFramesMeta, is that any off-the-shelf solution is likely to be quite slow. Even if we implemented convenient syntax for df[df.name .== "Sally", :], that would be very slow, since every time you call that expression, it has to find the name equal to "Sally" and it isn’t cached.

I’m hesitant to introduce something in DataFramesMeta which is going to always be slow.

I believe there is a package for which x .== "A string" is fast due to some internal caching. Maybe someone can work out a solution which integrates that package with DataFrames a bit more. It would be an interesting experiment.

IndexedTables.jl can be useful for this usecase. Even more generally, it supports multiple indexing levels.

Is there a vector type that has fast lookup we can spit out into it’s own package?

There is a package with exactly accelerated/indexed vectors: GitHub - andyferris/AcceleratedArrays.jl: Arrays with acceleration indices.

Yes, one of the reasons we haven’t implemented anything special in DataFrames is that it’s possible to get an efficient index column using findfirst on an AcceleratedArray column (e.g. df.age[findfirst(==("Sally"), df.name)]). We could make this easier to discover and use though, by documenting it and providing a shorter syntax for this. Indeed what’s tricky for users in the context of @rtransform is that one has to access the whole vector for the lookup, which defeats the convenience of this macro. If the name is a literal, the lookup could ideally happen only once for all rows, making it much more efficient.

this is a version with in the name of the dataframe repeated only twice.
Although two more are needed to define the index.

the=Dict(zip(df.name,Tables.rowtable(df)))

df.Age_relative_to_Sally   =   df.age  .-  the["Sally"].age
df.Age_relative_to_Sally   =   df.age  .-  the["Sally"][:age]

That’s pretty nice. but the β€œthe” object loses the other DataFrame functionality.

Here is another example.
I have a table of FX Forward rates I’m using to bootstrap discount factors. Columns are (Currency Tenor Rate). Tenor values are: ON TN SPOT 1W 2W 1M 3M 6M 1Y 2Y.
The bootstrap is done separately per currency so first I do @groupby :Currency

I’d like to then do @setindex :Tenor
And, using IndexValue:Column notation, the formulas would be

:Fwd        =  SPOT:Rate  +  :Rate
SPOT:Fwd    =  SPOT:Rate
TN:Fwd      =  SPOT:Rate  -  TN:Rate
ON:Fwd      =  TN:Rate    -  ON:Rate
TodayRate   =  ON:Fwd
:Discount   =  :Fwd  /  TodayRate  *  USD_Discount(:Tenor)

Since this doesn’t exist I’ve converted the columns to named arrays (with names = Tenor).
The formulas become

Fwd          =  Rate[:SPOT]  + Rate
Fwd[:SPOT]   =  Rate[:SPOT]
Fwd[:TN]     =  Rate[:SPOT]  - Rate[:TN]
Fwd[:ON]     =  Rate[:TN]    - Rate[:ON]
TodayRate    =  Fwd[:ON]
Discount     =  Fwd  /  TodayRate * USD_Discount(:Tenor)

Nammed arrays are good for this since you can do both dictionary style lookups and vector operations.

I’d just prefer to do it all in DataFrames, without loosing the readability of the formulas.

1 Like

The test structure that came to my mind earlier was the following, based on namedtuples.
Then I thought that a dictionary would be preferable to do research.


the=(;zip(Symbol.(df.name),Tables.rowtable(df))...)


df.Age_relative_to_Sally   =   df.age  .-  the.Sally.age
df.Age_relative_to_Sally   =   df.age  .-  the[:Sally].age
df.Age_relative_to_Sally   =   df.age  .-  the[:Sally][:age]

The first construction seems to have the characteristics you say.
But I have no idea how many contraindications it may have.

julia> dfthe=DataFrame(the)
3Γ—3 DataFrame
 Row β”‚ name    age      children 
     β”‚ String  Float64  Int64
─────┼───────────────────────────
   1 β”‚ John       23.0         3
   2 β”‚ Sally      42.0         5
   3 β”‚ Kirk       59.0         2


julia> dfthe.age_rel_Sally=dfthe.age .- the.Sally.age
3-element Vector{Float64}:
 -19.0
   0.0
  17.0

julia> dfthe
3Γ—4 DataFrame
 Row β”‚ name    age      children  age_rel_Sally 
     β”‚ String  Float64  Int64     Float64
─────┼──────────────────────────────────────────
   1 β”‚ John       23.0         3          -19.0
   2 β”‚ Sally      42.0         5            0.0
   3 β”‚ Kirk       59.0         2           17.0

Note that although the data frames are defined on different structures, they are the same


df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
dfthe=DataFrame((;zip(Symbol.(df.name),Tables.rowtable(df))...))

julia> dfthe==df
true

That’s pretty good. Using that could write.

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

@chain df begin

    @aside x = (;zip(Symbol.(_.name),Tables.rowtable(_))...)
    
    @rtransform  :Age_relative_to_Sally    =     :age    -    x.Sally.age 

end

But to access a cell from a newly created df column you’d have to re-create the named tuple structure.

Can’t there be a single data structure that allows cell operations (via index) and column operations. ?

If trying to boostrap rate curves, check out Yields.jl and let me know if you have any feedback!

Hi Milan

If I understand you correctly, this is exactly what I was hoping for.
So AcceleratedArray provides efficient index lookups as you show in the example with findfirst().

I was considering the case where the index column; has unique values, is specified in advance and the lookup value is specified as a string literal e.g. β€œSally”.

So within @rtransform the system could check for any row lookups first.
Something like Sally:age would be identified and the cell value (42) cached.
This would then be applied within each row level calculation.

If something like this were possible that would be amazing :slight_smile:


I have a dream of a space where you can specify a data-structure once (column names, indexes, joins, etc) Then within a block you just need column, row or cell names. You don’t need to repeat the dataset name or any of the structure information. MDX supports this but isn’t a general purpose language.
Within a general purpose language, the @chain block with @rtransform @rsubset etc are the closest I’ve seen.