Any plan for functionality like Pandas loc?

Lincoln_Hannah · May 16, 2022, 7:15am

A cool feature of Pandas is the ability to set an index on a table.
Thereafter you just specify table name and index value you want to use.

Combined with Julia tools like @rtransform a DataFrame cell could be specified with just Index_Value:ColumnName. Something like:

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])


@chain df begin

    @setindex        :name


    @rtransform     :Age_relative_to_Sally   =   :age  -  Sally:age

end

bkamins · May 16, 2022, 7:55am

The use case you propose is currently supported by groupby, so you would write:

julia> @chain df begin
           groupby(:name)
           @rtransform(_, :Age_relative_to_Sally = (:age - only(_[("Sally",)].age)))
       end
3×4 DataFrame
 Row │ name    age      children  Age_relative_to_Sally
     │ String  Float64  Int64     Float64
─────┼──────────────────────────────────────────────────
   1 │ John       23.0         3                  -19.0
   2 │ Sally      42.0         5                    0.0
   3 │ Kirk       59.0         2                   17.0

Indeed, it is slightly verbose, so maybe @pdeffebach would consider adding support for something shorter than only(_[("Sally",)].age)) or _[("Sally",)].age) (if unique value is not expected and user wants to work on a vector). Though IMO it is not extremely verbose.

The problem with index is that it is hard to propagate it (note e.g. the constant need to reindex in Pandas; note that for this reason many other ecosystems - like in R both tidyverse and data.table do not support row index). Also it is in general such expressions as you have originally written are not clear if index is guaranteed to be unique or not.

Having said that, I recognize that many users like row indices - if someone proposed a consistent set of rules that would add significant value over using groupby then we could consider adding it (it would be best if such a proposal were opened as an issue in DataFrames.jl).

Side note: you can relatively easily do your operation without setting an index (i.e. without groupby), e.g.:

julia> @chain df begin
           @rtransform(_, :Age_relative_to_Sally = (:age - only(_.age[_.name .== "Sally"])))
       end
3×4 DataFrame
 Row │ name    age      children  Age_relative_to_Sally
     │ String  Float64  Int64     Float64
─────┼──────────────────────────────────────────────────
   1 │ John       23.0         3                  -19.0
   2 │ Sally      42.0         5                    0.0
   3 │ Kirk       59.0         2                   17.0

bkamins · May 16, 2022, 8:12am

Also maybe interested users can comment in Make row lookup easier · Issue #3051 · JuliaData/DataFrames.jl · GitHub as there I proposed a simpler syntax for this.

rocco_sprmnt21 · May 16, 2022, 10:46am

does something like this come close to your expectations?

df.Age_relative_to_Sally   =   df.age  .-  df[df.name .== "Sally",:].age[1]

Lincoln_Hannah · May 16, 2022, 11:46am

That works but you’re re-stating the DataFrame name 4 times.
The nice thing about @rtransform is you just specify the column names. e.g :col3 = :col1 + :col2.
I just thought it would be lovely if this could extend to row names (by specified index). eg RowIndex:ColName.

Bogumił, thanks for the link. I’ll put an example in the discussion. May be of use.

pdeffebach · May 16, 2022, 12:47pm

The problem here, and why it’s unlikely to be implemented in DataFramesMeta, is that any off-the-shelf solution is likely to be quite slow. Even if we implemented convenient syntax for df[df.name .== "Sally", :], that would be very slow, since every time you call that expression, it has to find the name equal to "Sally" and it isn’t cached.

I’m hesitant to introduce something in DataFramesMeta which is going to always be slow.

I believe there is a package for which x .== "A string" is fast due to some internal caching. Maybe someone can work out a solution which integrates that package with DataFrames a bit more. It would be an interesting experiment.

aplavin · May 16, 2022, 1:33pm

IndexedTables.jl can be useful for this usecase. Even more generally, it supports multiple indexing levels.

pdeffebach · May 16, 2022, 1:37pm

Is there a vector type that has fast lookup we can spit out into it’s own package?

aplavin · May 16, 2022, 2:32pm

There is a package with exactly accelerated/indexed vectors: GitHub - andyferris/AcceleratedArrays.jl: Arrays with acceleration indices.

nalimilan · May 16, 2022, 3:17pm

Yes, one of the reasons we haven’t implemented anything special in DataFrames is that it’s possible to get an efficient index column using findfirst on an AcceleratedArray column (e.g. df.age[findfirst(==("Sally"), df.name)]). We could make this easier to discover and use though, by documenting it and providing a shorter syntax for this. Indeed what’s tricky for users in the context of @rtransform is that one has to access the whole vector for the lookup, which defeats the convenience of this macro. If the name is a literal, the lookup could ideally happen only once for all rows, making it much more efficient.

rocco_sprmnt21 · May 16, 2022, 9:58pm

this is a version with in the name of the dataframe repeated only twice.
Although two more are needed to define the index.

the=Dict(zip(df.name,Tables.rowtable(df)))

df.Age_relative_to_Sally   =   df.age  .-  the["Sally"].age
df.Age_relative_to_Sally   =   df.age  .-  the["Sally"][:age]

Lincoln_Hannah · May 17, 2022, 12:07am

That’s pretty nice. but the “the” object loses the other DataFrame functionality.

Here is another example.
I have a table of FX Forward rates I’m using to bootstrap discount factors. Columns are (Currency Tenor Rate). Tenor values are: ON TN SPOT 1W 2W 1M 3M 6M 1Y 2Y.
The bootstrap is done separately per currency so first I do @groupby :Currency

I’d like to then do @setindex :Tenor
And, using IndexValue:Column notation, the formulas would be

:Fwd        =  SPOT:Rate  +  :Rate
SPOT:Fwd    =  SPOT:Rate
TN:Fwd      =  SPOT:Rate  -  TN:Rate
ON:Fwd      =  TN:Rate    -  ON:Rate
TodayRate   =  ON:Fwd
:Discount   =  :Fwd  /  TodayRate  *  USD_Discount(:Tenor)

Since this doesn’t exist I’ve converted the columns to named arrays (with names = Tenor).
The formulas become

Fwd          =  Rate[:SPOT]  + Rate
Fwd[:SPOT]   =  Rate[:SPOT]
Fwd[:TN]     =  Rate[:SPOT]  - Rate[:TN]
Fwd[:ON]     =  Rate[:TN]    - Rate[:ON]
TodayRate    =  Fwd[:ON]
Discount     =  Fwd  /  TodayRate * USD_Discount(:Tenor)

Nammed arrays are good for this since you can do both dictionary style lookups and vector operations.

I’d just prefer to do it all in DataFrames, without loosing the readability of the formulas.

rocco_sprmnt21 · May 17, 2022, 7:16pm

The test structure that came to my mind earlier was the following, based on namedtuples.
Then I thought that a dictionary would be preferable to do research.


the=(;zip(Symbol.(df.name),Tables.rowtable(df))...)


df.Age_relative_to_Sally   =   df.age  .-  the.Sally.age
df.Age_relative_to_Sally   =   df.age  .-  the[:Sally].age
df.Age_relative_to_Sally   =   df.age  .-  the[:Sally][:age]

The first construction seems to have the characteristics you say.
But I have no idea how many contraindications it may have.

julia> dfthe=DataFrame(the)
3×3 DataFrame
 Row │ name    age      children 
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ John       23.0         3
   2 │ Sally      42.0         5
   3 │ Kirk       59.0         2


julia> dfthe.age_rel_Sally=dfthe.age .- the.Sally.age
3-element Vector{Float64}:
 -19.0
   0.0
  17.0

julia> dfthe
3×4 DataFrame
 Row │ name    age      children  age_rel_Sally 
     │ String  Float64  Int64     Float64
─────┼──────────────────────────────────────────
   1 │ John       23.0         3          -19.0
   2 │ Sally      42.0         5            0.0
   3 │ Kirk       59.0         2           17.0

Note that although the data frames are defined on different structures, they are the same


df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
dfthe=DataFrame((;zip(Symbol.(df.name),Tables.rowtable(df))...))

julia> dfthe==df
true

Lincoln_Hannah · May 18, 2022, 1:25am

That’s pretty good. Using that could write.

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

@chain df begin

    @aside x = (;zip(Symbol.(_.name),Tables.rowtable(_))...)
    
    @rtransform  :Age_relative_to_Sally    =     :age    -    x.Sally.age 

end

But to access a cell from a newly created df column you’d have to re-create the named tuple structure.

Can’t there be a single data structure that allows cell operations (via index) and column operations. ?

Alec_Loudenback · May 20, 2022, 12:53pm

If trying to boostrap rate curves, check out Yields.jl and let me know if you have any feedback!

Lincoln_Hannah · May 27, 2022, 4:09am

Hi Milan

If I understand you correctly, this is exactly what I was hoping for.
So AcceleratedArray provides efficient index lookups as you show in the example with findfirst().

I was considering the case where the index column; has unique values, is specified in advance and the lookup value is specified as a string literal e.g. “Sally”.

So within @rtransform the system could check for any row lookups first.
Something like Sally:age would be identified and the cell value (42) cached.
This would then be applied within each row level calculation.

If something like this were possible that would be amazing

I have a dream of a space where you can specify a data-structure once (column names, indexes, joins, etc) Then within a block you just need column, row or cell names. You don’t need to repeat the dataset name or any of the structure information. MDX supports this but isn’t a general purpose language.
Within a general purpose language, the @chain block with @rtransform @rsubset etc are the closest I’ve seen.

Topic		Replies	Views
Why is it so complicated to access a row in a DataFrame? General Usage dataframes	13	1975	August 25, 2023
String Index for DataFrames Data question	1	2306	September 4, 2019
Data structure for convenient access to tabular data General Usage dataframes	5	390	February 13, 2023
Frustrated using DataFrames New to Julia dataframes , data_structures	97	10530	April 22, 2022
DataFrame: how to change value of a cell without knowing the row number Data dataframes	23	7166	January 25, 2023

Any plan for functionality like Pandas loc?

Related topics