Hierarchical or multi-index for data frames


#1

Hi,
I’ve been looking around for data frame implementations. I found that https://github.com/JuliaData/DataFrames.jl does not support a hierarchical or multi-index approach.
For clarification, I’m coming from pandas where multi-index is described here: https://pandas.pydata.org/pandas-docs/stable/advanced.html#multiindex-advanced-indexing

After a bit of searching, I am under the impression that multi-index support in DataFrames.jl is left out by design. This was mentioned in Aug 2017 in the thread “A spreadsheet-like pivot function”.

I’m now wondering if it is really left out by design and what the reasons are behind that decision. Is it mainly performance-driven?

What are the alternatives?
As a workaround, I could use a dict of data frames. Are there other libraries that I’m missing?
@nalimilan mentioned NamedArray, AxisArray and IndexedTable as options back then.

As an example, I want to implement the following pseudo-code:

df = DataFrame({"AAPL": {"open": [1, 2, 3, 4], "close": [1, 2, 3, 4]}, "MSFT": {"open": [10, 20,...], "close": [10, 20,...]}})
df['close'].max() # max for AAPL and MSFT
df['range'] = df['close'] - df['open'] # calculates for both AAPL and MSFT

What is the idiomatic approach for this in Julia?


#2

Not really, in the sense that DataFrames has never had a strong “design”: it is the oldest of the “tabular” packages in Julia (in fact, it is the sixth-oldest package). Overall, it was roughly influenced by R data.frames, but feature additions have mainly been driven by what users have needed for their own work.

However, the tabular data ecosystem suffers a bit from the Lisp Curse, in that it is often easier to write a new package that solves your problem than to shoehorn your solution into an existing package, so a number of other tabular-related packages have been developed over time. For your particular problem, I would suggest IndexedTables, which has been designed for this exact use case. You might also be interested in JuliaDB, which is a distributed extension of IndexedTables, but unfortunately it is not yet running on Julia 1.0.


#3

Actually, I think basic design principles have been followed quite consistently over time. DataFrames is indeed inspired by R’s data.frame, and by the more modern dplyr/tibble approach, rather than by Pandas.

I’m not very familiar with Pandas, but AFAIK you can achieve the same thing without multi-indexes using subsetting or split-apply-combine operations. It would be interesting to think about how this kind of feature could be added to DataFrames, and whether it would conflict with other features. Indexes could be represented by special columns, which could also be used as primary keys to increase the performance of grouping operations.
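
As a rough sketch of what that could look like today, assuming a recent DataFrames.jl (1.x) and made-up data: the would-be index levels are stored as ordinary columns, rows are selected by subsetting, and a grouped data frame can be looked up by key values, which covers part of what a multi-index is used for.

```julia
using DataFrames

# The would-be index levels (symbol, seqnum) are ordinary columns.
# The values are invented for illustration.
df = DataFrame(
    symbol = ["AAPL", "AAPL", "MSFT", "MSFT"],
    seqnum = [1, 2, 1, 2],
    close  = [2, 4, 11, 13],
)

# Subsetting: select all rows for one symbol with a boolean mask.
aapl = df[df.symbol .== "AAPL", :]

# Grouping on a key column gives a structure that can be indexed by key value.
by_symbol = groupby(df, :symbol)
msft = by_symbol[("MSFT",)]          # view of the MSFT rows

# Grouping on both key columns allows lookup by the full (symbol, seqnum) key.
by_both = groupby(df, [:symbol, :seqnum])
msft_2_close = by_both[("MSFT", 2)].close
```

This is only an approximation of a multi-index: the keys are not enforced to be unique, and nothing here is specific to primary keys.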


#4

I’ve never used it personally but there is an implementation of something like this already: https://github.com/invenia/KeyedFrames.jl

It could be nice to add something like this to DataFrames for consistency with IndexedTables. At some point I was playing with the idea of replacing ColDict from IndexedTables (an untyped IndexedTable that allows replacing columns with columns of a different type) with a DataFrame, but the one thing missing was support for primary keys in DataFrames (also, DataFrame requires names for the columns, but from what I understand IndexedTables may do so too in the future).


#5

Thank you for your responses.
The Lisp Curse essay has been eye-opening.
I always considered it positive for a programming language when library development was simple and similar to application development. Now I see that it makes consensus much harder.

I’m not very familiar with Pandas, but AFAIK you can achieve the same thing without multi-indexes using subsetting or split-apply-combine operations.

You certainly can achieve this in many different ways. If you store as follows, you can add and subtract columns, completely ignoring the symbols.

symbol | seqnum | open | close
AAPL   | 1      | 1    | 2
AAPL   | 2      | 3    | 4
MSFT   | 1      | 10   | 11
MSFT   | 2      | 12   | 13
....
df['close'].max() # would require some grouping first but should not be too cumbersome
df['range'] = df['close'] - df['open'] # should just do the right thing
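
For comparison, a minimal sketch of the same two operations on this long layout in Julia, assuming a recent DataFrames.jl (1.x) with `groupby` and `combine`. The data is the four rows from the table above; the column name `max_close` is just a name chosen for the example.

```julia
using DataFrames

# Long layout: one row per (symbol, seqnum) pair.
df = DataFrame(
    symbol = ["AAPL", "AAPL", "MSFT", "MSFT"],
    seqnum = [1, 2, 1, 2],
    open   = [1, 3, 10, 12],
    close  = [2, 4, 11, 13],
)

# Element-wise column arithmetic; the symbol column plays no role here.
df.range = df.close .- df.open

# Per-symbol maximum of `close`: group by symbol, then reduce each group.
max_close = combine(groupby(df, :symbol), :close => maximum => :max_close)
```

The result `max_close` has one row per symbol, in order of first appearance.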

Can’t really decide whether I find this “good enough” right now.

Is there an overview on what’s happening in this domain right now?
I’m aware of:

Interfaces that aim to translate between different implementations appear to be:

What’s the best way to find out how these implementations compare? Has someone written a meta-review of some sort?
Who are all the Julia* companies behind these projects? JuliaComputing is the main driver behind Julia, right?
Are JuliaData and JuliaArrays different companies, subdivisions of JuliaComputing or “casual” groups of people that happen to be interested in this topic?


#6

As far as I understand, of those you mentioned above, JuliaComputing is the only company; the other organizations are mostly volunteers. This doesn’t necessarily mean the packages there are maintained less promptly: for example, DataFrames was updated to Julia 1.0 before IndexedTables, and JuliaDB is, for technical reasons, still in the process of being ported. DataFrames is so widely used that the odds of it becoming abandoned or unmaintained are quite small. I think the APIs of DataFrames and IndexedTables / JuliaDB are slowly converging (and DataFramesMeta / JuliaDBMeta give you some macros to simplify things and also look quite similar to one another). If you can’t choose, the easiest approach is to go through the same tutorial for both packages and decide which one you like best:

  • DataFrames(Meta) (this is from a blog post, I wonder whether there is an “official” version of the tutorial)
  • JuliaDB and JuliaDBMeta (warning: this is not ported to Julia 1.0 yet, but may give you an idea of whether you find the API more or less intuitive)

#7

Concerning the problem at hand, I would tend to believe that for this specific use case the “long” format is better.

However, the cool thing is that in Julia things mix and match very easily. I’ve just realised that you can combine DataFrames with StructArrays for some nice tricks here. I still find the long format more convenient, but this is kind of fun, in the spirit of the Lisp Curse…

julia> using DataFrames, StructArrays

julia> df = DataFrame(AAPL = StructArray(open = 1:3, close = 2:4),
       MSFT = StructArray(open = 10:12, close = (10:12) .+ rand.()))
3×2 DataFrame
│ Row │ AAPL                  │ MSFT                         │
│     │ NamedTup…             │ NamedTup…                    │
├─────┼───────────────────────┼──────────────────────────────┤
│ 1   │ (open = 1, close = 2) │ (open = 10, close = 10.2042) │
│ 2   │ (open = 2, close = 3) │ (open = 11, close = 11.0493) │
│ 3   │ (open = 3, close = 4) │ (open = 12, close = 12.1066) │

julia> df.AAPL.open
1:3

julia> df.MSFT.close
3-element Array{Float64,1}:
 10.204195399872209
 11.04927211234995 
 12.106563317898413

julia> max_close(t) = maximum(t.close)
max_close (generic function with 1 method)

julia> aggregate(df, max_close)
1×2 DataFrame
│ Row │ AAPL_max_close │ MSFT_max_close │
│     │ Int64          │ Float64        │
├─────┼────────────────┼────────────────┤
│ 1   │ 4              │ 12.1066        │

The idea is that in Julia you can have an array-like object (which can therefore be used as a column) that iterates NamedTuples but also has accessible columns (the data is stored in an efficient columnar way).

I think the interplay between DataFrames and custom array types in Julia has a lot of promise, and this type of data nesting could be one interesting direction.

EDIT: just to clarify, I do not recommend really using any of the above - the long format seems better - but was curious to see how many things work already out of the box and I think adding a simple two-way converter between DataFrames and some AbstractArray of NamedTuples with columnar storage (where the same API as DataFrames would still work wherever possible) is definitely interesting.


#8

Apart from Julia Computing, which is a company, most Julia* organizations are just open teams of volunteers from various horizons, with lots of intersection between them. So the ecosystem isn’t as fragmented as it may seem.

Basically, the table implementations are currently DataFrames (and KeyedFrames, which is based on it), IndexedTables/JuliaDB, and TypedTables. DataFrames is the simple in-memory table storage, JuliaDB supports distributed and out-of-core operations (with IndexedTable being a simple in-memory version which is strongly typed), and TypedTables is another strongly-typed table. Whether strong typing is beneficial probably depends on whether you work with many tables with different column types or not, as the compilation overhead can be significant.

AxisArrays and NamedArrays are quite different I would say, they are more arrays than tables, even though they can have some use cases in common with tables.

Then there are currently two generic interfaces: Tables (which replaces DataStreams) and TableTraits. There’s hope that we could agree on a single common interface, but it’s still under discussion. As @piever said we would also like to converge to similar APIs when that’s possible.
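
To give an idea of what such a generic interface provides, here is a small sketch, assuming current versions of Tables.jl and DataFrames.jl (1.x) are both installed; the data is made up. A table can be converted to plain NamedTuple-based representations and back, without the packages involved needing to know about each other.

```julia
using DataFrames
import Tables

# Made-up data for illustration.
df = DataFrame(symbol = ["AAPL", "MSFT"], close = [4, 13])

# Column-oriented representation: a NamedTuple of vectors.
cols = Tables.columntable(df)

# Row-oriented representation: a Vector of NamedTuples.
rows = Tables.rowtable(df)

# Both plain-Julia representations are themselves Tables.jl tables,
# so either one can be passed to the DataFrame constructor.
df_from_cols = DataFrame(cols)
df_from_rows = DataFrame(rows)
```

The same pattern applies to other Tables.jl-compatible sources and sinks, which is what makes a single common interface attractive.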


#9

Again, thank you so much for your input.

Then there are currently two generic interfaces: Tables (which replaces DataStreams) and TableTraits. There’s hope that we could agree on a single common interface, but it’s still under discussion.

This thread really puts Tables.jl into context.

@piever: I really liked the two tutorials you linked. And nesting arrays into DataFrame seems really creative. Definitely has potential although probably not as standard operating procedure, like you said.


#10

To be fair, I don’t think the Lisp Curse essay linked above describes the current situation with Julia packages, particularly data frames & friends, very well.

My understanding is that to a large extent, experimentation is happening because people want to explore various approaches to make things fast and convenient at the same time. Julia has introduced a combination of parametric types, multiple dispatch and AOT compilation which is unprecedented, and people need to learn how to work with this efficiently. Also, the language is shaped by the results of these experiments (look at the history of the Nullable type).

It is best to approach Julia with the understanding that there will be many different approaches in the medium run before a few of them will emerge as a de facto standard. But the language is eminently capable of handling this, provided that there is a common interface.