Regarding the design of DataFrames.jl, I encourage you to open a separate thread; it would be great to discuss it. We recently had a similar discussion here, and such discussions help improve the package and the ecosystem in general. There we could also discuss the differences between DataFrames.jl and other ecosystems, but let me give you just one of the design principles now. A DataFrame object is a light wrapper that stores any column you pass to it (as long as it is an AbstractVector). This flexibility has its benefits and costs, but this was the choice and design intention of the original package authors:
- To give you an example of a benefit: you do not end up in a situation like in Polars, where taking a column from a data frame and using it with NumPy requires a conversion because their native storage formats differ. Another example of a benefit: we have full support for views, as opposed to other ecosystems (this matters in practice when you have large data and limited memory, and it is especially relevant for wide tables).
- To give an example of a cost: in data.table one can sort a data frame by a key column, and data.table then marks the data frame as sorted. This information is later used to speed up some operations. We cannot do such a thing in DataFrames.jl because of the flexibility we provide.
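To make the "light wrapper" and views points concrete, here is a minimal sketch. The variable names (v, df, sdf) are just for illustration, and copycols=false is used so that the constructor keeps the vector you pass instead of copying it (the default is to copy):

```julia
using DataFrames

# Any AbstractVector can be a column; with copycols=false it is stored as-is.
v = [3, 1, 2]
df = DataFrame(x = v, copycols = false)

# The column is the very same object as `v`, so it can be used by any other
# Julia code without a conversion step.
@show df.x === v

# A view of the data frame does not copy the underlying data either.
sdf = view(df, 1:2, :)
@show sdf isa SubDataFrame
@show parent(sdf) === df

# Writing through the view changes the original vector.
sdf[1, :x] = 10
@show v[1]
```

The same sketch hints at the cost: since v stays an ordinary vector that anyone can mutate from outside the data frame, the data frame cannot reliably remember facts like "this column is sorted".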
Regarding the H2O benchmarks: unfortunately they have been stalled since mid-June (the old maintainer, who was doing a great job, was moved to other tasks AFAICT). I would assume that both Polars and DataFrames.jl would look different now (these are two of the leading packages that are actively developed and have regular releases). Having said that, to repeat a comment I made some time ago, we should not expect DataFrames.jl to be faster than e.g. Polars. Under the hood both go through the LLVM infrastructure, so if we used the same algorithms the performance would ultimately be similar.