Julia stats, data, ML: expanding usability

I think it makes sense to build yourself a data analysis sysimage using PackageCompiler.jl, where CSV.jl, GLM.jl, Plots.jl, Distributions.jl, and everything else you use is precompiled.
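A minimal sketch of what that looks like (the package list and output path are illustrative; adjust them to your own stack, and note the build can take a while):

```julia
# Build a custom sysimage with the data stack precompiled.
using PackageCompiler

create_sysimage(
    [:CSV, :DataFrames, :GLM, :Distributions, :Plots];
    sysimage_path = "data_analysis.so",
)
# Afterwards, start Julia with:
#   julia --sysimage data_analysis.so
# and `using CSV, DataFrames, ...` loads with much less latency.
```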

DataFrames.jl is literally in the top 3 of all the H2O benchmarks, and faster than many other widely used tools. Could you elaborate a bit more? I’ll post the link here for convenience.

https://h2oai.github.io/db-benchmark/

3 Likes

I would like to request that we not digress into a discussion on compile times here. There are several other threads and discussions around that.

2 Likes

I should say it is not literally in the top 3; there is more to those results. BTW, that is my point: if you are into Python, you have the fastest solution, and you even have a solution that can handle almost any problem size. If you are into R (which is famous for being slow), you still have a solution that is better than or as good as your Julia solution. So what does DataFrames.jl offer instead?

I am not trashing Julia; I am just looking at it from another angle.

Also, this benchmark is a bit out of date; DataFrames.jl 1.2 had some pretty nice speedups.

Regarding the design of DataFrames.jl, I encourage you to open a separate thread; it would be great to discuss it. Recently we had a similar discussion here, and such discussions help to improve the package and the ecosystem in general. There, I propose, we can also discuss the differences between DataFrames.jl and other ecosystems. But to give you one of the design principles: a DataFrame object is a light wrapper that stores any column you pass to it (as long as it is an AbstractVector). This flexibility has its benefits and costs, but it was the choice and design intention of the original package authors:

  • To give you an example of a benefit: you do not have the situation you have in Polars, where if you want to take a column from a data frame and use it with NumPy you have to perform a conversion, because their native storage format is different. Another benefit: we have full support for views, as opposed to other ecosystems (which matters in practice when you have large data and not infinite memory; this is especially relevant for wide tables).
  • To give an example of a cost: in data.table one can sort a data frame by a key column, and data.table then sets a mark that the data frame is sorted. This information is later used to speed up some operations. We cannot do such a thing in DataFrames.jl because of the flexibility we provide.
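The "light wrapper" point can be seen directly at the REPL (a small sketch; `copycols=false` asks the constructor not to copy the column):

```julia
using DataFrames

v = [1, 2, 3]
df = DataFrame(x = v, copycols = false)  # store v itself: no copy, no conversion
@assert df.x === v                       # the column *is* the original vector

sv = @view df[1:2, :]                    # views are fully supported
@assert sv isa SubDataFrame
```

Because the column is the very same vector you passed in, any array type from any package works as a column without a dedicated storage format.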

Regarding the H2O benchmarks: unfortunately, they have been stalled since mid-June (the old maintainer, who was doing a great job, was moved to other tasks AFAICT). I would assume that both Polars and DataFrames.jl would look different now (these are two of the leading packages that are actively developed and have regular releases). Having said that, to repeat a comment I already made some time ago, we should not expect DataFrames.jl to be faster than e.g. Polars. Under the hood, both go through the LLVM infrastructure, so if we used the same algorithms the performance would ultimately be similar.

13 Likes

With Julia, the single most important thing to me is clear semantics. I know what the heck Julia code means.

After that, the composability… If I want to shove something into something else I can. Differentiate through an agent based model? Sure… Put colors into my DataFrames? Sure.

Finally, speed. It’s all compiled to machine code with special methods for each type. If I want some functionality, I write it in Julia, not in C.

If you are largely a consumer of other people’s code, you are less likely to care about Julia vs Python or R. But as soon as you want to develop some functionality… you just can’t do it in Python or R; it has to be done in C or C++.

8 Likes

Those of you who find macros clearer than nonstandard evaluation, can you explain why? Is it just because they are delineated by the @ sign?

I totally agree. As a biology PhD student with no experience in writing fast code, I recently achieved very big speedups by converting some R functions (actually mostly C under the hood) that were too slow for large-ish datasets to Julia. I couldn’t/wouldn’t have even attempted this in R or Python.

Also, I’m not sure what need there is for a fresh approach to GLMs. I tend to use Bayesian methods myself; my formal Bayesian training was under one of the Stan core devs, but I always prefer to use Julia PPLs. However, I do sometimes need frequentist stats, and in such cases I can’t see anything wrong with trying to make the system largely similar to R.

6 Likes

With macros, the transformation depends only on the macro and the syntax, so if you see a macro you can know what expression it turns into. With non-standard evaluation, the way a function works can depend on both syntax and runtime values, so you’re never quite sure what happens; at least I always have that lingering feeling when using R.

5 Likes

Just to add (it was commented above, but it is very relevant, so I think it is worth stressing): with @macroexpand you can just check it.
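For instance, at the REPL (the exact lowered form it prints varies across Julia versions):

```julia
# See exactly what expression a macro call turns into, without running it.
# @macroexpand does not evaluate its argument, so `x` need not be defined.
ex = @macroexpand @assert x > 0
@assert ex isa Expr   # the expansion is an ordinary expression you can inspect
```

Since the expansion depends only on the macro and the syntax of the call, this inspection is always possible — unlike non-standard evaluation, where behavior can hinge on runtime values.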

1 Like

Much of what’s worth saying is already in Fexpr - Wikipedia (after observing the similarities between NSE and fexpr’s) and https://dl.acm.org/doi/10.1145/3359619.3359744.

2 Likes

Perfect! Thanks @johnmyleswhite . I love the fact that Kent Pitman discouraged fexprs almost my lifetime ago. I have nothing but huge respect for him, and IMHO he is clearly correct. It’s not just the @, although having an indicator of macros is super helpful. It’s that nonstandard evaluation simply cannot be analyzed by reading the code, since it depends on the runtime values of arguments.

1 Like

This has been a fascinating thread and slide deck to read. I’m primarily an R user (applied econometrics and data science) who is also pretty jazzed about a lot of the features that Julia has to offer. So, I’d like to offer some thoughts coming from that background.

  • It’s already been mentioned above, but the documentation across many Julia packages remains really quite poor. There are some important exceptions to this (e.g. DataFrames.jl is excellent), but it includes some key packages in the DS/econometrics stack and is extremely off-putting for new users. I would focus on fixing documentation before addressing any of the more abstract issues (e.g. row vs column orientation). As an aside, documentation in R was also quite poor and esoteric until about five years ago. Stata users would always point that out to me as a reason for not switching, despite other obvious advantages. I personally think some of the tidyverse benefits are oversold — compared to say, data.table — but the tidyverse and RStudio team definitely deserve plaudits for moving the needle forward here for the R ecosystem as a whole.

  • Missing values. I understand the technical barriers and conceptual breakthroughs that were needed to handle missing values in a general-purpose framework, and I see a lot of Julia devs quite pushy and pleased with themselves about this. But from a user perspective, missing values in Julia were a real PITA when I first started experimenting with its DS ecosystem. Code that worked fine in any of the other major DS languages would fail in Julia because of an obscure missing-values issue that needed to be handled explicitly. Maybe this has been sorted out since, but it ties in to my previous point about documentation. Missing values are the norm in any real-world dataset, and yet to find the necessary fix I had to consult the main Julia manual instead of (a) just having the package handle it for me, or (b) having an explicit example in the package README/docs.

  • R has been able to overcome a fairly fragmented ecosystem and multiple OO paradigms — indeed, arguably actively exploit them — through a few key packages that provide standardization methods across model classes. To highlight two that make a big difference in my everyday workflow: 1) broom provides “tidiers” for extracting consistent model summaries and goodness-of-fit information in data.frame format. 2) sandwich provides variance-covariance matrix methods that make it easy to adjust standard errors for almost any model class (a big deal in econometrics). Packages like these lead to outsize downstream benefits, since e.g. it makes it easy(ier) to create packages for exporting regression tables and coefficient plots regardless of model object (which is what the also excellent modelsummary package does). I had hoped something similar could be done fairly easily in Julia because of multiple dispatch and would love to see it, regardless.

  • Earlier it was remarked that GLM.jl doesn’t offer anything beyond what can be done in equivalent routines in other languages. But for me this is a feature not a bug! Wherever possible, I want exactly the same interface and results as I’ve come to experience in, say, R. I agree, however, that precompiling canned routines (which I thought was done by default in Julia 1.6?) is important to avoid sluggish TTFP/first-time performance. Speaking of which…

  • Personally, my immediate motivation for using Julia in a project is for some bespoke computation (e.g. a structural estimation). If I’m being brutally honest, there’s no gain to be had from switching out my applied econometrics stuff for which the canned routines in R (via C and Fortran) are already at maximum performance and coverage. And… that’s fine. The interoperability between these languages is good enough that it’s no problem for me to switch between them for any one particular task. Taking a step back, I often find myself opening up Julia just to play around. It’s just an incredibly fun and performant language to work in. (Congratulations and thanks to everyone involved!) I’m particularly excited about the ease of GPU integration going forward. That stuff is much easier in Julia than R or Python and I think could be a real source of comparative advantage in the years to come.

19 Likes

Unfortunately I have to agree. This post tries to summarize the most important patterns that can be used to handle missing values in practice.

Still, it is very difficult to handle them consistently in all cases. Interested readers can comment in this PR, where we are discussing with @nalimilan what the preferred result of a sum∘skipmissing row-wise reduction applied to several columns of data should be.
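To make the question concrete, this is the kind of row-wise reduction under discussion (a sketch; what should happen for an all-missing row is exactly the open question, so this example avoids that case):

```julia
using DataFrames

df = DataFrame(a = [1, missing], b = [3, 4])

# Row-wise sum, skipping missings within each row:
res = combine(df, AsTable(:) => ByRow(sum ∘ skipmissing) => :rowsum)
@assert res.rowsum == [4, 4]
```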

2 Likes

Exactly. Hence this: Data Access Pattern — MLDataUtils.jl v0.1 documentation

1 Like

What kind of operations are you thinking about in particular? We’re aware that missing values are annoying when working with data frames. In addition to what @bkamins mentioned, this could be improved by adding a keyword argument to propagate them automatically (see this issue), and DataFramesMeta could make this more convenient (@pdeffebach recently started adding support for @passmissing for this, see this PR).

Support for missing values in other packages is more difficult to fix, as the solutions vary. Many stats packages were written before missing existed in Base, so they don’t support it at all.

I agree we could use more missing utilities.

It’s a tough balance because while I put a lot of effort into making working with missings easier in DataFramesMeta, that can result in some lock-in which doesn’t benefit the whole community.

Overall, we could also do a better job advertising. There’s nothing wrong with just peppering your code with passmissing as needed. It’s certainly more annoying than R and Stata’s near-universal propagation, but it’s not so bad.
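For reference, `passmissing` (from Missings.jl) wraps a function so that it returns `missing` whenever any argument is `missing`, instead of throwing a MethodError:

```julia
using Missings

# uppercase("abc") works, but uppercase(missing) would error;
# the wrapped version propagates missing instead.
safe_upper = passmissing(uppercase)

@assert safe_upper("abc") == "ABC"
@assert ismissing(safe_upper(missing))
```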

Recent work on improving missing values support can be found in this stale PR. As you can see, it’s gotten held up by discussion of what the behavior should be. Once you get into the details, it’s pretty hard to determine the correct behavior. We could definitely use some expertise on what the most intuitive thing to do is.

A tension exists between covering the fundamentals and innovating.

To have any credibility at all, the language ecosystem has to cover the fundamentals. That’s a huge hill to climb – essentially implementing every statistical and machine learning method ever coded. I read a comment on another forum that said, “You can’t take Julia seriously for statistics, they don’t even have Generalized Additive Models [GAMs]!” The go-to package for GAMs is in R, and it’s a big program. Support for ANOVA isn’t that great (and you can’t get more fundamental than that) – but coding a comprehensive ANOVA program is a huge amount of work.

But your point is those are boring and available elsewhere, so they are not a reason to convert to Julia. Very valid point!

From a marketing standpoint, what is Julia’s “hook”, “killer feature”, or “what they have that others are without”?

  • JuMP for optimization appears to be extremely popular
  • Using Julia for automatic differentiation

Beyond that, it’s language features

  • solves the two language problem
  • JIT compilation for speed
  • developed in the era of multiprocessing, so supports it by design

Also important are multiple dispatch and the syntax, but I think those are harder to market.

To relate this back to the original post, how could the statistics and machine learning ecosystem be improved to make it a marketing advantage for Julia? “You need to use Julia for statistics, because everything just works together without the annoyances of other languages!”

One proposal to do that was to standardize the organization of data – at least for 2 dimensions – with columns being variables and rows being observations. If every stats package followed this convention, it would be easier on users, as they wouldn’t have to prepare the data differently for different packages.

I would actually go beyond that, though: support both. If the user has variables in rows and observations in columns, just transpose it for them (although watch out for that 4 GB matrix being fed in). Multiple dispatch can really help in making functions work no matter what the user throws at them. I found it infuriating to use Python – supposedly a loosely typed language – as I would often get function errors that said, “you need to send us data as type ‘X’”. Say I sent Ints and it wanted Floats: why didn’t it just convert them for me? With multiple dispatch, this is easily done.
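A hypothetical sketch of that “just convert it for me” pattern via multiple dispatch (the function name and the stand-in computation are illustrative):

```julia
# Core method works on floats; a second method accepts integers
# and converts them before re-dispatching to the core method.
myfit(x::AbstractVector{<:AbstractFloat}) = sum(x) / length(x)  # stand-in computation
myfit(x::AbstractVector{<:Integer})       = myfit(float.(x))    # convert for the user

@assert myfit([1.0, 2.0, 3.0]) == 2.0
@assert myfit([1, 2, 3]) == 2.0   # Ints are converted automatically
```

The same idea extends to orientation: one could add a method (or keyword) that transposes a variables-in-rows matrix before calling the core routine, so the user never sees a type error.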

4 Likes

What Julia has is quite a few nice optimization routines. GAMs are just one of many approaches to fitting shapes to data. A very nice alternative is radial basis functions: there’s a nice proof that, as the number of centers goes to infinity, they are dense in the space of smooth functions, and they are relatively trivial to implement yourself. I’ve thought about putting together an RBF package. One particularly useful form for Bayesian inference is compact radial basis functions using translations of the smooth bump function

exp(1-1/(1-x^2)) for x in (-1,1), 0 otherwise

The nice thing about this is that it’s a localized function and so the coefficients become decorrelated at a certain radius, reducing the complexity of sampling.
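A minimal sketch of that bump function and a compact RBF expansion built from translated, scaled copies (the names and the expansion form are illustrative, not a published package API):

```julia
# Smooth, compactly supported bump, scaled so bump(0) == 1.
bump(x) = abs(x) < 1 ? exp(1 - 1 / (1 - x^2)) : 0.0

# f(x) = Σᵢ wᵢ * bump((x - cᵢ) / r).  Because each term vanishes outside
# |x - cᵢ| < r, coefficients of centers farther apart than 2r decouple.
rbf(x, centers, weights, r) =
    sum(w * bump((x - c) / r) for (c, w) in zip(centers, weights))

@assert bump(0.0) == 1.0
@assert bump(1.5) == 0.0
@assert rbf(0.0, [0.0, 10.0], [2.0, 5.0], 1.0) == 2.0  # far center contributes nothing
```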

I think one of the differences between the people who are attracted to Julia and the people attracted to R/Stata/SAS etc is that Julia is more of a toolkit, and R/Stata/SAS etc are more of a pushbutton calculator kinda thing.

1 Like