Future directions for DataFrames.jl

rafael.guerra · August 19, 2021, 7:15am

What do you think about the solution proposed by @gcalderone? Linking also his post.

nalimilan · August 19, 2021, 10:24am

AFAICT the problem with that solution is that the metadata will be lost after calling any function that is forwarded to the DataFrame method. The main point of having per-column metadata is that it should be preserved, e.g. when adding/removing/transforming columns, subsetting rows, joining, etc. Whether it’s done within the DataFrame implementation itself or using another type, it requires special code in many places to do that.

(Metadata.jl can already be used to attach metadata to a DataFrame object, without even changing its type, but it has the same problem.)
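
To make the forwarding problem concrete, here is a minimal sketch using only Base Julia. `ToyTable`, `AnnotatedTable`, `subset_rows` and `subset_rows_keepmeta` are hypothetical names invented for illustration; they are not part of DataFrames.jl or Metadata.jl. The point is that a naively forwarded method returns the inner type, so the metadata is gone unless every operation gets its own metadata-aware code.

```julia
# Toy stand-ins for illustration only; none of these names come from DataFrames.jl.
struct ToyTable
    columns::Dict{Symbol,Vector{Float64}}
end

# Wrapper that carries per-column metadata next to the table.
struct AnnotatedTable
    table::ToyTable
    colmeta::Dict{Symbol,String}
end

# An operation defined on the plain table: keep only the rows in `idx`.
subset_rows(t::ToyTable, idx) =
    ToyTable(Dict(name => col[idx] for (name, col) in t.columns))

# Naive forwarding: delegates to the inner table and returns a bare ToyTable,
# so the wrapper (and its metadata) is dropped.
subset_rows(a::AnnotatedTable, idx) = subset_rows(a.table, idx)

# Metadata-aware version: the kind of special code each operation would need.
subset_rows_keepmeta(a::AnnotatedTable, idx) =
    AnnotatedTable(subset_rows(a.table, idx), copy(a.colmeta))

annotated = AnnotatedTable(
    ToyTable(Dict(:height => [1.6, 1.7, 1.8])),
    Dict(:height => "unit: metres"),
)

forwarded = subset_rows(annotated, 1:2)           # plain ToyTable, no metadata
preserved = subset_rows_keepmeta(annotated, 1:2)  # still an AnnotatedTable

println(forwarded isa AnnotatedTable)
println(preserved.colmeta[:height])
```

Copying the metadata is trivial for row subsetting, but operations such as joins or column transformations would each need their own rule for what happens to it, which is where the "special code in many places" comes from.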

ImreSamu · August 19, 2021, 3:58pm

IMHO:
“Handling data larger than fits into RAM” is a good candidate
for a Julia “Diversity and Inclusion initiative.”

If we try “recruiting folks from underrepresented communities to learn Julia”
and they have only ~1 GB / 2 GB / 4 GB of RAM,
then DataFrames is not the best tool for them to learn.

In this case, SQLite.jl is better for teaching (and for “handling data larger than fits into RAM”).
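
As a rough sketch of what that could look like, assuming SQLite.jl is installed: the data lives in an on-disk database file and the aggregation runs inside SQLite, so Julia only ever holds the small result set. The table name `readings`, its columns, and the sample values are made up for illustration.

```julia
using SQLite  # assumed to be installed; it re-exports DBInterface

# Store the database in a temporary file so the data sits on disk, not in RAM.
path = tempname() * ".sqlite"
db = SQLite.DB(path)

# Hypothetical table used only for this illustration.
DBInterface.execute(db, "CREATE TABLE readings (sensor TEXT, value REAL)")

insert_stmt = DBInterface.prepare(db, "INSERT INTO readings VALUES (?, ?)")
for (sensor, value) in [("a", 1.0), ("a", 3.0), ("b", 10.0)]
    DBInterface.execute(insert_stmt, (sensor, value))
end

# Grouping and averaging happen inside SQLite; Julia sees one row per sensor.
query = """
    SELECT sensor, AVG(value) AS mean_value
    FROM readings
    GROUP BY sensor
    ORDER BY sensor
"""
means = Dict{String,Float64}()
for row in DBInterface.execute(db, query)
    means[row.sensor] = row.mean_value
end

println(means)
```

With a real dataset the insert loop would be replaced by a bulk load, but the query side stays the same regardless of how large the table is.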

davibarreira · June 2, 2022, 4:09pm

@bkamins, have there been any advances on this front for DataFrames.jl? I mean, regarding data larger than fits into RAM. I’m in need of it, so perhaps I can contribute somehow.

jpsamaroo · June 2, 2022, 8:09pm

@bkamins, @krynju and I have been investigating how to achieve out-of-core compute capabilities with DataFrames by using Dagger’s DTable. The idea is that you would wrap your DataFrames with DTables, and then do operations on those DTables, which would then automatically move data out to disk as necessary. This would allow doing operations on DataFrames which would normally take more RAM than is available on your system.

This sounds slightly complicated, but it’s actually pretty simple to do in practice, and I’ll be publishing a blog post in the next few weeks showing off how this all works and how to use it. The feature work for the out-of-core portion is basically done, just needing tests and docs before I merge it.

@krynju is actively working on implementing the DataFrames mini-language (select, transform, etc.) for the DTable, which will allow code that’s written for DataFrames to work as-is with the DTable (and will also be imbued with the power of out-of-core).

bkamins · June 2, 2022, 8:30pm

Yes, I am very excited about the work by @jpsamaroo and @krynju, and I hope that by JuliaCon 2022 there will be something available for user testing.

davibarreira · June 2, 2022, 8:48pm

Oh boy, that would be awesome! After using DataFrames.jl, I just find it extremely painful to go back to Python’s alternatives. I find DataFrames.jl just so intuitive. I can code without having to check references on how to do something every minute or so.

xiaodai · June 3, 2022, 4:23am

I am working on this: an extension of the disk.frame brand from R. But it’s quite slow moving. I want to get the fundamental building blocks right.
