What do you think about the solution proposed by @gcalderone? Linking also his post.
AFAICT the problem with that solution is that the metadata will be lost after calling any function that is forwarded to a DataFrame method. The main point of having per-column metadata is that it should be preserved, e.g. when adding/removing/transforming columns, subsetting rows, joining, etc. Whether it’s done within the DataFrame implementation itself or using another type, it requires special code in many places to do that.
(Metadata.jl can already be used to attach metadata to a DataFrame object without even changing its type, but it has the same problem.)
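To make the forwarding problem concrete, here is a minimal sketch in plain Julia. It does not use DataFrames.jl or Metadata.jl: `MetaTable`, `subset_rows` and `subset_rows_keepmeta` are hypothetical names invented for this illustration, and the "table" is just a NamedTuple of column vectors. It only shows why naive forwarding drops the metadata and why each operation needs its own rewrapping code.

```julia
# Hypothetical wrapper: a table (NamedTuple of column vectors) plus per-column metadata.
struct MetaTable{T<:NamedTuple}
    table::T
    colmeta::Dict{Symbol,String}
end

# An operation defined on the plain table only: keep the rows where `mask` is true.
subset_rows(t::NamedTuple, mask::AbstractVector{Bool}) = map(col -> col[mask], t)

# Naive forwarding: delegate to the wrapped table and return whatever comes back.
# The result is a plain NamedTuple, so the metadata is gone.
subset_rows(m::MetaTable, mask::AbstractVector{Bool}) = subset_rows(m.table, mask)

# Metadata-aware variant: extra code is needed to rewrap the result.
subset_rows_keepmeta(m::MetaTable, mask::AbstractVector{Bool}) =
    MetaTable(subset_rows(m.table, mask), copy(m.colmeta))

mt = MetaTable((x = [1, 2, 3], y = [10.0, 20.0, 30.0]),
               Dict(:x => "id", :y => "weight in kg"))

forwarded = subset_rows(mt, [true, false, true])           # plain NamedTuple
preserved = subset_rows_keepmeta(mt, [true, false, true])  # still a MetaTable
```

The same rewrapping would have to be written for every operation (joins, column transforms, and so on), which is the "special code in many places" mentioned above.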
IMHO:
“Handling data larger than fits into RAM” is a good candidate for a Julia “Diversity and Inclusion” initiative.
If we try “recruiting folks from underrepresented communities to learn Julia” and they have only ~1 GB / 2 GB / 4 GB of RAM, then DataFrame is not the best thing for them to learn.
In this case SQLite.jl is better for teaching (and for “handling data larger than fits into RAM”).
@bkamins, have there been any advances on this front for DataFrames.jl? I mean, regarding data larger than fits into RAM. I’m in need of it, so perhaps I can contribute somehow.
@bkamins, @krynju and I have been investigating how to achieve out-of-core compute capabilities with DataFrames by using Dagger’s DTable. The idea is that you would wrap your DataFrames with DTables, and then do operations on those DTables, which would then automatically move data out to disk as necessary. This would allow doing operations on DataFrames which would normally take more RAM than is available on your system.
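For readers unfamiliar with the out-of-core idea, the following is a rough sketch of the general principle only, written with Julia's standard library. It is not Dagger's API and not how the DTable is implemented; `spill_chunks` and `chunked_sum` are hypothetical helpers made up for this example. The data is split into chunks, each chunk is written to disk, and a reduction then loads one chunk at a time.

```julia
using Serialization  # standard library

# Split a column into chunks and write each chunk to its own file on disk.
function spill_chunks(values::AbstractVector, chunksize::Integer, dir::AbstractString)
    paths = String[]
    for (i, chunk) in enumerate(Iterators.partition(values, chunksize))
        path = joinpath(dir, "chunk_$i.bin")
        serialize(path, collect(chunk))
        push!(paths, path)
    end
    return paths
end

# Sum over all chunks while holding only one chunk in memory at a time.
function chunked_sum(paths)
    total = 0
    for path in paths
        chunk = deserialize(path)
        total += sum(chunk)
    end
    return total
end

dir = mktempdir()                    # temporary directory for the spilled chunks
paths = spill_chunks(1:10, 4, dir)   # three chunks: 1:4, 5:8, 9:10
result = chunked_sum(paths)
```

In this toy version the spilling is explicit; the point of the DTable approach described above is that the movement to disk would happen automatically behind the table operations.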
This sounds slightly complicated, but it’s actually pretty simple to do in practice, and I’ll be publishing a blog post in the next few weeks showing off how this all works and how to use it. The feature work for the out-of-core portion is basically done, just needing tests and docs before I merge it.
@krynju is actively working on implementing the DataFrames mini-language (select, transform, etc.) for the DTable, which will allow code that’s written for DataFrames to work as-is with the DTable (and will also be imbued with the power of out-of-core).
Yes, I am very excited about @jpsamaroo’s and @krynju’s work, and I hope that by JuliaCon 2022 there will be something available for user testing.
Oh boy, that would be awesome! After using DataFrames.jl, I just find it extremely painful to go back to Python’s alternatives. I find DataFrames.jl just so intuitive. I can code without having to check references on how to do something every minute or so.
I am working on this: an extension of the disk.frame brand from R. But it’s quite slow-moving. I want to get the fundamental building blocks right.