[ANN] DataToolkit.jl — Reproducible, flexible, and convenient data management

DataToolkit.jl is part of a trio of packages that aim to provide an exceptionally convenient and extensible approach to data management. Compared to pipeline tools (e.g. Nextflow, Snakemake), it focuses on making the data handling in an individual project or one-shot analysis easy and reproducible.

It subsumes the capabilities of (a non-exhaustive list):

Here’s a taster of what using it looks like:

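A rough sketch of a session (the data set name and URL here are placeholders):

```julia
julia> using DataToolkit

# In the data REPL (see the docs for how to enter it):
data> add iris https://example.com/iris.csv

# Back at the julia> prompt, load the newly declared data set:
julia> d"iris"
```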

See the Introduction in the docs and/or my JuliaCon23 presentation [slides] to learn more :slight_smile:

I’ve just released v0.7 :tada:, and pending development/feedback I plan on tagging 1.0 late this year. I have a few things I want to do before then, such as:

  • Changing the checksum to KangarooTwelve/BLAKE3
  • Settling on a public API
  • Writing more docs
  • Adding more tests
  • Supporting more storage backends (artifacts, S3)

If this sounds of interest, please give it a whirl and let me know if you have any feedback; I’d be keen to hear your thoughts and experiences with this :grinning:.

49 Likes

This looks very interesting! Could you give a quick summary of the similarities/differences compared to DrWatson.jl?

5 Likes

To quote George:

the mission with DrWatson is to keep things as simple as possible. And DrWatson is not so much a data management system, but a “scientific project assistant” let’s say.

So it’s a bit like being asked to compare DataFrames.jl and MLJ.jl — both “do things” with tables but have entirely different concerns.

That said, this could probably be used in a DrWatson project, and maybe even integrated more directly.

1 Like

How does this system deal with the versions of the loader packages? It seems to me like when CSV is needed, you’re prompted to install it in the usual way, giving you a manifest entry. But the Data.toml doesn’t know about the version, so it would not be enough to be sure you get the same data on a different machine after deserialization. I guess checksumming can only apply to the raw bytes, so how can a user be sure that data is loaded as originally intended? Or is the system intended to work together with your manifest, so that only the combination of Data.toml plus Manifest.toml suffices to describe what you did?

1 Like

Exactly what you suppose at the end. Data.toml is expected to live within a Julia project (next to the Project.toml and Manifest.toml). We let Pkg.jl do the excellent job it already does at reproducing package environments, and together with DataToolkit you get complete reproducibility of a project that involves loading data :slight_smile:
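Concretely, reproducing such a project on another machine looks roughly like this (a sketch: mydataset is a placeholder, and it assumes the Data.toml sits in the activated project directory):

```julia
using Pkg
Pkg.activate("path/to/project")  # holds Project.toml, Manifest.toml, and Data.toml
Pkg.instantiate()                # restore the exact package versions from the manifest

using DataToolkit                # picks up the project's Data.toml
d"mydataset"                     # fetch (if needed), verify, and load the data
```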

Hmm, maybe it would be good to have an example repo showing a data project and a data package built using DataToolkit :thinking:. For the first of these, I could just save the end result of the tutorial.

3 Likes

Ok, so it’s less of a standalone data description than I first thought. But I guess that makes sense, given the many options data-loading packages expose, which can also be subject to change.

1 Like

The only truly “standalone” way of handling data is a tool that never changes and just does basic downloads and maybe a checksum on a pile of bytes.

If you want to go beyond this (as DataToolkit tries to), you need to grapple with the fact that the data becomes intertwined with the way it is loaded/processed. Some tools (e.g. DataLad) let you make this reproducible by containerising the processing step(s), but if we indulge ourselves and make a Julia-specific tool we can just use Pkg.jl :smiley:.

It’s also worth noting that the version of DataToolkit (+ dependencies) will itself also be managed by Pkg.jl of course, and in this way everything is (IMO) rather nicely bundled up.

In future, it could be nice to have a way to concatenate {Project,Manifest,Data}.toml into a single file that would be as portable as you seem to be thinking of. That said, .zip and .tar files already exist…

All that said, barring drastic changes to the loaders used (e.g. CSV.jl), the data files should be rather portable, and if you want to go a step further you can create a package that has a Data.toml along with a Project.toml. That’s a pretty good way of sharing datasets (since you can then just do using SomeDataSets and then … use them).
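As a sketch of that idea (the package name is illustrative, and it assumes loadcollection! can be pointed at the package’s Data.toml):

```julia
# SomeDataSets/src/SomeDataSets.jl
# Project.toml declares DataToolkit as a dependency, and Data.toml
# sits next to it in the package root.
module SomeDataSets

using DataToolkit

function __init__()
    # Register this package's Data.toml so that its data sets are
    # available to anyone who does `using SomeDataSets`.
    DataToolkit.loadcollection!(joinpath(pkgdir(@__MODULE__), "Data.toml"))
end

end
```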

6 Likes

I have a few datasets that I use across multiple projects and, depending on the project, I only ever need to load a subset of all the available columns. When loading a dataset with DataToolkit.jl, is there a way to tell it which columns of a tabular dataset I want to load? Something like the select keyword argument from CSV.jl?

Not OOTB, but there are two ways you could do so:

  • Via derived data sets (e.g. create a-1 which is column 1 of a; see the sketch after this list)
  • Via a plugin that adds that behaviour (plugins can do a lot)
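To illustrate the first of these (a sketch: it assumes the data set loads as a DataFrame, and a-1 is a hypothetical derived data set):

```julia
using DataToolkit, DataFrames

# Without a derived data set: load `a` in full, then select manually.
col1 = select(d"a", [1])

# With a derived data set `a-1` declared in Data.toml, that becomes:
col1 = d"a-1"
```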
1 Like

Great work! I really like the data REPL - very creative. The JuliaCon talk is great too, easy to follow - the part at 10:00 is awesome.

DrWatson.jl is more about creating well-organized, reproducible environments for performing data analysis. It builds off the package manager and uses a custom environment to ensure all calculations can be replicated in a consistent package environment. DataDeps.jl can be used to import the data into a DrWatson project. This can be very convenient when the data is bulky and/or must be stored according to a journal’s or your organization’s requirements. As a government scientist, I publish the data in my organization’s repository (made findable through data.gov) and then a DrWatson project on GitHub to process the data.

They are both super useful.

5 Likes

I agree. This sounds very promising. I’d missed this announcement post, but I’ll try to read it in more detail and give more feedback. I am also happy to help with better/more/simpler integration between the two packages!

5 Likes

With v0.9 just released, DataToolkit continues to support more file formats, with fewer bugs :stuck_out_tongue:.

If you haven’t tried it out, or did before but quickly ran into an issue, I’d encourage you to give it another look. A few people have raised bugs/usability issues on GitHub, and that’s helped me improve the state of the project :slightly_smiling_face:.

14 Likes

A lot of work is going into the next release (v0.10). To give a sneak peek (and since there’s no changelog :sweat_smile:), here’s what I’ve done so far:

  • Bumped the minimum Julia version to 1.9 and embraced package extensions
  • Replaced @import with @require: d9e9226
  • Removed the SmallDict type (the original issue is a lot better with Memory in 1.11): 98a6723
  • Improved type inference/logic: 389df28
  • Separated the REPL mode and the Store out into new packages
  • Renamed DataToolkitBase to DataToolkitCore
  • Split DataToolkit into a more user-facing DataToolkit and a package-facing (new) DataToolkitBase
  • Moved all these packages into a DataToolkit.jl monorepo
  • Improved load time and precompilation
  • Added support for opening files as a FilePathsBase.AbstractPath
  • Added support for (basic) S3 downloads
  • Added more image types (gif, webp)
  • Moved the logging capability from DataToolkitCommon to DataToolkitCore; it works a bit differently (and IMO better) and is now configured via Preferences
  • Added support for working with directories as well as files, without compromising data integrity (thanks to cached Merkle tree checksumming): e413116
  • WIP documentation improvements

Of these, I’d say the “headline” changes would be:

  • Directory support
  • Package restructuring to make it better suited to use from other packages, via a new DataToolkitBase that doesn’t provide the data> REPL
  • The move to a monorepo

There’s still a good bit of work needed before I’m confident enough to cut the v0.10 release. In particular, I’m worried about new bugs with all the code changes. Once I’ve tinkered and tested this a bit more, I’ll see about doing so.

In the meantime, this is a great time to request design/feature/API tweaks and report bugs! Don’t hesitate to shoot me a message here, on Slack, or on Zulip :slightly_smiling_face:.

19 Likes

Hi, @tecosaur!

DataToolkit looks great, I’ll give it a spin some time in the future.


The monorepo layout you went for really piques my interest. I’m thinking of starting a similar monorepo in my work, but I have serious doubts about how well Julia supports this layout. Perhaps you can shed some light on how you tackled the following issues:

  • There is a so-called package directories environment, but the support doesn’t go far enough: Pkg doesn’t seem to be aware of this environment at all. In particular, I have problems with:

    • Adding a sibling package A in the same package dir environment as a dependency of another package B. Pkg.add doesn’t work, because Pkg is unable to find the relevant sibling package. Pkg.dev works, but it has other issues (see below), and doesn’t need the package dir environment in the first place.
    • Using packages from this environment fails when these packages have external dependencies (even if they are registered packages in the General registry).
  • One can use Pkg.dev to refer to packages within the monorepo, but that information is stored in the Manifest.toml file, which thus needs to be committed, and it’s also intransitive (i.e., a third package C using B will also have trouble finding A, unless the latter is added explicitly with dev).

I see in your monorepo source that you refer to sibling packages just like any other package, so I assume you plan on registering them all in the General registry. Which is fine, I guess, but then how do you work on WIP changes that affect multiple packages? They can’t be registered until the changes are merged, but they don’t compile (and thus cannot be tested and merged) until the new versions are registered in the registry. Seems to be a Catch-22 to me. Do you use lots of Pkg.devs while in development? Or do you have an internal test registry to which you register versions still in development?

Any insight is welcome!

The monorepo actually makes this easier, because in your CI run you first check out the repo (meaning the whole repo, so all the sibling packages with their changes from that branch). You just have to dev them first, because just running ]test would not pick them up from the local folder.
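For example, the test-setup step in CI might look something like this (a sketch: the sibling package names and relative paths are illustrative):

```julia
using Pkg

# Working from one package's directory in the checked-out monorepo,
# dev the sibling packages from the local checkout so that tests run
# against this branch's versions rather than the registered ones.
Pkg.activate(".")
for sibling in ("Core", "Store")
    Pkg.develop(path=joinpath("..", sibling))
end

Pkg.test()
```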

In future Julia versions, I think one will be able to specify sources for packages in a Project.toml, which should enable a simpler workflow where you specify directly that your test environment depends on the local packages around it.

2 Likes

Hi! I just learned about this and I’m trying it out. I have rolled my own package to organize the bazillion datasets I work with, and I’m ready for something much smarter – this looks great!

Questions coming to me after reading the docs:

What happens when I’ve added thousands of datasets?
What I mean is, any project might need to work across several arbitrary datasets. Therefore, with a new ticket, I would start a new project folder and would like to quickly reference any one of my datasets without doing this every time:

data> add PBB-198.customer_submits_msa_checkbox s3://muh_data/csv/dataset=pbb_198.customer_submits_msa_checkbox/rundate=2024-07-23/platform=mobile/data.csv

So I guess if I’m working in subdirectories of a single Data.toml project, that file is going to grow to megabytes in size. Will that be an issue?

Can it handle partitions?
You mentioned ‘directory support’; does that mean datasets of multiple files across directories that represent partitions?

s3://muh_data/csv/dataset=customer_orders/rundate=2024-07-23/platform=mobile/data.csv

s3://muh_data/csv/dataset=customer_orders/rundate=2024-07-23/platform=web/data.csv