ANN: DataDeps.jl: BinDeps for Data

I’ve now released DataDeps.jl,
which is a package for automating the setup of data that a package or scientific script might depend upon.

See the Repository: https://github.com/oxinabox/DataDeps.jl
and the demo blog post: DataDeps.jl -- Repeatable Data Setup for Repeatable Science

This is v0.2.0. It’s had one solid round of improvements thanks to @Evizero setting it up for MLDatasets.jl (not quite merged yet).
DataDeps.jl is in many ways an extension of @Evizero’s earlier download system.


This is amazing! Awesome work, and something I’m sure will prove very useful (and necessary) for many scientific research projects. Thank you!

This sounds super useful, but I want to be clear about how this works.

You store the data on some repository online, and it can be downloaded via the url.

Then when someone calls the readdir function, it takes the data off of the web and stores it in a temporary folder somewhere on your system. Then the file path datadep refers to that temp folder?

Really happy to see a proper package for that. Thanks for doing all the heavy lifting!

Not quite. readdir is not altered. All the magic happens when datadep"mydataset" is executed, which afterwards simply returns the local path as a string.

Concerning the local folder, see the description of the DATADEPS_LOAD_PATH environment variable in the README (https://github.com/oxinabox/DataDeps.jl). In short, it’s not really intended to be temporary.
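
To make the pattern concrete, here is a minimal usage sketch (the dataset name and file are made up):

```julia
using DataDeps

# On first use, datadep"..." prompts and downloads if the (registered)
# dataset is not found locally; afterwards it just returns the path.
dir = datadep"MyDataset"                  # e.g. "C:/ProgramData/MyDataset"
file = joinpath(dir, "observations.csv")  # then use ordinary file operations
```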

Looks really nice! One thing I didn’t manage to understand: is there a way to set it up to work with data that is not publicly accessible? I know that seems to defeat the purpose of a package for reproducible science, but I wanted to understand whether the same tool can be used in the prepublication phase (in the case where the person performing the analysis also collected the data) to make the analysis reproducible within the lab, for example.


Based on this, it seems that we will still run into the problem of global file paths. If I want to store the data somewhere other than C:/ProgramData on my Windows machine, then I need to make sure that my global file path is the same as my coworker’s, or at least change it back and forth depending on who is running the code?

Yes, to an extent. AFAIK there is no logic for dealing with login credentials or anything like that (if that was your question), but it does support working with data that needs to be downloaded manually by a user.

I am not sure I follow. If you are using data from the web (i.e. you let DataDeps do the downloading), then you and your co-worker can (and probably will) use very different local paths. This is all handled by DataDeps.

On the other hand, if you wish to share data on a computer with different users, you can also manually put the data into some shared folder and add an entry for that folder to DATADEPS_LOAD_PATH.

is there a way to set it up to work with data that is not publicly accessible?

It is a good question.
I did a bit of thinking about that yesterday (I even drew a flow chart, but I haven’t released it).
The package is primarily concerned with handling the easy and fairly common case where the data is static and public.

If it is private, but not confidential, then I am thinking that a secret URL is probably fine. Things like Google Drive, Dropbox, and (at least my) university’s data store offer those as a sharing option.
Slightly more secure than that would be putting the data on a local webserver that is firewalled to only allow local connections.

Beyond that, I believe it should be possible to modify the fetch_method (which you can already do on a per-datadep-registration level) to use something that does auth (rather than Base.download).
I believe basic HTTP auth wouldn’t be hard to set up. Something more complex like OAuth probably would be.
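
For illustration, a rough sketch of what the basic-auth route could look like; the fetch_method signature, env-var names, and URL here are assumptions, not a tested recipe:

```julia
using DataDeps, HTTP, Base64

# Hypothetical fetch_method doing HTTP Basic auth instead of Base.download.
# Assumed signature: (remote_path, local_dir) -> path of the saved file.
function fetch_basic_auth(remote_path, local_dir)
    auth = "Basic " * base64encode(ENV["MYDATA_USER"] * ":" * ENV["MYDATA_PASS"])
    resp = HTTP.get(remote_path, ["Authorization" => auth])
    localpath = joinpath(local_dir, basename(remote_path))
    write(localpath, resp.body)
    return localpath
end

# Checksum omitted in this sketch; DataDeps will warn about unverified downloads.
register(DataDep(
    "MyPrivateData",                                   # hypothetical dataset
    "Private data; set MYDATA_USER and MYDATA_PASS first.",
    "https://example.com/private/data.csv";
    fetch_method = fetch_basic_auth,
))
```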

I’d be interested in talking to anyone who is working in the “gather data, run code, repeat, publish” kind of area and trying to get this going.
I’m in a “use an external (standard) dataset, run code, repeat, publish” area, just adding more data sets, not more data.
So my notions could be off for those cases.
One thing for sure is that DataDeps.jl doesn’t know when you update your remote data-source.
It only attempts a download if it can’t find a local copy of the folder.

Of course the other thing to do is set up a ManualDataDep, and have a mounted networked filestore in your DATADEPS_LOAD_PATH.
Then that is all easy.
Then once you are about to publish, upload the data to some service like FigShare, and change the registration block to a normal (automatic) DataDep, as sketched below.
Thinking about it, that is probably the better workflow.
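
Sketched out, the before/after registrations might look something like this (the names and URL are made up, and the register(DataDep(...)) API is assumed):

```julia
using DataDeps

# Prepublication: the data sits in a shared/mounted folder that is on the
# DATADEPS_LOAD_PATH, so DataDeps only locates it, never downloads it.
register(ManualDataDep(
    "LabStudy2018",   # hypothetical name
    "Ask a lab member to copy the data into a folder named LabStudy2018."
))

# At publication: swap in an automatic DataDep pointing at the uploaded
# archive (e.g. on FigShare), so anyone can fetch it.
# register(DataDep(
#     "LabStudy2018",
#     "Data for our 2018 study; see the paper for details.",
#     "https://figshare.com/ndownloader/files/0000000"  # hypothetical URL
# ))
```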

There are no global file paths in your program’s code.
Your code just contains various datadep"Data set name/file" strings.
DataDeps.jl resolves those into file paths,
by searching for "Data set name" in the DATADEPS_LOAD_PATH (which can be, and normally is, a list of directories) and installing if it is not found.

Your DATADEPS_LOAD_PATH is an environment variable; it is set per environment, i.e. per computer.
It does default to a list which includes C:/ProgramData.
But I can go into my Windows settings and add D:/ResearchData,
my coworker could add H:/Data,
and my other coworker can add /mnt/NAS/DATA.
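
In code terms (a hedged sketch; the variable is normally set in the OS environment settings rather than in the script):

```julia
# Each machine points DATADEPS_LOAD_PATH at its own storage;
# the analysis code itself never changes.
ENV["DATADEPS_LOAD_PATH"] = "D:/ResearchData"  # me
# ENV["DATADEPS_LOAD_PATH"] = "H:/Data"        # coworker 1
# ENV["DATADEPS_LOAD_PATH"] = "/mnt/NAS/DATA"  # coworker 2

using DataDeps
dir = datadep"MyDataset"  # resolves against whatever load path this machine has
```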

Thanks for asking; is that clear enough?
I should update the docs to make this clearer.
I don’t think they currently list the default locations.

This is clearer.

My understanding initially was that the DATADEPS_LOAD_PATH was determined in the module, in which case you would have to update it and the amount of work would be the same as with global constants representing file paths. I clearly need to work more with setting up working environments. I think the docs are clear.

Completely agree. I’m more in a “collect data, look at it, collect more data, look at it, eventually publish” workflow. Having to update manually when there’s more data doesn’t seem like a major concern if it is as simple as rm(datadep"MyData", recursive=true, force=true).

Concerning the privacy, the shareable link from Google Drive seems like a reasonable option: do you think it could also be made to work with a link to a folder in Google Drive, so that it would download the whole content? Otherwise getting the link for each file could be painful.


This looks cool, thanks!

Concerning the “private data” issue. Could rsync be used? Then data from any machine with SSH access can be downloaded. To not have to deal with passwords, you could support only password-less SSH.


@oxinabox, thanks for the package! 👍

If it is private, but not confidential, then I am thinking that a secret URL is probably fine. Things like Google Drive, Dropbox, and (at least my) university’s data store offer those as a sharing option.

We too have a permutation and combination of GitLab, local/external servers, and a [customized version without backups] of ownCloud for file sync/share; the storage pattern varies per team.

I’d be interested in talking to anyone who is working in the “gather data, run code, repeat, publish” kind of area and trying to get this going. I’m in a “use an external (standard) dataset, run code, repeat, publish” area, just adding more data sets, not more data. So my notions could be off for those cases.

I would be happy to talk off-list.

One thing for sure is that DataDeps.jl doesn’t know when you update your remote data-source. It only attempts a download if it can’t find a local copy of the folder.

Since the package only handles downloading data, what happens if the URI changes (dead link, moved data, new data gets appended, etc.)? For example: for a researcher using GitHub or GitLab[1] to store (meta)data, could tagging a specific commit suffice?

Again, thanks! 🙂

[1] Edit: Regarding reproducibility, there are some ongoing discussions on the GitLab bug tracker: “Make GitLab Repo Citable via data repositories” (GitLab FOSS issue #35023).

Storing data in git isn’t great,
but yes:
tagging a specific commit in the remote URL would (for git) at least let you know the data would never change.
It’s less solving the problem and more avoiding it in the first place, which is better.

Since the package only handles downloading data, what happens if the URI changes (dead link, moved data, new data gets appended, etc.)?

When someone new uses the repo (or via CI tests; check out Travis cron for scheduling periodic test runs):
a dead link should be picked up pretty quickly, and you can handle it with a bug report/email and the normal channels.
Similarly, data changes will result in checksum failures and should be detected pretty quickly by the same means.
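
That detection relies on the checksum given at registration; a minimal sketch (the hash value here is just the well-known SHA-256 of an empty file, for illustration):

```julia
using DataDeps

# If the remote file ever changes, the SHA-256 check fails at download
# time, so the user gets a loud error rather than silently different data.
register(DataDep(
    "ExampleData",   # hypothetical dataset
    "Example data, used here to illustrate checksum validation.",
    "https://example.com/exampledata.csv",
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
))
```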

@mauro3:

Could rsync be used? Then data from any machine with SSH access can be downloaded. To not have to deal with passwords, you could support only password-less SSH.

Yes, probably, using some wrapper around rsync or scp rather than Base.download as the fetch mechanism.
Or you could just use SSHFS to mount the remote directory locally and use a ManualDataDep, but then there is no easy CI.
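
A rough sketch of the rsync route, assuming the same (remote_path, local_dir) fetch_method signature as in the auth example above, and password-less SSH:

```julia
using DataDeps

# Hypothetical rsync-backed fetch_method; relies on SSH keys being set up.
function fetch_rsync(remote_path, local_dir)
    run(`rsync -az $remote_path $local_dir`)
    return joinpath(local_dir, basename(remote_path))
end

register(DataDep(
    "LabServerData",                        # hypothetical dataset
    "Fetched over SSH from the lab server.",
    "user@labserver:/data/results.csv";     # hypothetical remote path
    fetch_method = fetch_rsync,
))
```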

@piever

Concerning the privacy, the shareable link from Google Drive seems like a reasonable option: do you think it could also be made to work with a link to a folder in Google Drive, so that it would download the whole content? Otherwise getting the link for each file could be painful.

Maybe. The easiest would be if you can get a shareable link to the folder exported as a zip.
A Google search suggests that used to be possible, but I don’t know if it still is today.
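
If such a zip link can be had, a sketch like this would cover it (assuming the unpack post_fetch_method helper that later DataDeps versions export; the name and link are made up):

```julia
using DataDeps

register(DataDep(
    "DriveFolder",   # hypothetical dataset
    "A folder shared from Google Drive, exported as a single zip.",
    "https://drive.google.com/uc?export=download&id=FILE_ID";  # hypothetical link
    post_fetch_method = unpack,  # assumed helper: extracts the archive after download
))
```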

It would be good to add a GoogleDrive generator to DataDepsGenerators.jl, to get all the files.

And the other option is, again, to synchronize externally and use a ManualDataDep.


Not in reply to anyone

I am very tempted to add an HTTP.jl (and thus a BinDeps.jl and MbedTLS.jl) dependency so I can do away with the code in https://github.com/oxinabox/DataDeps.jl/pull/22/files for determining the filenames of files being downloaded.

The problem is basically that HTTP lets the server specify the download filename in the headers (Content-Disposition); if not specified, it falls back to the last part of the URL.
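
The gist of the header-based approach, as a hedged sketch using HTTP.jl (this is not DataDeps’ actual code):

```julia
using HTTP

# Prefer the filename the server declares in Content-Disposition;
# otherwise fall back to the last segment of the URL path.
function resolve_filename(url)
    resp = HTTP.head(url)
    cd = HTTP.header(resp, "Content-Disposition", "")
    m = match(r"filename=\"?([^\";]+)\"?", cd)
    return m === nothing ? basename(HTTP.URI(url).path) : String(m.captures[1])
end
```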

That code is really hard to keep working on multiple platforms and when dealing with multiple webservers.
I’m not sure if HTTP.jl will be better (especially given the latter problem).

In general I am not super happy with the process of resolving filenames,
as it does not really work well if the remotepath is not an HTTP URL.
But I guess that is the majority use case, so if it is broken/awkward for other remotepath specifiers,
and someone actually uses those, issue reports can be made then.

I ended up doing this, as the code in https://github.com/oxinabox/DataDeps.jl/pull/22/files was just too hacky.