ANN: DataDeps.jl: BinDeps for Data

Storing data on git isn't great.
But yes, tagging a specific commit as the remote URL would (for git) at least let you know the data would never change.
It's less solving the problem and more avoiding it in the first place, which is better.
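
For concreteness, here is a hedged sketch of what that looks like with DataDeps.jl; the package name, repo, commit hash, and file name are all made-up placeholders:

```julia
using DataDeps

# Sketch: pinning the remote to a specific commit's raw URL means the bytes
# behind it can never silently change. Names, URL and hash are placeholders.
register(DataDep(
    "MyDataset",
    "Example CSV, pinned to a single git commit of someuser/somerepo.",
    "https://raw.githubusercontent.com/someuser/somerepo/1234567890abcdef1234567890abcdef12345678/data.csv",
    # If you omit the checksum, DataDeps computes one on first download and
    # tells you what to paste in here.
))
```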

Since the package only handles downloading data, what happens if the
URI link changes (dead link, moved data, or new data gets appended,
etc…)?

When someone new uses the repo (or via CI tests; check out Travis cron to schedule periodic test runs):
A dead link should be picked up pretty quickly, and you can handle it with a bug report/email and the normal channels.
Similarly, data changes will result in checksum failures and should be detected pretty quickly by the same means.
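
As a sketch of how CI catches that: a test can simply resolve the registered DataDep, so a fresh machine or a scheduled build re-downloads it and fails loudly on a dead link or a checksum mismatch. The dep name here is the hypothetical one from the sketch above; `DATADEPS_ALWAYS_ACCEPT` is the environment variable DataDeps checks to skip the interactive prompt.

```julia
using Test, DataDeps

# Skip the interactive accept prompt on CI.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"

@testset "remote data still resolvable" begin
    dir = datadep"MyDataset"             # hypothetical dep from the sketch above
    @test isfile(joinpath(dir, "data.csv"))
end
```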

@mauro3:

Could rsync be used? Then data from any machine with SSH access can be downloaded. To not have to deal with passwords, you could support only password-less SSH.

Yes, probably, using some wrapper around rsync or scp rather than Base.download as the fetch mechanism (see the sketch below).
Or you could just use SSHFS to mount the remote directory locally and use a ManualDataDep, but then there is no easy CI.
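
Something like this sketch is what I mean; it assumes the `fetch_method` is handed the remote path plus a local directory and should return the path it saved to, it relies on password-less SSH keys, and the host, path, and dep name are made up:

```julia
using DataDeps

# Sketch of an scp-based fetch, as an alternative to Base.download.
# Assumes DataDeps calls fetch_method(remote_path, local_dir) and expects
# the local file path back.
function scp_fetch(remote, localdir)
    localpath = joinpath(localdir, basename(remote))
    run(`scp $remote $localpath`)       # needs password-less SSH
    return localpath
end

register(DataDep(
    "MySSHData",                                      # hypothetical name
    "Example data pulled over SSH from a lab machine.",
    "user@host.example.com:/data/measurements.csv",   # hypothetical remote
    fetch_method = scp_fetch,
))
```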

@piever

Concerning the privacy, the shareable link from Google Drive seems like a reasonable option: do you think it can also be made to work to use a link to a folder in Google Drive and then it would download the whole content? Otherwise getting the link for each file could be painful.

Maybe. Easiest would be if you can get a shareable link to the folder exported as a zip.
A Google search suggests that used to be possible, but I don't know if it still is today.
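
If such a link exists, using it would look something like this sketch; the URL is a made-up placeholder for whatever shareable link Drive gives you, and `unpack` is the archive-extracting helper DataDeps exports:

```julia
using DataDeps

# Sketch: download the folder as a single zip and extract it after fetching.
register(DataDep(
    "MyDriveFolder",
    "Contents of a shared Google Drive folder, downloaded as one zip.",
    "https://drive.google.com/uc?export=download&id=PLACEHOLDER_FILE_ID",
    post_fetch_method = unpack,   # unzip into the datadep directory
))
```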

It would be good to add a GoogleDrive generator to DataDepsGenerators.jl, to get all the files.

And the other option is again to synchronize externally and use a ManualDataDep.
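
For reference, a ManualDataDep is just a name plus instructions; DataDeps never downloads it, it only resolves the path once you (or your sync tool) have put the files in place. A sketch with made-up names:

```julia
using DataDeps

# Sketch: the message below is what the user sees if datadep"MyManualData"
# can't find the files locally.
register(ManualDataDep(
    "MyManualData",
    "Sync the shared folder into a directory named MyManualData by hand, e.g. with rsync or a Drive client."
))
```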
