Today, instead of doing my work, I learned about the new artifact system, and I am absolutely fascinated.
I am currently working on a project that uses some data files, which I store in my OneDrive for Business folder. My module containing all functions also lives in my OneDrive, so I can use them by having a relative path between them, which is clunky but works.
However, the computations are quite expensive, so I intend to move to a server soon, which raises the question of how to set up everything that I need.
My thought just now was to use the new artifact system, add all my data files as artifacts, and lazily load them as I need them. I already saw that I can put the recipes for BinaryBuilder into my own private repo, as my modules already are. So I only need to set up ssh/deploy keys on the server to be able to clone easily from GitHub.
My questions however are thus:
- Where to store my data and how to securely access it:
The files are already on OneDrive; can I just put the OneDrive download link in the Artifacts.toml? Can I maybe even include some authentication so that not just anyone with the link can download from my drive? The data is not really cleared for general release.
- Is this even a good idea?
Given that time is always short, and fast, hacky results are sometimes more appreciated than slow, elegant ones, I wonder whether this makes sense to do. I would hope that once I learn how the artifact system works it will have been worth it, because it will make all future projects much faster to set up.
I apologize for asking without trying myself, but time is really short right now and if the system is not quite there yet, I may wait for now.
I suppose, once I am dealing with OneDrive authentication anyway, I could just have a download script that gets everything from OneDrive. However, I like the idea of having a nice, reproducible workflow, where I clone my packages on some server, run my script, and the data is loaded as needed.
Any thoughts on this are highly appreciated.
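For reference, the "add data files as lazy artifacts" idea can be sketched with the `Pkg.Artifacts` functions. This is only a sketch under assumptions: the artifact name `mydata`, the file `measurements.csv`, and the URL are placeholders, and the Artifacts.toml is written to a temporary directory rather than a real project. As far as I can tell, a download entry carries only a URL and a hash, so any access control would have to come from the URL itself.

```julia
using Pkg.Artifacts

# Hypothetical locations: point these at your project's Artifacts.toml
# and at wherever the tarball will actually be hosted.
artifacts_toml = joinpath(mktempdir(), "Artifacts.toml")
tarball_url = "https://example.com/mydata.tar.gz"  # placeholder URL

# Create an artifact from local data. The returned SHA1 is the tree hash
# of the directory contents. (This stores a copy in the artifacts depot.)
data_hash = create_artifact() do dir
    write(joinpath(dir, "measurements.csv"), "t,value\n0,1.0\n1,2.5\n")
end

# Pack the artifact into a tarball that can be uploaded to the host;
# the return value is the SHA-256 of the tarball itself.
tarball_path = joinpath(mktempdir(), "mydata.tar.gz")
tarball_sha256 = archive_artifact(data_hash, tarball_path)

# Record the artifact in Artifacts.toml, marked lazy so that it is only
# downloaded the first time it is requested.
bind_artifact!(artifacts_toml, "mydata", data_hash;
               download_info = [(tarball_url, tarball_sha256)],
               lazy = true, force = true)
```

Inside a package whose directory holds that Artifacts.toml, the data directory would then be reachable as `artifact"mydata"`; on newer Julia versions lazy artifacts additionally need `using LazyArtifacts`.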
I’m interested in this too. Presumably if the source API has ssh access, one could rely on key-based authentication, but the artifact system would have to have the ability to expect that.
Not to hijack the conversation, I’m wondering what the differences are between DataDeps.jl and the new artifacts system…
I would also like to know what the story is here. I think DataDeps regularizes the experience of accessing and documenting data for the user, but it seems like the difference from the new Artifacts system is mostly semantics.
@oxinabox can speak to this more directly, but as I understand, DataDeps was designed before Artifacts were available. There are a lot of overlaps in functionality, but IIRC, there are still some things DataDeps can do that artifacts can’t. For one thing, DataDeps are usable pre-1.3. I don’t recall the other things.
Cool, looking forward to @oxinabox’s input. If this is true then it sounds like there could be a new and improved DataDeps (which btw I love) that utilizes the Artifacts mechanism…
Thanks for the mention of DataDeps, I was not aware of it.
From what I can tell I see some differences (at least in the design idea?). Please feel free to correct me if I am wrong:
- DataDeps can be used in the REPL, artifacts have to be part of a module.
- Artifacts are garbage collected, so once a given artifact is not referenced anywhere anymore, it gets automatically removed after 30 days (or less if so configured).
- In terms of difficulty, artifacts seem significantly harder to set up, but that might just be inexperience.
I think for my use-case, DataDeps is more appropriate right now. However, the question of secure storage/access is still a problem.
- I suppose one could take onedrive-sdk-python, write a PyCall.jl wrapper around it, create a nice API, and extend the relevant methods in DataDeps. (Probably a lot of effort.)
- Alternatively, use a different Cloud Storage provider. Does anyone know a service that is nicely accessible, perhaps via SSH keys?
Are there any packages out there yet that use the artifacts just for data storage? I’ve read the original blog post about using artifacts but I’m still not confident that I could figure it all out without breaking something.
The Yggdrasil Repo mentions
We encourage Julia developers to use JLL packages for their libraries. Here are a few examples of pull requests switching to JLL package:
I think AWS and Azure (probably other enterprise cloud storage places) can be connected to with ssh keys, but probably don’t have free tiers. Depending on how sensitive the data is, you could maybe do security through obscurity and have a OneDrive link that’s not blocked but not public (in Google Drive the setting is “anyone with the link can access,” I don’t know about OneDrive), and then just treat the link like an ssh key.
Yes, one of the things DataDeps lets you do that Artifacts intentionally doesn’t is customize your transport mechanism, either by setting the fetch method or by using a path-like type that overloads the relevant download methods.
It’s relatively easy to add support for downloading via some other means.
You just set the fetch_method in the registration block.
Something like shelling out to the Linux secure copy tool looks like:
fetch_method = (remote_path, local_dir) -> (run(`scp myserver:$remote_path $local_dir`); joinpath(local_dir, basename(remote_path)))
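To make the scp idea concrete, here is a sketch of a complete registration block. DataDeps documents the fetch function as taking the remote path first and the local directory second, and returning the path of the downloaded file. The host `myserver`, the remote file path, the DataDep name, and the helper names `scp_command` and `scp_fetch` are all made up for illustration, and the hash argument is left out for brevity.

```julia
using DataDeps

# Build the scp command separately so it can be inspected without running it.
scp_command(remote_path, local_path) = `scp $remote_path $local_path`

# Custom fetch method: copy the remote file into `local_dir` over SSH
# and return the path of the local copy.
function scp_fetch(remote_path, local_dir)
    local_path = joinpath(local_dir, basename(remote_path))
    run(scp_command(remote_path, local_path))
    return local_path
end

# Registration only records the DataDep; nothing is downloaded yet.
register(DataDep(
    "MyProjectData",                            # hypothetical name
    "Private project data, fetched over SSH.",  # message shown to the user
    "myserver:/data/project/measurements.csv";  # hypothetical remote path
    fetch_method = scp_fetch,
))
```

The first time `datadep"MyProjectData/measurements.csv"` is evaluated, DataDeps would call `scp_fetch`, so authentication is handled entirely by whatever SSH keys are set up on the machine.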
Or you use a path type that overloads the download step, such as the one from AWSS3.jl, which works with DataDeps and with AWS’s auth system out of the box.
If you’ve used AWS S3 before, that would be my go-to choice; it gets used all the time at Invenia.
There is a more proof-of-concept example in PyDrive.jl for how to use PyCall to access Google Drive with their auth system and DataDeps.jl.
Artifacts incorporates some lessons learned from DataDeps, so there are many overlaps.
DataDeps is more flexible: it also allows custom post-processing, so you can use any file format, not just tarballs (and you don’t have to have all data inside tarballs).
Artifacts uses tree hashes of the post-unpacking file structure, which means it can check that artifacts decompressed correctly,
whereas DataDeps only uses a hash of the downloaded file, so it can only check that the download worked.
Artifacts use content addressing, so they can’t ever run into a name clash.
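The content-addressing point can be illustrated with a small sketch using `Pkg.Artifacts`. The helper `make_artifact` and the file name `data.txt` are made up for the example; note that running it stores two small artifacts in the local artifacts depot.

```julia
using Pkg.Artifacts

# Hypothetical helper: create an artifact holding a single small text file.
make_artifact(content) = create_artifact() do dir
    write(joinpath(dir, "data.txt"), content)
end

h1 = make_artifact("1,2,3\n")
h2 = make_artifact("1,2,3\n")   # identical content
h3 = make_artifact("4,5,6\n")   # different content

# The identifier is the tree hash of the unpacked contents, so identical
# contents map to the same artifact and different contents never collide
# on a name.
same_content_same_hash = (h1 == h2)
different_content_different_hash = (h1 != h3)

# Re-hash the files on disk and compare against the stored identifier.
unpacked_ok = verify_artifact(h1)
```

Because the artifact lives in a directory named after its hash, two projects that bind different names to the same data end up sharing a single copy on disk.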
I also would like to know if anyone has used Artifacts for data in the wild.
As we get more dedicated artifacts for just data it would be nice to keep track of this somewhere. This is really helpful when trying to test or document a package. Perhaps this could be the first “Julia Task View” (like the suggested CRAN task view idea from another thread here).