DataDeps and Google Drive

question

#1

Hi all (and hopefully oxinabox)!

I’m trying to access data on a Google Drive using the excellent DataDeps.jl.

The drive has been shared with me (so I don’t own it) and the data in it isn’t zipped (there are multiple folders in the shared parent folder). The data in it changes but very sporadically. While I could download the whole thing and work from there, it would be better if I could “access” it with DataDeps. I then could delete the local repository once in a while to keep it fresh.

Has anyone managed to get DataDeps to work with Google Drive?

Thanks!


#2

I take it you don’t have a public URL for it?
It is private, authed to you?
I am sure this can be made to work by replacing the fetch_method with something that does auth.
I am just not sure exactly how.

Of course, you can always use a ManualDataDep and external syncing.
While that misses out on most of the advantages, it does leave you an easy path forward once we work out how to automate it, since only the registration block would change.
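
A minimal sketch of the ManualDataDep route, assuming the shared folder is kept in sync by some external tool. The name "SharedDriveData" and the message text are placeholders chosen for illustration.

```julia
using DataDeps

# Register a dependency that DataDeps will NOT download itself.
# The message is shown to the user when the data cannot be found locally.
register(ManualDataDep(
    "SharedDriveData",   # placeholder name
    """
    This data lives in a shared Google Drive folder.
    Sync or download that folder yourself and place its contents in a
    directory named `SharedDriveData` on the DataDeps load path.
    """
))

# Code that needs the data would then refer to it as:
#   datadir = datadep"SharedDriveData"
```

Code that uses `datadep"SharedDriveData"` should not need to change if the registration is later swapped for an automated DataDep.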


#3

If you are on Linux, the easiest option might be to make a function that wraps run(`grive2 -s $remote`),
plus some combination of cd and mv to make it work roughly like download(remote, local).
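
Roughly what such a wrapper might look like, as a sketch only. The helper name fetch_with_cli is made up for illustration, and the command is passed in as a function so that nothing here assumes a particular tool's flags. The grive2 call in the usage comment is copied from the post above and is untested.

```julia
# Hypothetical helper: run an external sync tool inside a scratch directory,
# then move whatever it produced into `localdir`, so the whole thing behaves
# roughly like download(remote, local).
function fetch_with_cli(make_cmd, remote, localdir)
    mkpath(localdir)
    mktempdir() do scratch
        # Run the tool with the scratch directory as working directory.
        cd(scratch) do
            run(make_cmd(remote))
        end
        # Move every top-level entry the tool created into localdir.
        fetched = String[]
        for name in readdir(scratch)
            dest = joinpath(localdir, name)
            mv(joinpath(scratch, name), dest; force = true)
            push!(fetched, dest)
        end
        return fetched
    end
end

# Possible usage (untested, assumes the tool is installed and authenticated):
#   fetch_with_cli(r -> `grive2 -s $r`, "Demo", "/path/to/local")
```

Whether a real sync tool is happy to run in a fresh scratch directory depends on where it keeps its credentials, so the cd step may need to point at an already-authenticated directory instead.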

http://yourcmc.ru/wiki/Grive2

Or maybe this command-line tool.
I’ve not used it, but it seems better maintained than grive2: https://github.com/odeke-em/drive/


#4

This seems like a totally acceptable solution for now. I’ll do that, to start with.

This will need to work on all three platforms.

Thanks @oxinabox! Ping here again if and when a complete solution comes up!


#5

So I did some poking.
It is actually really easy to download a file from Drive IF you can sort out auth.
You can get the file ID out of the webpage (inspect element shows it as a data field in the list of files, and you can get it a few other ways),
which basically gives you a URL.
See: https://developers.google.com/drive/api/v3/reference/files/get

The problem is setting up auth.
OAuth 2.0 is just a really annoying process to set up.
I mean, it is as nice as it can be while still being secure, but that is not nice.
If we had a good OAuth 2.0 library, this would be doable.
I am not aware of one.
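
For reference, the "ID gives you a URL" part might look like the sketch below. It assumes an OAuth 2.0 access token has already been obtained by some other means and stored in an environment variable. The names drive_media_url, drive_fetch and GDRIVE_TOKEN are invented here. The alt=media query parameter is what asks files.get for the file contents rather than its metadata. Native Google formats (Docs, Sheets) cannot be fetched this way and need an export call instead.

```julia
using Downloads  # standard library since Julia 1.6

# Build the Drive API v3 download URL for a given file ID.
drive_media_url(file_id) =
    "https://www.googleapis.com/drive/v3/files/$(file_id)?alt=media"

# Hypothetical fetch function with the (remote, localdir) shape that a
# DataDeps fetch_method takes. The local file is simply named after the ID.
function drive_fetch(file_id, localdir; token = ENV["GDRIVE_TOKEN"])
    dest = joinpath(localdir, file_id)
    Downloads.download(drive_media_url(file_id), dest;
                       headers = ["Authorization" => "Bearer $token"])
    return dest
end
```

This deliberately leaves out the hard part, which is obtaining and refreshing the token.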


#6

Thanks @oxinabox!


#7

I found this kind of by accident.
It might be possible to RCall it.
Then wrapping its drive_download into a fetch_method
would be easy.
https://googledrive.tidyverse.org/index.html


#8

Wow, that sounds like a quick solution. I’ll look into that. Thanks!


#9

Tried it. It most probably could be made to work, but:

  1. googledrive has libcurl4-openssl-dev as a dependency, so build dependencies need to include that.
  2. there is no native solution for recursively downloading a folder (except this code snippet).
  3. because of the hell that is native Google file types, a folder can contain two files with identical names (google-drive-ocamlfuse solves this by slapping an extra file extension on the ambiguous ones, e.g. a file.csv that is a Google spreadsheet becomes file.csv.csv when downloaded). Trying to download such identically named files with googledrive throws an error.

#10

I guess that ideally we would have some Julian API to deal with google drive (and other such popular solutions) and once that API is solid DataDeps could use that to provide the same functionality it has for other repos. The API R has for google drive is lacking (e.g. we can’t even download a folder). So I’m not sure you want to build and rely on that. But I guess it’s better than nothing at all.


#11

Right, concept proved.
This works with DataDeps.jl
It wraps PyDrive.

It throws a lot of warnings, because DataDeps.jl more or less assumes that remotes are HTTP URLs represented as Strings, but its fallbacks kick in, so it does deal with it.
Though it seems like these overly tight constraints are a bit of a minor bug.

Another thing is that while the download of the files is lazy,
the download of the file names in the registration block is eager.
I think the simple solution to that is to use some kind of LazyVector type.

https://github.com/oxinabox/PyDrive.jl/blob/master/source/proto.ipynb

Right now it is almost all in that notebook rather than in a proper Julia repo.
If you or someone else wants to take the PyDrive wrapping stuff and make it into a proper Julia repo, that would be cool.
I don’t know when (or if) I’d have time to work on this again.

Having the concept proven is pleasing to me.


DataDeps itself doesn’t deal with the idea of downloading a folder very well,
because most of the time when you want to download a folder, that means downloading a tarball or a .zip, which keeps structure. Otherwise you are downloading a collection of files with no structure to where they end up (except as imposed by applying mv as a post_fetch_method).
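
The LazyVector idea mentioned earlier in this post could be sketched along these lines. This is only an illustration of deferring the file listing until it is first needed. The type is hypothetical, and whether DataDeps.jl would accept it as a remote path has not been verified.

```julia
# Hypothetical lazy vector: the listing function runs only on first access.
mutable struct LazyVector{T} <: AbstractVector{T}
    producer::Function                  # zero-argument function returning the items
    items::Union{Nothing, Vector{T}}    # cached result, nothing until first use
end

LazyVector{T}(producer::Function) where {T} = LazyVector{T}(producer, nothing)

# Run the producer once and cache its result.
function materialize!(v::LazyVector{T}) where {T}
    if v.items === nothing
        v.items = convert(Vector{T}, v.producer())
    end
    return v.items::Vector{T}
end

# Minimal AbstractVector interface, each call materializes on demand.
Base.size(v::LazyVector) = size(materialize!(v))
Base.getindex(v::LazyVector, i::Int) = materialize!(v)[i]
Base.IndexStyle(::Type{<:LazyVector}) = IndexLinear()

# Usage: nothing is listed at construction time.
calls = Ref(0)
names = LazyVector{String}() do
    calls[] += 1                 # stands in for a slow remote listing
    ["a.csv", "b.csv"]
end
```

Note that anything that inspects the vector, including printing it at the REPL, would trigger the listing.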


#12

Wow! Great work! This seems like the best version of how to make this work!


#13

I’m trying to understand how you mean for people to use this. Do you mean that people that want to use DataDeps with a Google Drive will need to include PyDrive.jl (after we clean it up and all) for it to work?
Kind of like this?

using PyDrive, DataDeps
register(DataDep("GoogleDriveDemo",
                 "Demonstration of google drive",
                 list_files_in_folder("Demo"),
                 fetch_method = drive_download))

#14

Yes, exactly.

Optionally, put Any in the checksum argument position to bypass the checksum checking
and suppress the warning about no checksum being given.

They could also use PyDrive.jl for other things without DataDeps.jl being involved,
if it supported that.
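
A sketch of the registration with Any in the checksum position. The two functions below are stand-ins written only so the block runs by itself. They mimic the shape of list_files_in_folder and drive_download from the prototype but do not talk to Google Drive.

```julia
using DataDeps

# Stand-in for the prototype's folder listing (hypothetical file names).
list_files_in_folder(folder) = ["$folder/a.csv", "$folder/b.csv"]

# Stand-in fetch method: takes (remote, localdir) and returns the local path.
function drive_download(remote, localdir)
    dest = joinpath(localdir, basename(remote))
    write(dest, "placeholder for $remote")
    return dest
end

register(DataDep(
    "GoogleDriveDemo",
    "Demonstration of google drive",
    list_files_in_folder("Demo"),
    Any;                            # checksum position: accept anything, no warning
    fetch_method = drive_download,
))
```

Skipping the checksum makes sense here because the Drive contents change from time to time, so a fixed hash would go stale.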


#15

Yea, so PyDrive will be that google drive API I mentioned before. And DataDeps will be able to work with that. Nice.