Using DataDeps for multiple files in multiple subdirectories

For my package, I want to download some data upon the user’s request and store them locally for future usage. A bit of research seems to indicate that DataDeps.jl by @oxinabox is the package to use, but I’m not sure if it can be used to solve my specific problem described below.

There is a website at https://website.com, and multiple data files are stored at

  • https://website.com/data/subA/A1.csv
  • https://website.com/data/subA/A2.csv

and

  • https://website.com/data/subB/B1.csv
  • https://website.com/data/subB/B2.csv

etc. The shared portion of all these paths is https://website.com/data. I would like to register this part as a DataDep and let users download and store the individual files in their respective subdirectories.

I hoped that something like

register(DataDep("Website.COM Data", "Data published in website.com", "https://website.com/data"))

and subsequent calls of

data_A1 = read(datadep"Website.COM Data" * "/subA/A1.csv")
data_A2 = read(datadep"Website.COM Data" * "/subA/A2.csv")
data_B1 = read(datadep"Website.COM Data" * "/subB/B1.csv")
data_B2 = read(datadep"Website.COM Data" * "/subB/B2.csv")

would download the files into ~/.julia/datadeps/Website.COM Data/, but it didn’t work.

Is there a way to achieve the goal described above using DataDeps or any other packages?

Correct, DataDeps needs a full list of files to fetch.
You need to list them all in the register block.
(DataDepsGenerators.jl can help with this, some of the time.)
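
To make "list them all" concrete, here is a minimal sketch for the layout in the question. The names `BASE_URL` and `LAYOUT` are made up for illustration, and the URLs are the placeholder ones from the question, so nothing can actually be downloaded from them. Registration itself does not touch the network.

```julia
using DataDeps

# Hypothetical description of the remote layout from the question:
# subdirectory => files it contains.
const BASE_URL = "https://website.com/data"
const LAYOUT = [
    "subA" => ["A1.csv", "A2.csv"],
    "subB" => ["B1.csv", "B2.csv"],
]

# Expand the layout into the flat list of full URLs that DataDeps needs.
urls = [join((BASE_URL, dir, file), "/") for (dir, files) in LAYOUT for file in files]

# No checksum is given here, so expect a warning when the files are first fetched.
register(DataDep(
    "Website.COM Data",
    "Data published in website.com",
    urls,
))
```

Registered like this, all four files would end up directly in the DataDep's folder, not in subA/ and subB/; the subdirectory part of the URL is not preserved on disk.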

RemoteFiles.jl might be a suitable alternative that works better for this use case.
I am not sure, as I haven’t tried it.

@oxinabox, thanks for your answer!

Could you explain how to register a DataDep such that subfolders are created, as mentioned in this documentation? I was not able to find an example there showing how to create subfolders. If I knew how to create them, I might be able to devise a way to achieve what I want.

I tried RemoteFiles, but it doesn’t seem to support creation of subfolders in the default location (the root directory of the package using RemoteFiles).

Here is an example:

using DataDeps
using SHA  # provides sha2_256

register(DataDep(
    "Pi3",
    "Some message",
    [
        "https://www.angio.net/pi/digits/10.txt",
        "https://www.angio.net/pi/digits/100.txt",
        [
           "https://www.angio.net/pi/digits/1000.txt",
           "https://www.angio.net/pi/digits/10000.txt",
           "https://www.angio.net/pi/digits/100000.txt"
        ]
    ],
    sha2_256,
    post_fetch_method = [
        # 1st applies to the 1st file, i.e. 10.txt
        filename -> mv(filename, joinpath(mkpath("ten"), basename(filename))),
        # 2nd applies to the 2nd listed file, i.e. 100.txt
        filename -> mv(filename, joinpath(mkpath("hundred"), basename(filename))),
        # 3rd applies to everything in the 3rd entry (the inner vector),
        # i.e. 1000.txt, 10000.txt, and 100000.txt.
        # Alternatively, a vector of 3 functions here would treat those files differently.
        filename -> mv(filename, joinpath(mkpath("lots"), basename(filename))),
    ]
))

readdir(datadep"Pi3")
readdir(datadep"Pi3/ten")
readdir(datadep"Pi3/lots")

The output at the end is:

julia> readdir(datadep"Pi3")
3-element Vector{String}:
 "hundred"
 "lots"
 "ten"

julia> readdir(datadep"Pi3/ten")
1-element Vector{String}:
 "10.txt"

julia> readdir(datadep"Pi3/lots")
3-element Vector{String}:
 "1000.txt"
 "10000.txt"
 "100000.txt"

In post_fetch_method you can run whatever code you like to derive the subfolder name from the filename. But the filename won’t have the subfolder embedded in it – blame RFC 6266 I guess.
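
As a rough sketch of "derive the subfolder name from the filename": since the downloaded name carries no subfolder, one option is a lookup table keyed on the base name. `SUBDIR_OF`, `move_to_subdir`, and the "misc" fallback are hypothetical names invented for this illustration, and the file names are the ones from the original question. The function only uses Julia's standard library.

```julia
# Hypothetical lookup table: downloaded file name => subfolder it belongs in.
const SUBDIR_OF = Dict(
    "A1.csv" => "subA",
    "A2.csv" => "subA",
    "B1.csv" => "subB",
    "B2.csv" => "subB",
)

# Move a downloaded file into its subfolder, created next to the file if needed.
# Files missing from the table go to "misc" (an arbitrary choice for this sketch).
function move_to_subdir(filename)
    subdir = get(SUBDIR_OF, basename(filename), "misc")
    target_dir = mkpath(joinpath(dirname(filename), subdir))
    return mv(filename, joinpath(target_dir, basename(filename)))
end
```

This could then be passed as `post_fetch_method = move_to_subdir` in the register block, in the same way as the anonymous functions in the Pi3 example.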


@oxinabox, thanks! This is closer to what I am trying to do.

I have one more question. When you have many files listed in one DataDep, it seems that reading one file from the DataDep downloads all the listed files. Is there a way to make it download only one file if that is the only file the user requests?

There is not
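
If the granularity matters, a possible workaround (a sketch only, not a feature of a single DataDep) is to register several smaller DataDeps, for example one per subdirectory, since each DataDep is fetched on its own. The names `BASE_URL`, `LAYOUT`, and `dep_names`, and the naming scheme "Website.COM Data subA", are assumptions made for this illustration; the URLs are the placeholders from the question.

```julia
using DataDeps

# Hypothetical layout from the question: subdirectory => files it contains.
const BASE_URL = "https://website.com/data"
const LAYOUT = [
    "subA" => ["A1.csv", "A2.csv"],
    "subB" => ["B1.csv", "B2.csv"],
]

# One DataDep per subdirectory, so accessing one of them
# does not pull in the files that belong to the other.
dep_names = String[]
for (dir, files) in LAYOUT
    name = "Website.COM Data $dir"
    register(DataDep(
        name,
        "Data published in website.com, subdirectory $dir",
        [join((BASE_URL, dir, file), "/") for file in files],
    ))
    push!(dep_names, name)
end
```

With this registration, `datadep"Website.COM Data subA"` would fetch only A1.csv and A2.csv. Taken to the extreme, one DataDep per file gives per-file downloads, at the cost of a longer registry.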

Thank you for the confirmation! I think I can live with the situation.