DataDeps.jl - replacing a dependency

The docs say:

…you can’t force users to download new copies of your data using DataDeps. There are workarounds, such as using DataDeps.jl + deps/build.jl to rm(datadep"MyData", recursive=true, force=true) every package update.

My situation is this: I have a package CurrentPopulationSurvey.jl that is using DataDeps to allow users to download/unzip/parse Census Bureau data that is released monthly. I currently have it set up such that each year is a single dependency. For previous years, this is fine because all 12 months have been released and are available for download, so the dependency won’t change. However, for the current year, the dependency will change 12 times before reaching its ‘final’ state.

Basically, I’m adding a new download link each month to the current year’s dependency, so I need a user’s registered dependency (only for the current year) to be removed and replaced whenever the code tries to register it, rather than the user having to delete the dependency manually. Maybe @oxinabox or other users of DataDeps.jl can point me in the right direction? I’m not sure how to implement the workaround listed above, and I would only want it applied to the current-year dependency anyway.

Yeah, DataDeps is not great for this use-case.

In general, when one wants to expose a lot of data, I suggest doing what Embeddings.jl and CorpusLoaders.jl do,
and wrap your datadep"DataDepName" in an object,
e.g.

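using DataDeps

struct CData
    year::Int
end

function path(cdata::CData)
    year = cdata.year
    if year == 2020
        # current year: change the version in this name with each monthly release
        datadep"CensusData 2020 v0"
    else
        # completed years: all 12 months are in; the macro form allows interpolation
        @datadep_str "CensusData $year v12"
    end
end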

Then you can update the registration block each month to a new name.
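
To make the renaming scheme concrete, here is a minimal sketch of helpers that derive the dependency name and the list of download links from the number of months released so far. Everything in it is an assumption for illustration: the constants `CURRENT_YEAR` and `LATEST_MONTH`, the helper names, the "v<number of months>" naming, and the example.com URL pattern are all hypothetical and not part of DataDeps or CurrentPopulationSurvey.jl.

```julia
# Hypothetical constants: bump LATEST_MONTH when a new monthly file is released.
const CURRENT_YEAR = 2020
const LATEST_MONTH = 10

# Hypothetical URL pattern; the real Census Bureau links will differ.
month_url(year, month) = "https://example.com/cps/$(year)/$(lpad(month, 2, '0')).zip"

# Number of monthly files available for a given year.
months_available(year) = year == CURRENT_YEAR ? LATEST_MONTH : 12

# The name encodes how many months the dependency contains, so each new
# release produces a new name and therefore a fresh download.
depname(year) = "CensusData $year v$(months_available(year))"

# Download links to hand to the registration block for that year.
dep_urls(year) = [month_url(year, m) for m in 1:months_available(year)]
```

The registration block would then pass `depname(year)` and `dep_urls(year)` to `register(DataDep(...))`, and the wrapper object would resolve the same name, so only `LATEST_MONTH` needs to change each month.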

Deleting data programmatically is actually kinda hard.
Harder than I made it sound when I made that comment.
It's very easy for users to manually delete data without fear, in that it's fine to literally delete any data you got from a datadep; it will just redownload.
I periodically just do bash> rm -rd ~/.julia/datadeps

Here is something that deletes old data when it is called:

"""
    cleanup()

Deletes partial-year data.
"""
function cleanup()
    for year in 2012:2019
        for ver in 1:11  # never delete v12
            name = "CensusData $year v$ver"
            path = DataDeps.try_determine_load_path(name, @__DIR__)
            path === nothing && continue  # not downloaded
            rm(path; recursive=true)
        end
    end
end

A very different approach that I’ve not considered much
is to use DataDeps' missing-file handling.

If you specify a subfolder or file within a datadep, like datadep"CensusData 2020/Oct",
then DataDeps will complain about the file not being found and ask if you want to redownload.

So if your code naturally has a folder per month,
or if you add another signaling folder,
you could, in theory,
make use of this to update an existing datadep.

It's not really intended for that.
It's intended to help with the case where someone accidentally deleted a folder in a datadep manually.
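
As a rough illustration of the folder-per-month idea, the sketch below builds the name-plus-subfolder string for a given month. The helper `month_subpath` is hypothetical, and it assumes the unpacked data really does have one folder per month named "Jan", "Feb", and so on.

```julia
using Dates  # standard library, provides monthabbr

# Hypothetical helper: the datadep name followed by the per-month folder,
# assuming the unpacked data has one folder per month ("Jan", "Feb", ...).
month_subpath(year, month) = "CensusData $year/$(monthabbr(month))"

month_subpath(2020, 10)  # "CensusData 2020/Oct"
```

The resulting string would be resolved with the macro form, as in the `path` function above. If the month's folder is missing, that is the point where DataDeps should complain and offer to redownload.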

DataDepsPaths.jl (i.e. DataDeps v2) will be better at this.
But I am unlikely to ever have time to work on that project.


I suggest doing what Embeddings.jl and CorpusLoaders.jl do,
and wrap your datadep"DataDepName" in an object.

I’m going to try this out. Thanks so much, I appreciate it! :smiley: