DataDeps post_fetch_method

I have a bunch of zipped .csv files on an S3 repo. I’m using the excellent DataDeps to retrieve them. Next, I use the also excellent JuliaDB to loadtable the .csvs into tables, and then from there it’s processing etc etc etc.

It occurred to me that I could define a new post_fetch_method method for DataDeps, that

  1. retrieves the zipped file from the S3 repo, unpacks it
  2. loadtables the .csvs to tables
  3. saves those tables in binary format with JuliaDB.save
  4. and deletes the downloaded zipped file and the inflated .csv files.

But when I try it, everything gets deleted: the inflated files and everything else, including the folder in ~/.julia/datadeps/…

Anyone with some bright ideas?

BTW, the reason I don’t keep the information in binary from the get-go and insist on having it online as a .csv text file is just for compatibility.

Can you post your code?
That all sounds sane and doable.

(In the long term I have a plan for a new version of DataDeps that makes it easier to do this kind of thing, see DataDepsPaths.jl for the sketch)

So quick!!! Thanks, gonna check out DataDepsPaths. Here’s the code (redacted some of the sensitive stuff):

using DataDeps, JuliaDB, UUIDs, Dates
function csv2db(f)
    # unpack
    DataDeps.unpack(f)
    repo = datadep"database"
    # load the csv file into a JuliaDB.table
    data = loadtable(joinpath(repo, "data.csv"), indexcols = [1])
    # convert some column types
    data = setcol(data, :col1 => :col1 => UUID, :col2 => :col2 => Nanosecond)
    # save the table to binary
    save(data, joinpath(repo, "data.jldb"))
    # remove the csv file
    rm(joinpath(repo, "data.csv"))
end
register(DataDep("database", "the database", "https://s3.address/database.zip", post_fetch_method = csv2db))

That looks fine to me.
unpack takes an argument that controls whether the archive is deleted after extraction (by default it is deleted).
But that is fine, since deleting it is what you want.

Also the post_fetch_method is executed from the DataDep’s directory so the joinpath stuff isn’t needed.

So the following should be the same.

function csv2db(f)
    # unpack
    DataDeps.unpack(f)
    # load the csv file into a JuliaDB.table
    data = loadtable("data.csv", indexcols = [1])
    # convert some column types
    data = setcol(data, :col1 => :col1 => UUID, :col2 => :col2 => Nanosecond)
    # save the table to binary
    save(data, "data.jldb")
    # remove the csv file
    rm("data.csv")
end
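
To illustrate why the relative paths work, here is a Base-only sketch that mimics running a post-fetch function from inside a directory with cd(f, dir), which is the call that shows up in DataDeps stack traces. It is not DataDeps' actual code: fake_post_fetch, the temporary directory, and the file names are hypothetical stand-ins.

```julia
# Hypothetical stand-in for a post_fetch_method: it receives the path of the
# fetched file and otherwise uses only relative paths.
function fake_post_fetch(f)
    # "data.jldb" resolves against the current working directory
    write("data.jldb", "converted from $(basename(f))")
    # remove the fetched file once it has been converted
    rm(f)
end

dir = mktempdir()                      # stands in for the DataDep's directory
fetched = joinpath(dir, "data.csv")    # stands in for the fetched file
write(fetched, "a,b\n1,2\n")

# cd(f, dir) switches to `dir`, runs `f`, then restores the previous directory
cd(() -> fake_post_fetch(fetched), dir)

println(readdir(dir))
```

After this runs, the only file left in dir should be data.jldb, even though the function never called joinpath itself.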

Not sure what I’m doing wrong, but:

ERROR: IOError: unlink: no such file or directory (ENOENT)
Stacktrace:
 [1] rm at ./file.jl:245 [inlined]
 [2] csv2db(::String) at ./REPL[60]:24
 [3] #18 at /home/yakir/.julia/packages/DataDeps/LiEdA/src/resolution_automatic.jl:122 [inlined]
 [4] cd(::getfield(DataDeps, Symbol("##18#19")){typeof(csv2db),String}, ::String) at ./file.jl:96
 [5] run_post_fetch(::typeof(csv2db), ::String) at /home/yakir/.julia/packages/DataDeps/LiEdA/src/resolution_automatic.jl:119
 [6] #download#13(::String, ::Nothing, ::Bool, ::Function, ::DataDep{String,String,typeof(DataDeps.fetch_http),typeof(csv2db)}, ::String) at /home/yakir/.julia/packages/DataDeps/LiEdA/src/resolution_automatic.jl:84
 [7] download at /home/yakir/.julia/packages/DataDeps/LiEdA/src/resolution_automatic.jl:70 [inlined]
 [8] handle_missing at /home/yakir/.julia/packages/DataDeps/LiEdA/src/resolution_automatic.jl:10 [inlined]
 [9] _resolve(::DataDep{String,String,typeof(DataDeps.fetch_http),typeof(csv2db)}, ::String) at /home/yakir/.julia/packages/DataDeps/LiEdA/src/resolution.jl:83
 [10] resolve(::DataDep{String,String,typeof(DataDeps.fetch_http),typeof(csv2db)}, ::String, ::String) at /home/yakir/.julia/packages/DataDeps/LiEdA/src/resolution.jl:29
 [11] resolve(::String, ::String, ::String) at /home/yakir/.julia/packages/DataDeps/LiEdA/src/resolution.jl:54
 [12] resolve(::String, ::String) at /home/yakir/.julia/packages/DataDeps/LiEdA/src/resolution.jl:73
 [13] top-level scope at none:0

I’ll try to set up a complete MWE…

OK… This MWE actually works:

using DataDeps, JuliaDB, UUIDs, Dates
function csv2db(f)
    # unpack
    DataDeps.unpack(f)
    # load the csv file into a JuliaDB.table
    data = loadtable("database.csv", indexcols = [1])
    # convert some column types
    data = setcol(data, :uuid => :uuid => UUID, :ns => :ns => Nanosecond)
    # save the table to binary
    save(data, "database.jldb")
    # remove the csv file
    rm("database.csv")
end
register(DataDep("database", "the database", "https://s3.eu-central-1.amazonaws.com/vision-group-file-sharing/Fun%20Stuff/database.zip", post_fetch_method = csv2db))
datadep"database"

So I am not sure what exactly isn’t working in the real one. The only two differences between the MWE and the real deal are 1) using @eval to save all the tables, 2) removing a whole folder that contains many .csvs:

    for x in (:videofile, :video, :interval, :poi, :board, :calibration, :run, :experiment, :pixel_coord)
        @eval save($x, $("$x.jldb"))
        rm("$x.csv")
    end
    rm("pixel", recursive = true)

hmmm…

OK, I found what I was doing wrong. Totally irrelevant to DataDeps. Sorry for the noise and thanks for the help!!!

For the interested: in the for-loop above I’m saving the pixel_coord table, but I’m also attempting to rm a pixel_coord.csv file, which doesn’t exist (those .csvs are all in the pixel directory, which I later remove recursively).
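
One way to make that cleanup step tolerant of a missing file is rm's force keyword, which makes a non-existent path not an error. This is a minimal Base-only sketch of just the cleanup; the temporary directory and file names are made up for illustration, and the saving of the tables is left out.

```julia
dir = mktempdir()
tables = (:video, :run, :pixel_coord)   # hypothetical subset of table names

cd(dir) do
    # only some tables have a top-level csv; pixel_coord's csvs live in pixel/
    touch("video.csv")
    touch("run.csv")
    mkdir("pixel")
    touch(joinpath("pixel", "part1.csv"))

    for x in tables
        # force = true: no error if "$x.csv" does not exist (as for pixel_coord)
        rm("$x.csv"; force = true)
    end
    # remove the directory holding the remaining csv files
    rm("pixel"; recursive = true)
end

println(isempty(readdir(dir)))
```

The trade-off is that force = true also hides a genuinely misspelled file name, so an explicit isfile check may be preferable when a missing file should be noticed.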
