I have a bunch of zipped .csv files in an S3 bucket. I’m using the excellent DataDeps to retrieve them. Next, I use the equally excellent JuliaDB to loadtable the .csvs into tables, and from there it’s processing etc etc etc.
It occurred to me that I could define a new post_fetch_method for DataDeps that:
retrieves the zipped file from the S3 repo and unpacks it,
loadtables the .csvs into tables,
saves those tables in binary format with JuliaDB.save,
and deletes the downloaded zipped file and the inflated .csv files.
But then all the inflated files and everything else, including the folder in ~/.julia/datadeps/, get deleted…
Anyone with some bright ideas?
BTW, the reason I don’t keep the information in binary from the get-go and insist on having it online as a .csv text file is just for compatibility.
So quick!!! Thanks, I’m gonna check out DataDepsPaths. Here’s the code (with some of the sensitive stuff redacted):
using DataDeps, JuliaDB, UUIDs, Dates

function csv2db(f)
    # unpack
    DataDeps.unpack(f)
    repo = datadep"database"
    # load the csv file into a JuliaDB table
    data = loadtable(joinpath(repo, "data.csv"), indexcols = [1])
    # convert some column types
    data = setcol(data, :col1 => :col1 => UUID, :col2 => :col2 => Nanosecond)
    # save the table to binary
    save(data, joinpath(repo, "data.jldb"))
    # remove the csv file
    rm(joinpath(repo, "data.csv"))
end
register(DataDep("database", "the database", "https://s3.address/database.zip", post_fetch_method = csv2db))
That looks fine to me. unpack takes an argument controlling whether to delete the archive (it defaults to true).
But that is fine, since deleting it is what you want here.
Also, the post_fetch_method is executed from within the DataDep’s directory, so the joinpath stuff isn’t needed.
So the following should be the same:
function csv2db(f)
    # unpack
    DataDeps.unpack(f)
    # load the csv file into a JuliaDB table
    data = loadtable("data.csv", indexcols = [1])
    # convert some column types
    data = setcol(data, :col1 => :col1 => UUID, :col2 => :col2 => Nanosecond)
    # save the table to binary
    save(data, "data.jldb")
    # remove the csv file
    rm("data.csv")
end
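And if you ever do want to keep the downloaded archive around, unpack can be told not to delete it. A minimal sketch, assuming the keep_originals keyword in current DataDeps:

# keep the downloaded .zip next to the extracted files instead of deleting it
DataDeps.unpack(f; keep_originals = true)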
Here’s an MWE that works:

using DataDeps, JuliaDB, UUIDs, Dates

function csv2db(f)
    # unpack
    DataDeps.unpack(f)
    # load the csv file into a JuliaDB table
    data = loadtable("database.csv", indexcols = [1])
    # convert some column types
    data = setcol(data, :uuid => :uuid => UUID, :ns => :ns => Nanosecond)
    # save the table to binary
    save(data, "database.jldb")
    # remove the csv file
    rm("database.csv")
end

register(DataDep("database", "the database", "https://s3.eu-central-1.amazonaws.com/vision-group-file-sharing/Fun%20Stuff/database.zip", post_fetch_method = csv2db))

datadep"database"
So I am not sure what exactly isn’t working in the real one. The only two differences between the MWE and the real deal are (1) using @eval to save all the tables, and (2) removing a whole folder that contains many .csvs:
for x in (:videofile, :video, :interval, :poi, :board, :calibration, :run, :experiment, :pixel_coord)
    @eval save($x, $("$x.jldb"))
    rm("$x.csv")
end
rm("pixel", recursive = true)
OK, I found what I was doing wrong. Totally irrelevant to DataDeps. Sorry for the noise and thanks for the help!!!
For the interested: I’m saving the pixel_coord table in the for-loop, but I’m also attempting to rm the pixel_coord.csv file, which doesn’t exist (those files are all in a pixel directory, which I later remove recursively).
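A minimal sketch of the corrected loop, given that layout (there is no top-level pixel_coord.csv, so it must be skipped in the rm):

for x in (:videofile, :video, :interval, :poi, :board, :calibration, :run, :experiment, :pixel_coord)
    @eval save($x, $("$x.jldb"))
    # pixel_coord has no top-level csv; its sources live in the pixel directory
    x == :pixel_coord || rm("$x.csv")
end
# remove the directory holding the pixel_coord csvs
rm("pixel", recursive = true)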