I have a bunch of zipped .csv files in an S3 bucket. I’m using the excellent DataDeps to retrieve them. Next, I use the equally excellent JuliaDB to loadtable the .csvs into tables, and from there it’s processing etc etc etc.
It occurred to me that I could define a new post_fetch_method for DataDeps that:
retrieves the zipped file from the S3 repo and unpacks it,
loadtables the .csvs into tables,
saves those tables in binary format with JuliaDB.save,
and deletes the downloaded zipped file and the inflated .csv files.
But then all the inflated files and everything else, including the folder in ~/.julia/datadeps/, get deleted…
Anyone with some bright ideas?
BTW, the reason I don’t keep the information in binary from the get-go and insist on having it online as a .csv text file is just for compatibility.
So quick!!! Thanks, I’m gonna check out DataDepsPaths. Here’s the code (with some of the sensitive stuff redacted):
using DataDeps, JuliaDB, UUIDs, Dates

function csv2db(f)
    # unpack
    DataDeps.unpack(f)
    repo = datadep"database"
    # load the csv file into a JuliaDB table
    data = loadtable(joinpath(repo, "data.csv"), indexcols = [1])
    # convert some column types
    data = setcol(data, :col1 => :col1 => UUID, :col2 => :col2 => Nanosecond)
    # save the table to binary
    save(data, joinpath(repo, "data.jldb"))
    # remove the csv file
    rm(joinpath(repo, "data.csv"))
end
register(DataDep("database", "the database", "https://s3.address/database.zip", post_fetch_method = csv2db))
That looks fine to me. unpack takes an argument controlling whether to delete the archive (it defaults to true).
But that is fine, since deleting it is what you want here.
Also, the post_fetch_method is executed from within the DataDep’s directory, so the joinpath stuff isn’t needed.
So the following should be the same:
function csv2db(f)
    # unpack
    DataDeps.unpack(f)
    # load the csv file into a JuliaDB table
    data = loadtable("data.csv", indexcols = [1])
    # convert some column types
    data = setcol(data, :col1 => :col1 => UUID, :col2 => :col2 => Nanosecond)
    # save the table to binary
    save(data, "data.jldb")
    # remove the csv file
    rm("data.csv")
end
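And if you ever do want to keep the downloaded archive around, unpack can be told not to delete it. A minimal sketch, assuming the keep_originals keyword in current DataDeps:

# keep the downloaded .zip next to the extracted files instead of deleting it
DataDeps.unpack(f; keep_originals = true)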
Here’s an MWE that works:

using DataDeps, JuliaDB, UUIDs, Dates

function csv2db(f)
    # unpack
    DataDeps.unpack(f)
    # load the csv file into a JuliaDB table
    data = loadtable("database.csv", indexcols = [1])
    # convert some column types
    data = setcol(data, :uuid => :uuid => UUID, :ns => :ns => Nanosecond)
    # save the table to binary
    save(data, "database.jldb")
    # remove the csv file
    rm("database.csv")
end

register(DataDep("database", "the database", "https://s3.eu-central-1.amazonaws.com/vision-group-file-sharing/Fun%20Stuff/database.zip", post_fetch_method = csv2db))

datadep"database"
So I am not sure what exactly isn’t working in the real one. The only two differences between the MWE and the real deal are (1) using @eval to save all the tables, and (2) removing a whole folder that contains many .csvs:
for x in (:videofile, :video, :interval, :poi, :board, :calibration, :run, :experiment, :pixel_coord)
    @eval save($x, $("$x.jldb"))
    rm("$x.csv")
end
rm("pixel", recursive = true)
OK, I found what I was doing wrong. Totally irrelevant to DataDeps. Sorry for the noise and thanks for the help!!!
For the interested: I’m saving the pixel_coord table in the for-loop, but I’m also attempting to rm the pixel_coord.csv file, which doesn’t exist (those files are all in a pixel directory, which I later remove recursively).
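A minimal sketch of the corrected loop, given that layout (there is no top-level pixel_coord.csv, so it must be skipped in the rm):

for x in (:videofile, :video, :interval, :poi, :board, :calibration, :run, :experiment, :pixel_coord)
    @eval save($x, $("$x.jldb"))
    # pixel_coord has no top-level csv; its sources live in the pixel directory
    x == :pixel_coord || rm("$x.csv")
end
# remove the directory holding the pixel_coord csvs
rm("pixel", recursive = true)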