UrlDownload.jl is a small package aimed at simplifying the process of downloading data and converting it to the necessary format.
Features
- In-memory data processing, no intermediate files stored.
- Supports csv, json, feather, various image formats with basic autodetection ability.
- Allows custom parsers for all other data formats.
- Uses ProgressMeter.jl to show the status of the download process.
Examples
- Basic usage
using UrlDownload
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.feather"
df = urldownload(url)
# 2×2 DataFrame
# │ Row │ x     │ y     │
# │     │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1   │ 1     │ 2     │
# │ 2   │ 3     │ 4     │
- Progress meter
using UrlDownload
url = "https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-December-2019-quarter/Download-data/business-price-indexes-december-2019-quarter-csv.csv"
urldownload(url, true)
# Progress:  45%|███████████████████                        |  Time: 0:00:01
- Custom parsers
using UrlDownload
using DataFrames
using CSV
url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url, parser = x -> DataFrame(CSV.File(IOBuffer(x))))
# 2×2 DataFrame
# │ Row │ x     │ y     │
# │     │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1   │ 1     │ 2     │
# │ 2   │ 3     │ 4     │
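As another illustration of the custom-parser hook, a JSON endpoint can be handled the same way. This is only a sketch: the URL below is hypothetical, and JSON3.jl is just one possible reader for the downloaded bytes.

```julia
using UrlDownload
using JSON3

# Hypothetical JSON endpoint -- substitute a real URL
url = "https://example.com/data.json"

# Any function mapping the downloaded bytes to a value can serve as a parser;
# here the raw bytes are decoded to a String and read with JSON3
res = urldownload(url, parser = x -> JSON3.read(String(x)))
```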
You might also be interested in HTTP.download and TerminalLoggers progress bars:
using HTTP, TerminalLoggers, Logging
# Set the global logger to a TerminalLogger, which supports progress bars
global_logger(TerminalLogger(right_justify=120))
url = "http://ipv4.download.thinkbroadband.com/10MB.zip"
HTTP.download(url)
This might be useful to simplify your implementation, if you were so inclined.
Can you tell me what the difference between:
res = urldownload(url, parser = x -> DataFrame(CSV.File(IOBuffer(x))))
and:
res = DataFrame(CSV.File(IOBuffer(urldownload(url))))
is?
Is the point to avoid triggering the autodetection?
You might also be interested in FileIO.jl which is all about autodetection.
Possibly you could use this to support autodetection on a ton more types?
You are almost correct, the second version should look like
res = DataFrame(CSV.File(IOBuffer(urldownload(url, parser = identity))))
and there will be no difference in the outcome. It's just a matter of personal taste which form to use.
Regarding FileIO.jl, it was my initial approach; unfortunately, FileIO.jl (or perhaps the libraries that use it) is not very friendly to IOBuffer data types. For example, this bug was fixed only recently, there is no support for JSON files, and so on. In the end, it was easier to avoid FileIO.jl altogether. Maybe at some point I'll return to FileIO, but it requires lots of testing.
urldownload differs from Base.download by storing everything in memory, apart from parsers, progress bars, etc.? Does it rely on curl, wget, fetch, etc., like Base.download?
It currently relies on HTTP.jl, which as far as I understand is pure Julia, without any curl, wget, and so on.
Guys, a related question: do you know how I can test if a link is valid? I have a list of URLs and I want to download them all to disk, but I can't make a loop, since if a link is broken it throws an error.
Couldn't you just do standard error handling with try/catch?
You should use a try/catch block. For example:
using UrlDownload
using Sockets
using DataFrames
urls = ["https://badurl", "https://www.stats.govt.nz/assets/Uploads/Electronic-card-transactions/Electronic-card-transactions-May-2020/Download-data/electronic-card-transactions-may-2020-csv.zip"]
res = DataFrame[]
for url in urls
    try
        push!(res, urldownload(url, true) |> DataFrame)
    catch er
        if er isa Sockets.DNSError
            println("url $url does not exist")
        else
            rethrow()
        end
    end
end
You should go through the catch part and work out which errors you want to skip and which you want to re-raise.
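If you prefer to check links up front instead of (or in addition to) catching errors during the download, one possible sketch using HTTP.jl is to issue a HEAD request and inspect the status code. The `link_ok` helper below is hypothetical; `status_exception = false` keeps HTTP.jl from throwing on 4xx/5xx responses, while DNS failures still raise an exception and are caught separately.

```julia
using HTTP
using Sockets

# Returns true when the server answers a HEAD request with a status below 400.
# DNS failures (unresolvable hosts) are caught and reported as invalid links.
function link_ok(url)
    try
        r = HTTP.head(url; status_exception = false)
        return r.status < 400
    catch er
        er isa Sockets.DNSError && return false
        rethrow()
    end
end
```

Note that some servers answer HEAD requests differently from GET, so a try/catch around the actual download, as in the loop above, remains the most reliable check.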
Thank you all! That was fast. @nilshg @Skoffer