[ANN] UrlDownload.jl

UrlDownload.jl is a small package that aims to simplify downloading data and converting it to the required format.

Features

  1. In-memory data processing; no intermediate files are stored.
  2. Supports csv, json, feather, and various image formats, with basic format autodetection.
  3. Allows custom parsers for all other data formats.
  4. Uses ProgressMeter.jl to show the status of the download.

Examples

  1. Basic usage
using UrlDownload

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/test.feather"
df = urldownload(url)

# 2×2 DataFrame
# │ Row │ x     │ y     │
# │     │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1   │ 1     │ 2     │
# │ 2   │ 3     │ 4     │
  2. Progress meter
using UrlDownload

url = "https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-December-2019-quarter/Download-data/business-price-indexes-december-2019-quarter-csv.csv"

urldownload(url, true)
# Progress: 45%|████████████████████                      | Time: 0:00:01
  3. Custom parsers
using UrlDownload
using DataFrames
using CSV

url = "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"
res = urldownload(url, parser = x -> DataFrame(CSV.File(IOBuffer(x))))
# 2×2 DataFrame
# │ Row │ x     │ y     │
# │     │ Int64 │ Int64 │
# ├─────┼───────┼───────┤
# │ 1   │ 1     │ 2     │
# │ 2   │ 3     │ 4     │

You might also be interested in HTTP.download and TerminalLoggers Progress Bars

using HTTP, TerminalLoggers, Logging

# Set logger to a Terminal logger which thus supports progress bars
global_logger(TerminalLogger(right_justify=120))

url = "http://ipv4.download.thinkbroadband.com/10MB.zip"
HTTP.download(url)


This might be useful to simplify your implementation, if you were so inclined.


Can you tell me what the difference between:

res = urldownload(url, parser = x -> DataFrame(CSV.File(IOBuffer(x))))

and:

res = DataFrame(CSV.File(IOBuffer(urldownload(url))))

is?

Is the point to avoid triggering the autodetection?


You might also be interested in FileIO.jl which is all about autodetection.
Possibly you could use this to support autodetection on a ton more types?

You are almost correct, the second version should look like

res = DataFrame(CSV.File(IOBuffer(urldownload(url, parser = identity))))

and there will be no difference in the outcome. It's just a matter of personal taste which form to use.
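To illustrate why the two forms agree, here is a minimal sketch that needs no network access and no extra packages. `fake_urldownload` and `parse_rows` are hypothetical stand-ins invented for this example (they are not part of UrlDownload.jl), and the sketch assumes, as the `IOBuffer(x)` call above suggests, that the parser receives the raw downloaded bytes.

```julia
# Hypothetical stand-in for urldownload: it "downloads" a fixed body
# and hands the raw bytes to the parser, with no network access.
fake_urldownload(url; parser = identity) = parser(Vector{UInt8}("x,y\n1,2\n3,4\n"))

# Toy parser written in plain Julia: split the bytes into rows of fields.
parse_rows(bytes) = [split(line, ',') for line in split(String(copy(bytes)), '\n'; keepempty = false)]

url = "https://example.invalid/ext.csv"  # placeholder, never contacted

# Form 1: the parser is applied inside the download call.
inside = fake_urldownload(url, parser = parse_rows)

# Form 2: get the raw bytes with `parser = identity`, then parse them outside.
outside = parse_rows(fake_urldownload(url, parser = identity))

inside == outside  # both forms produce the same rows
```

With `parser = identity` the raw bytes come back untouched, so applying the parser afterwards is the same computation in a different order of writing.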

Regarding FileIO.jl: it was my initial approach. Unfortunately, FileIO.jl (or perhaps the libraries that use it) is not very friendly to IOBuffer data. For example, this bug was fixed only recently, there is no support for JSON files, and so on. In the end, it was easier to avoid FileIO.jl altogether. Maybe at some point I'll return to FileIO.jl, but that requires lots of testing.

So, apart from parsers, progress bars, etc., urldownload differs from Base.download by storing everything in memory?

Does it rely on curl, wget, fetch, etc., like Base.download?

It currently relies on HTTP.jl, which as far as I understand is pure Julia and does not call curl, wget, and so on.


Guys, a related question: do you know how I can test whether a link is valid? I have a list of URLs and I want to download them all to disk, but I can't just loop over them, because a broken link throws an error.

Couldn't you just do standard error handling with try catch?


You should use a try-catch block. For example:

using UrlDownload
using Sockets
using DataFrames

urls = ["https://badurl", "https://www.stats.govt.nz/assets/Uploads/Electronic-card-transactions/Electronic-card-transactions-May-2020/Download-data/electronic-card-transactions-may-2020-csv.zip"]
res = DataFrame[]
for url in urls
    try
        push!(res, urldownload(url, true) |> DataFrame)
    catch er
        if er isa Sockets.DNSError
            println("url $url does not exist")
        else
            rethrow()
        end
    end
end

You should go through the catch part and decide which errors you want to skip and which you want to rethrow.
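Since the question was about saving files to disk, the same pattern can be sketched with the `Downloads` standard library (available from Julia 1.6, so this is an assumption about your Julia version, not something UrlDownload.jl provides). The function name `download_all`, the `file_$i` naming scheme, and the `fetch` keyword are all hypothetical choices made for this illustration.

```julia
using Downloads

# Try to download every URL into `dir` and keep going when one fails.
# Returns the saved file paths and a list of `url => error` pairs.
# `fetch` is any function `(url, dest)`; it defaults to Downloads.download
# and can be swapped for a stub to try the loop without network access.
function download_all(urls, dir; fetch = Downloads.download)
    saved = String[]
    failed = Pair{String,Any}[]
    for (i, url) in enumerate(urls)
        dest = joinpath(dir, "file_$i")  # hypothetical naming scheme
        try
            fetch(url, dest)
            push!(saved, dest)
        catch err
            # Let the user interrupt with Ctrl-C; record every other error.
            err isa InterruptException && rethrow()
            push!(failed, url => err)
        end
    end
    return saved, failed
end

# Example call (performs real network requests):
# saved, failed = download_all(["https://badurl", "https://raw.githubusercontent.com/Arkoniak/UrlDownload.jl/master/data/ext.csv"], mktempdir())
```

This version catches every error except an interrupt, which is the broadest reasonable choice; after a run, inspect `failed` and narrow the `catch` branch to the error types you actually want to skip.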


Thank you all! That was fast. @nilshg @Skoffer