Dataset access - Python

@oxinabox will set me right here. I recall looking at

There is a Python standard for accessing datasets which was mentioned at the time.
Or am I imagining this? I ask since I am working in a Python shop.

For the standard, you are probably thinking of the JSON-LD / thing.
Dataset - Type
Google AI Blog: Facilitating the discovery of public datasets
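
Since the question comes from a Python shop, here is a minimal sketch of reading that kind of metadata with only the Python standard library. It assumes the page embeds a schema.org Dataset description in a script tag of type application/ld+json. The sample page, its URL, and the helper names (JsonLdExtractor, find_datasets) are made up for illustration, and it only handles a top-level JSON object, not a list or an @graph wrapper.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies arrive here as raw text; keep them only inside JSON-LD tags.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._buffer)))


def find_datasets(html):
    """Return every top-level JSON-LD object whose @type is "Dataset"."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return [
        doc for doc in parser.documents
        if isinstance(doc, dict) and doc.get("@type") == "Dataset"
    ]


# Hypothetical landing page, used only to exercise the sketch.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example rainfall measurements",
  "distribution": [
    {"@type": "DataDownload",
     "encodingFormat": "text/csv",
     "contentUrl": "https://example.org/rainfall.csv"}
  ]
}
</script>
</head><body></body></html>
"""

datasets = find_datasets(SAMPLE_PAGE)
download_urls = [
    dist["contentUrl"]
    for ds in datasets
    for dist in ds.get("distribution", [])
]
print(download_urls)
```

The distribution / contentUrl fields are where a schema.org Dataset lists its downloadable files, so they are the natural thing to feed into whatever download tooling you use.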

This is how DataDepsGenerators uses it:

For Python tools:
Look at Quilt and frictionlessdata (both actually cover most languages, including Julia).
IIRC both are commercial concerns.
Neither can consume JSON-LD directly AFAICK,
but frictionlessdata has their own competing standard.
They seem like good folk.


I was just reviewing this package last night, looks great.
Quick question → Having trouble using the get_dataurls_from_webserver_index function in misc_extractors.jl
Was trying to use this on a public directory such as this:

What is the proper syntax to use that function? Thanks!!

Apologies, scratch that, had tried to use import instead of using.

using DataDepsGenerators: get_dataurls_from_webserver_index

…and it works!

I suspect that one would benefit from a bit of generalizing:

function get_dataurls_from_webserver_index(datapage_url)
    datapage = getpage(datapage_url)

    data_hrefs = (attr(ele, "href") 
        for ele in eachmatch(sel"a", datapage.root)) 
        if !match(r"^(To )?Parent( Directory)?$"i, text_only(ele)) && !match(r"(To )?Index"i, text_only(ele)
    data_urls = joinpath.(datapage_url, data_hrefs)

Related: there is an issue about having a proper generator API, based on using that interactively.

Ok, cool! Put this in the misc_extractors file and reloaded kernel. (added as a ‘v2’ function, and had to add an end).
Tried to give it a spin but am getting an error:


UndefVarError: ele not defined

With the previous version, it produces:

AssertionError: Gumbo.HTMLElement{:a}:
<a href="/group/okstatestat">

This is because that isn’t a valid URL to feed to this function; I will try to find a better example.
Many thanks for your consideration here!

Probably a small typo: the if clause ended up outside the generator's closing parenthesis, so ele is not in scope there.
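
For anyone following along from the Python side, the same idea (collect the links on a web server index page, skip the "Parent Directory" / "Index" navigation links, and resolve the rest against the page URL) can be sketched with the standard library. This is not the DataDepsGenerators implementation: the names LinkCollector and data_urls_from_index, the sample listing, and the example.org URL are all hypothetical, and it uses urljoin rather than a path join to build the URLs.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Same two patterns as the Julia snippet, merged into one case-insensitive regex.
NAV_LINK = re.compile(r"^(To )?Parent( Directory)?$|(To )?Index", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Record (href, link text) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def data_urls_from_index(index_url, html):
    """Return absolute URLs for all links that are not navigation links."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        urljoin(index_url, href)
        for href, text in collector.links
        if not NAV_LINK.search(text)
    ]


# Hypothetical directory listing, used only to exercise the sketch.
SAMPLE_INDEX = """
<html><body>
<a href="../">Parent Directory</a>
<a href="a.csv">a.csv</a>
<a href="sub/b.csv">b.csv</a>
<a href="/index.html">To Index</a>
</body></html>
"""

urls = data_urls_from_index("https://example.org/data/", SAMPLE_INDEX)
print(urls)
```

One design note: urljoin resolves a root-relative href such as /group/okstatestat against the host rather than appending it to the page path, which is usually what you want when the index page mixes relative and absolute links.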