Dataset access - Python

@oxinabox will set me right here. I recall looking at https://github.com/oxinabox/DataDepsGenerators.jl

I recall a Python standard for accessing datasets being mentioned at the time.
Or am I imagining this? I ask since I am working in a Python shop.

By "standard" you are probably thinking of the JSON-LD / Schema.org thing.
Dataset - Schema.org Type
Google AI Blog: Facilitating the discovery of public datasets

This is how DataDepsGenerators uses it:
https://github.com/oxinabox/DataDepsGenerators.jl/tree/master/src/APIs/JSONLD
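For a feel of what that markup looks like: pages that follow the Schema.org convention embed a `Dataset` description as JSON-LD inside a `<script type="application/ld+json">` tag. Below is a rough sketch in plain Julia that pulls the block out of a page and picks out the download URLs. The page source and URLs are made up for illustration, and the regex matching is only a stand-in to keep the example dependency-free; a real tool would hand the block to a proper JSON parser. This is not the code DataDepsGenerators uses.

```julia
# Hypothetical page source; the dataset name and URL are invented.
html = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example dataset",
  "distribution": [
    {"@type": "DataDownload", "contentUrl": "https://example.org/data.csv"}
  ]
}
</script>
</head><body></body></html>
"""

# Grab the contents of the JSON-LD script tag (the `s` flag lets `.` span newlines).
block = match(r"<script[^>]*type=\"application/ld\+json\"[^>]*>(.*?)</script>"s, html)
jsonld = strip(block.captures[1])

# Crude checks on the raw text; a JSON parser would be the robust choice here.
is_dataset = occursin(r"\"@type\"\s*:\s*\"Dataset\"", jsonld)
download_urls = [m.captures[1] for m in eachmatch(r"\"contentUrl\"\s*:\s*\"([^\"]*)\"", jsonld)]

println(is_dataset)
println(download_urls)
```

The `distribution` / `contentUrl` fields are the part a downloader cares about; the rest of the block is descriptive metadata.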

For Python tools:
Look at Quilt and frictionlessdata (actually, both cover most languages, including Julia).
IIRC both are commercial concerns.
Neither can consume JSON-LD directly AFAICK,
but frictionlessdata has their own competing standard.
They seem like good folk.

Thank you.

I was just reviewing this package last night, looks great.
Quick question → Having trouble using the get_dataurls_from_webserver_index function in misc_extractors.jl
https://github.com/oxinabox/DataDepsGenerators.jl/blob/master/src/misc_extractors.jl
Was trying to use this on a public directory such as this:
https://download.bls.gov/pub/time.series/sm/

What is proper syntax to use that function? Thanks!!

Apologies, scratch that, had tried to use import instead of using.

using DataDepsGenerators: get_dataurls_from_webserver_index

…and it works!
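For anyone hitting the same thing: presumably the function is not exported, so a bare `import` of the package only brings the module name into scope, and the function then has to be called with the module prefix or named explicitly. A small self-contained sketch with a toy module (`Demo` and its functions are hypothetical stand-ins, not part of DataDepsGenerators):

```julia
# Toy module standing in for a package with one exported and one unexported function.
module Demo
export exported_fn
exported_fn() = "exported"
unexported_fn() = "unexported"
end

# `import` brings only the module name into scope; qualified calls work.
import .Demo
qualified_result = Demo.unexported_fn()

# Calling `unexported_fn()` unqualified at this point would throw an UndefVarError.

# Naming the function explicitly makes it callable without the prefix.
using .Demo: unexported_fn
direct_result = unexported_fn()

println(qualified_result, " ", direct_result)
```

Either the qualified call or the explicit `using Package: name` form should do the job.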

I suspect that one would benefit from a bit of generalizing:

function get_dataurls_from_webserver_index(datapage_url)
    datapage = getpage(datapage_url)

    data_hrefs = (attr(ele, "href") 
        for ele in eachmatch(sel"a", datapage.root)) 
        if !match(r"^(To )?Parent( Directory)?$"i, text_only(ele)) && !match(r"(To )?Index"i, text_only(ele)
    ) 
     
    data_urls = joinpath.(datapage_url, data_hrefs)
end

Related is the issue to have a proper generator API based on using that interactively
https://github.com/oxinabox/DataDepsGenerators.jl/issues/3

Ok, cool! Put this in the misc_extractors file and reloaded kernel. (added as a ‘v2’ function, and had to add an end).
Tried to give it a spin but am getting an error:

get_dataurls_from_webserver_index_v2("https://data.ok.gov/dataset/health-care-cost-growth/")

Out:
UndefVarError: ele not defined
With the previous version, it produces:

AssertionError: Gumbo.HTMLElement{:a}:
<a href="/group/okstatestat">
  OKStateStat
</a>

Which is because that isn’t a valid URL to feed it for this function, will try and find a better example.
Many thanks for your consideration here!

probably a small typo
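The `UndefVarError` is consistent with the generator's closing parenthesis landing before the `if`, which turns the filter into a separate `if` statement where `ele` is no longer in scope (that would also explain the extra `end` that was needed). Separately, `match` returns a `RegexMatch` or `nothing` rather than a `Bool`, so negating it with `!` would fail as well; `occursin` is the usual choice. Here is a sketch of the filtering logic with both points addressed. It works on plain (link text, href) pairs so that it runs without the package; the pairs, the file names, and the helper name `is_navigation_link` are made up for illustration.

```julia
# The two patterns from the snippet above: link texts used for navigation, not data.
const PARENT_RE = r"^(To )?Parent( Directory)?$"i
const INDEX_RE  = r"(To )?Index"i

# `occursin` returns a Bool, so it can be combined and negated directly.
is_navigation_link(text) = occursin(PARENT_RE, text) || occursin(INDEX_RE, text)

# Hypothetical stand-in for the parsed <a> elements: (link text, href) pairs.
links = [
    ("Parent Directory", "../"),
    ("To Index", "index.html"),
    ("file_a.txt", "file_a.txt"),
    ("file_b.txt", "file_b.txt"),
]

# The `if` filter sits inside the brackets, so `text` is in scope for it.
data_hrefs = [href for (text, href) in links if !is_navigation_link(text)]

# Plain string concatenation, assuming the base URL ends with a slash.
base_url = "https://download.bls.gov/pub/time.series/sm/"
data_urls = string.(base_url, data_hrefs)
println(data_urls)

# Shape of the same fix in the original function (not run here, since
# getpage/attr/text_only come from the package):
#   data_hrefs = (attr(ele, "href")
#       for ele in eachmatch(sel"a", datapage.root)
#       if !is_navigation_link(text_only(ele)))
```

One thing to keep in mind: the second pattern is not anchored, so any link whose text contains "index" (for example "reindexed.csv") would be filtered out too.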