Permission to download datasets from other people’s websites

I’m developing a package that downloads some datasets (using DataDeps) that are publicly available on the internet. I was wondering if there might be some legal or ethical problems with that.
Should I ask the authors first, before I publish my code? Or write them an email afterwards? And what if they don’t answer?

Coding is a field of asking for forgiveness. If we begged for permission every time, there wouldn’t be this little thing we call the internet.

// but seriously, post the datasets because they probably have some license attached


edit: I was kind of joking. You should probably send an email and reference them in your work. They might even give you more data.

As another datapoint, when we created SNAPDatasets.jl, I got Jure’s permission in advance to reproduce (some of) the graphs.

To clarify, it is not for research purposes. What I want is a Julia package from which you can download publicly available datasets, such as the ones at Combinatorial Data (see MLDatasets.jl for an example of such a package).

I use the package DataDeps.jl for that. With DataDeps, you don’t store the dataset in your repository; you store only the URL of the dataset. When a user asks for a dataset for the first time, they get a prompt in which I can show a message with information on the dataset and the authors. The user then has to confirm manually, and the dataset is cached locally on the user’s hard drive.
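
For anyone who hasn’t used it, the registration described above might look roughly like the sketch below. The package name, dataset name, author, and URL are all placeholders I made up for illustration; only `register`, `DataDep`, and the `datadep"..."` string macro come from DataDeps.jl itself.

```julia
module ExampleGraphData  # hypothetical package name

using DataDeps

function __init__()
    # Registration happens when the package is loaded; nothing is downloaded yet.
    register(DataDep(
        "ExampleGraphs",  # hypothetical dataset name
        # This message is shown in the prompt before the first download,
        # so it is the place for credit, provenance, and license terms.
        """
        Dataset: Example graph collection (placeholder description)
        Author: Jane Doe (placeholder)
        Website: https://example.org/data/

        Please cite the author if you use this data in your work.
        """,
        "https://example.org/data/graphs.txt",  # placeholder URL
        # A SHA-256 checksum string can be passed as a fourth argument;
        # it is omitted here because the file is fictional.
    ))
end

# Resolving the datadep string triggers the prompt and download on first use,
# and simply returns the cached local directory on later calls.
graphs_path() = joinpath(datadep"ExampleGraphs", "graphs.txt")

end # module
```

Calling `ExampleGraphData.graphs_path()` for the first time would show the message and ask for confirmation. As far as I know, setting `ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"` skips the prompt, which is mainly meant for CI runs.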

Judging from this statement on that website, it seems he would not mind:

I am starting to make some of my large collection of combinatorial data available. See here for what is available so far.

Also, since this guy made nauty, I’m guessing a similar attitude would follow from its readme?

// summarizing, I think he’s cool with someone using his data. If you emailed him, you’d probably make his day.

IANAL, but it is my belief that if someone puts something online, then absent any statements to the contrary, they intend for it to be downloaded.

And that it makes no difference whether it is downloaded using Firefox, Chrome, curl or DataDeps.jl.

I believe it is a worthy distinction that you are not redistributing them, only linking to them.
Some people certainly do not like it if you redistribute their work, but linking is much less objectionable.

As you know, the prompt DataDeps shows is there so you can give plenty of credit and data provenance information, as well as a EULA (some data definitely comes with a EULA).

I think the ideal prompt generally looks something like the prompt DataDepsGenerators.jl includes in its generated code.

I think that making the data available for download without any password protection or anything implies free use permissions. Make sure you accurately cite the source though.

Counterexamples include anything with an explicit license to the contrary.

Fine. I’d amend my sentence to say it implies “common sense free use permissions”. If they wanted to make it available but on a restricted basis, they’d put some additional barrier in place, such as making it available only upon request or putting it behind a registration wall.

As long as he isn’t using their data and passing it off as his own and he isn’t using the data to make copious amounts of $$, I don’t think there’d be much to worry about.

Thanks for all the answers.
I see two reasons why somebody might not want their data downloaded by a program:

  1. They want the user to explicitly look at their website, maybe because they have ads there or they see their website as an ad for themselves.
  2. They are afraid of excessive bandwidth use. A good thing is that DataDeps.jl caches locally, so it will not be a problem even if somebody puts the request for data in an infinite loop.

I think the following might be a good workflow:

  1. Check the website for a license. If there is one, act accordingly.
  2. Implement the code that downloads the dataset and publish that code.
  3. Afterwards, write a short email to the author and tell them what you are doing.