Provenance of data in scientific packages

I’m not sure how pervasive it is but I’ve noticed this issue a couple of times.

The issue is this: the provenance of the data used to implement basic “scientific packages” is ambiguous. This could mean the source of the data is not documented, or is poorly documented. Alternatively, it could mean the documented source is not a primary source (a citable journal article). Sometimes the referenced source is Wikipedia (and, while I love Wikipedia, it is not a primary source). Or maybe the data is archaic, or does not come from the best-available, internationally respected source such as IUPAC or a national metrology institute.

Furthermore, we need to be careful when we change the data (maybe based on an updated tabulation). This should be considered a “breaking change” from the perspective of semantic versioning. I want to know, because changing data I depend upon is likely to change my results and may break tests in my packages. This is not to say we shouldn’t update the data when better data becomes available. We should, but we should warn our users.

This is not a blanket criticism of all scientific packages in Julia - PhysicalConstants.jl, among others, does a great job. I’m not going to name names, but to be taken seriously in the scientific community, Julia needs to set a higher standard.

Personally, I’d prefer really small packages that each address one-and-only-one data issue and whose provenance is well documented. For example, an “AtomicWeights.jl” that implements the best practices for atomic weights as described in IUPAC publications.
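
To make that concrete, here is a rough sketch of how such a package might carry its provenance alongside each value. Everything in it is hypothetical: the names `Provenance`, `SourcedValue`, `TABLE`, `value` and `citation` are made up for illustration, and the single entry `:Xx` uses placeholder numbers and a placeholder citation, not real atomic-weight data.

```julia
# Hypothetical sketch: none of these names come from an existing package.

# Where a tabulated value came from.
struct Provenance
    citation::String   # full reference to the primary source
    doi::String        # persistent identifier for that source
    retrieved::String  # date the table was transcribed (ISO 8601)
end

# A number that always travels with its uncertainty and its source.
struct SourcedValue{T<:Real}
    value::T
    uncertainty::T
    source::Provenance
end

# Placeholder entries only: the citation and numbers below are NOT real data.
const PLACEHOLDER_SOURCE = Provenance(
    "Author, A. (year). Title. Journal, volume, pages.",
    "10.0000/placeholder",
    "1970-01-01",
)

const TABLE = Dict{Symbol,SourcedValue{Float64}}(
    :Xx => SourcedValue(1.0, 0.1, PLACEHOLDER_SOURCE),
)

# Accessors: users can always ask where a number came from.
value(sym::Symbol) = TABLE[sym].value
citation(sym::Symbol) = TABLE[sym].source.citation

println(value(:Xx), " from: ", citation(:Xx))
```

The point of the design is that the citation is part of the data structure rather than a comment somebody may or may not have written, so a table cannot be added without stating its source.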

Editorial aside: There is a really bad habit in the Julia community of copying a Python package, re-coding it in Julia and calling it a day. Some of the packages I have issue with just copied the Python data.

I did not understand this:

Can you give concrete examples? Pretty much every Julia package I use was not re-written from Python. I use about 100 packages.

But I agree that some packages should be citing more scientific papers explicitly in documentation strings. However, I think this isn’t a Julia problem. It is an open source problem. Python has it just as much.


@Datseris I apologize. This comment was overbroad. Most Julia packages do not follow this model. However, I could name a handful that are problematic because they do.

DocumenterCitations.jl is an exceptional package that makes the process of including citations in your documentation more seamless. We have used it in a couple of projects and we are very satisfied. Moving forwards, I will be porting all documentation I curate to use this package.

Perhaps it would be helpful for us as a community, when we encounter a package that we believe should be citing more literature, to simply recommend DocumenterCitations.jl as a tool that may make adding citations a less time-consuming process!


I’ll have to give DocumenterCitations.jl a try. Another useful package is DataDeps.jl for downloading data sets straight from the source. It provides a nice mechanism to document the provenance and to ensure integrity.
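
For anyone unfamiliar with the integrity part, the underlying idea is to record a checksum of the source file and refuse to use a download that does not match it. The snippet below is only a minimal sketch of that idea using the `SHA` standard library; it is not the DataDeps.jl API, and `sha256_hex` and `verify_download` are hypothetical helper names. The temporary file stands in for a real downloaded data set.

```julia
using SHA  # Julia standard library

# Hex-encoded SHA-256 digest of a file's contents.
sha256_hex(path::AbstractString) = open(io -> bytes2hex(sha256(io)), path)

# Hypothetical helper: throw if the file does not match the recorded checksum.
function verify_download(path::AbstractString, expected_sha256::AbstractString)
    actual = sha256_hex(path)
    actual == lowercase(expected_sha256) ||
        error("checksum mismatch for $path: expected $expected_sha256, got $actual")
    return path
end

# Usage: a temporary file stands in for a downloaded data set.
path, io = mktemp()
write(io, "element,weight\n")
close(io)

expected = sha256_hex(path)      # in practice, recorded when the source is first vetted
verify_download(path, expected)  # returns `path` when the contents match
```

If the upstream file ever changes, the check fails loudly instead of silently feeding different numbers into downstream results, which is exactly the kind of warning I was asking for above.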