How to know if a package is good?

Since the original post came from the perspective of an educator, let me comment in that context. (There are plenty of other important issues raised in this thread, but I don’t have the expertise to say anything useful.)

I’ve taught a few mathematics courses at the college level with a programming component. For these I used Python (though I hope to use Julia in the future).

Based on my own experience, and on talking to other instructors, the primary factor guiding language adoption (in an educational context) seems to be ease of use, both for the students and the instructor. My goal is to teach math, and to the extent I feel that assigning programming problems helps with this, I require programming/simulation/etc. But I’m always doing a cost–benefit analysis in the back of my head, wondering whether the benefits from programming assignments outweigh all the non-mathematical problems that students have to overcome to complete them (learning syntax, debugging packages, debugging language installations, etc.).

If a student feels they spend more time wrestling with the language (or package choice, or poorly supported/broken packages) than doing math, then that’s a bad experience for them. And if I have to answer a ton of incidental language/package questions, which really have nothing to do with the main content I’m trying to teach, that’s a bad experience for me.

So, at the beginning of class, I give a handout on Python with a 30-minute quick start guide to the language and a few package suggestions for basic tasks (e.g. seaborn for plotting). This way they don’t have to Google how to do basic things like plot, or attempt to judge the merits of various libraries; they can just focus on the mathematical content of the course.

I think this is a pretty common approach. For example, in the companion site to the book “Fundamentals of Numerical Computation with Julia,” the authors give a few package suggestions for students (along with installation instructions, etc.). See the GitHub repository fncbook/FundamentalsNumericalComputation.jl (core functions for the Julia (2nd) edition of the text Fundamentals of Numerical Computation, by Driscoll and Braun). They also standardize on the Plots package in the book.

In light of these considerations, and to respond more directly to the original post: I feel it would be good if people teaching undergraduates could standardize on a few simple packages that are easy to use and bug-free, just for the purpose of teaching. They don’t need to be the fastest or most sophisticated; they just need to minimize the number of headaches for the students and instructor.

For more advanced users with more specialized needs, of course other packages may be more useful. But my students and I are not advanced users; we’re just coding up basic simulations/examples to illustrate lecture content.

I have been wondering if there is some value in a data loading library which brings together a bunch of other packages under one API. For example you could have

load_table(DataFrame, "some/file.csv")

which loads the given file as a DataFrame (or whatever table type you like) and understands many formats (CSV, TSV, various JSON formats, Parquet, ARFF, etc.). It could even decompress files for you.

Similarly, there could be load_matrix (CSV, binary, NumPy, etc.) and load_unstructured (JSON, Serialisation, YAML, etc.).

It would offer only a few options of its own and point you to the underlying libraries for anything more.

This targets the “I just want to get a smallish table loaded without worrying about the format” niche.
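To make the idea concrete, here is a minimal sketch in Python (the language used in the course described at the top of the thread), using only the standard library. Everything in it is an assumption made for illustration: the reader table, the helper names, and the choice of a list of dicts as the single return type. It only shows the design point that the result has the same shape whatever the file format is.

```python
import csv
import json
from pathlib import Path

# Hypothetical sketch: every reader returns the same shape (a list of dicts,
# one dict per row), so the caller does not need to care about the format.

def _read_delimited(path, delimiter):
    # csv.DictReader uses the first line of the file as the column names.
    with open(path, newline="", encoding="utf-8") as f:
        return [dict(row) for row in csv.DictReader(f, delimiter=delimiter)]

def _read_json_records(path):
    # Only one JSON layout is accepted here: an array of objects.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not (isinstance(data, list) and all(isinstance(r, dict) for r in data)):
        raise ValueError("expected a JSON array of objects")
    return data

# Dispatch table from file extension to reader (an assumption of this sketch).
_READERS = {
    ".csv": lambda path: _read_delimited(path, ","),
    ".tsv": lambda path: _read_delimited(path, "\t"),
    ".json": _read_json_records,
}

def load_table(path):
    """Load a small table; the return type does not depend on the format."""
    suffix = Path(path).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"unsupported table format: {suffix!r}")
    return _READERS[suffix](path)
```

One caveat of this sketch: values read from CSV or TSV are always strings, while JSON values keep their JSON types, so a real package would also have to decide how to unify column types.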

Would this add value, or just be the N+1th package for data loading?

FileIO does that, I think.

I’ve never found FileIO to be very useful in the past because it is too general and you don’t know what will be returned. Even for the same file format, the return type of load depends on what packages you have installed.

What I proposed was more restrictive: the return type is independent of the file format.

Yes, I think this is a good idea. Potentially this could just be a wrapper for FileIO that does the necessary conversions?

Perhaps we should discuss in another thread, or Slack, since this thread is already quite long.

I don’t think that is such a bad thing. Unless proposals are motivated by specific use cases, it is easy to get meandering discussions that go nowhere, because expectations diverge.

If I understand correctly, your specific problem that prompted this discussion is the following:

  1. you are teaching a course, which uses some programming,
  2. at the same time, you don’t want to spend excessive course time on programming,
  3. and you especially don’t want students to get distracted by the specifics of loading data.

In this particular case, IMO the best solution would be to

  1. specify which packages to use for details that are incidental for the course,
  2. share a small example (e.g. a Jupyter notebook) that demonstrates them.
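As one possible shape for such a shared example, the first cell of a course notebook could check that the recommended packages are installed before anything else runs. This is only a sketch in Python, the language of the course mentioned at the top of the thread; the package list and the function name are hypothetical placeholders.

```python
from importlib.util import find_spec

# Hypothetical course list; replace it with whatever the handout recommends.
COURSE_PACKAGES = ["numpy", "seaborn"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(COURSE_PACKAGES)
    if missing:
        print("Please install before continuing:", ", ".join(missing))
    else:
        print("All course packages are available.")
```

A check like this turns an installation problem into one clear message at the start, instead of an import error in the middle of an assignment.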

Generally, I think that the idea of selecting the “best” general package for some purpose is an illusion: when maintained alternatives exist, they are usually there to address trade-offs.

The problem is not unlike asking for a list of “good books”: you can find such lists on the internet, but for most people they of course miss a lot of books that they really enjoyed, or found transformative.

A very important trade-off is maintainability: when “recommended” packages are selected by a third party, it is very easy for them to become a large monolithic mess that is technically maintained, but only for small patches; innovation no longer happens there because at some point breaking things becomes more and more difficult. This has already happened to quite a few packages in the Julia ecosystem (and no, I will not name them).

So I would rather answer the original question: how do I know if a package is good? Here are the heuristics I use to decide whether to invest in a package:

  1. recent activity. Of course this is a fuzzy term, and may not be applicable to small packages that do one thing and rarely need updates. But medium-size packages usually need some dusting off, and for Julia in particular CI and tooling require occasional minor changes to the repo. If these are missing, and there has been no activity for years, that’s a bad sign.

  2. open issues and recent issue activity. Seeing issues that have been fixed recently is a good sign. Long-term outstanding issues exist for large projects, so in itself that’s not a problem, but major outstanding issues without activity suggest that the package is dormant.

  3. open PRs without discussion. If someone made a contribution and got no reply for months or years, the project is dormant or dead.

  4. functioning test suite, CI and coverage. It is hard to give a general rule, but for me quality usually starts at around 70% coverage.

  5. documentation, including an explanation of what the package does and how it is different from other, similar packages. If, in addition, functions have well-written docstrings and the code is well organized, that means that the package will get contributions.
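The checks above can be written down as a small checklist. The sketch below is in Python and is purely illustrative: the field names, the two-year and 90-day cutoffs, and the wording of the notes are assumptions; only the 70% coverage figure comes from the list. It deliberately returns separate notes for a human to read and does not compute a single score.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class RepoSnapshot:
    """Facts collected by hand about a repository (field names are hypothetical)."""
    last_commit: date
    oldest_idle_major_issue: Optional[date]  # None if no major issue is idle
    oldest_unanswered_pr: Optional[date]     # None if every open PR has a reply
    has_ci: bool
    coverage_percent: Optional[float]        # None if coverage is not reported
    has_docs: bool

def warning_signs(repo: RepoSnapshot, today: date,
                  stale_days: int = 730,      # assumed reading of "for years"
                  silence_days: int = 90,     # assumed reading of "for months"
                  min_coverage: float = 70.0  # figure taken from the list above
                  ) -> List[str]:
    """Return one note per heuristic that looks bad; never a combined score."""
    notes = []
    if (today - repo.last_commit).days > stale_days:
        notes.append("no recent activity")
    issue = repo.oldest_idle_major_issue
    if issue is not None and (today - issue).days > silence_days:
        notes.append("major issue without activity")
    pr = repo.oldest_unanswered_pr
    if pr is not None and (today - pr).days > silence_days:
        notes.append("open PR without a reply")
    if not repo.has_ci or repo.coverage_percent is None:
        notes.append("no CI or no coverage report")
    elif repo.coverage_percent < min_coverage:
        notes.append("coverage below threshold")
    if not repo.has_docs:
        notes.append("no documentation")
    return notes
```

An empty list does not mean that a package is good, and a non-empty one does not mean that it is bad; each note is only a prompt to take a closer look.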

But all of these are heuristics, and I can name exceptions to every point above. So I find combining them into any kind of “package goodness metric” pretty misleading: it would do more harm than good, because it would give the illusion of having obtained some meaningful information (and we already have ML for that purpose :wink:).
