How to know if a package is good?

Since scipy was mentioned quite often, there is GitHub - AtsushiSakai/SciPy.jl: Julia interface for SciPy which technically makes Julia as good as Python when using scipy :troll:

As for courses, creating a curated list of packages useful for a course, in the form of Project.toml and Manifest.toml files, is not difficult. If you are teaching the course and you choose Julia as the language to recommend, the burden of finding which tools to use lies mostly on you as the instructor and not on the students. So if the students are spending too much time finding the right package, a cheat sheet of recommended packages would be a helpful addition to the course material. Perhaps some of the course preparation time can also go into improving the documentation of the packages used in the course, or into adding more examples and tutorials.
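To make that concrete, the curated environment can be produced with a few Pkg commands. A sketch, where the package names are just examples of what an instructor might pick:

```julia
# Sketch of building a course environment. Pkg.activate creates a
# Project.toml in the given directory; Pkg.add (commented out here because
# it needs network access) records exact versions in Manifest.toml.
using Pkg

course = mktempdir()  # in practice: a directory checked into the course repo
Pkg.activate(course)
# Pkg.add(["CSV", "DataFrames", "Plots"])  # instructor curates this list once

# Students reproduce the exact same environment with:
#   Pkg.activate("path/to/course"); Pkg.instantiate()
```

Checking the resulting Project.toml and Manifest.toml into the course repository means every student instantiates the same package versions.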

As for finding good packages, I think test coverage and documentation coverage are two things to look for. Maybe we need a tool for quantifying how much of the exported API is covered in the docs.
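As a rough illustration of such a tool, a first cut could just compare a module's exported names against its docstring table. This sketch relies on the internal `Base.Docs.meta`, so treat it purely as an illustration, not a robust implementation:

```julia
# Fraction of a module's exported names that carry a docstring.
# Base.Docs.meta is internal API; a real tool would also scan docs/src pages.
function doc_coverage(mod::Module)
    exported = setdiff(names(mod), [nameof(mod)])
    isempty(exported) && return 1.0
    bindings = keys(Base.Docs.meta(mod))   # Binding => docs mapping
    documented = count(n -> any(b -> b.var == n, bindings), exported)
    return documented / length(exported)
end

# Toy module: one documented export, one undocumented
module DemoPkg
export f, g
"Adds one."
f(x) = x + 1
g(x) = x - 1
end

doc_coverage(DemoPkg)  # → 0.5
```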

7 Likes

In general, the problem of identifying if a package is good is similar to the problem of identifying if a research paper is good. If a paper (package) has many citations (stars), is it necessarily good? If a paper (package) has authors from a big university, is it necessarily good? If a paper (package) is in a reputable journal (organisation), is it necessarily good?

On the other hand, if a paper (package) is on arxiv (a personal GitHub account), the author is not from a big university and it doesn’t have many citations (stars), is it a bad paper (package)? I think most of us would agree that this would be an unfair evaluation in general. But at least we know that this paper (package) went through a less rigorous review process to get to its current state. People’s trust is a funny thing, and ultimately non-autonomous students will probably trust their professor more than any metric. So if a professor recommends a random paper on arxiv, they will regard it highly and trust it. If the professor recommends a random package on GitHub, they will do the same. Autonomous students will read the paper (package’s src, tests and documentation) and decide for themselves whether this paper (package) is good for them and what needs improvement.

8 Likes

The context was packages shown on a website. I don’t think removing anything from those views will affect Pkg.

1 Like

Is Julia Packages being phased out? It seems a lot of the stats are outdated, e.g. the “Updated last” field is typically incorrect. For example, the website says that Flux was updated 1 year ago, but on GitHub the latest released tag is 9 days old. The stars also seem outdated: the website says that Pluto has 2874 stars, but on GitHub it says 3.9K stars. However, JuliaHub has the correct stats. What is the difference between these two sites anyway? They seem very similar, yet both are linked from the Julia language website’s landing page.

Other than that, my 2 cents:

  1. I think the Julia ecosystem requires a different mindset than, e.g. Python’s numpy/pandas/scipy/matplotlib batteries-included-in-a-monolithic-package approach. The idea that you import only what you need is really quite cool, but for people coming from these other environments it is different enough to be challenging. Currently, I think the Julia language website already points people to the “big” packages in the Ecosystem portion of the landing page. I think this is actually enough to get one started, but maybe this new mindset should be emphasized there?
  2. For teaching students at university, I typically do give the students a cheat sheet, and have found that it works. On a side note: I also found it is super important to explain carefully to students the advantages and disadvantages of JIT - newcomers or people used to Python tend to be skeptical when the load times are long, but if you explain why and how to mitigate it, things are smoother.
  3. Maybe it is worth it to be even more forceful with the messaging around exposing newcomers to the community. I really think the Slack or Zulip channels are great media for people to engage and ask quick questions like “what package should I use for xyz” and get immediate feedback. Also, since most newcomers are in academia (fact check?), I think it is a great selling point because you can meet people and see cutting edge ideas being discussed in real time. Somehow the initial shyness just needs to be overcome.

Someone else mentioned the idea that anyone can develop a package in Julia and have it compete head-to-head with any of the big packages with 1000s of developers in other languages, because the two-language problem is absent in Julia. I think this is such an amazing trait and deserves to be emphasized to justify point (1) above even more!

11 Likes

I am sorry for that. I will be more careful next time expressing my opinion.

Have you considered that maybe your Google search results are already fine tuned to your profile? And that other students may be typing “Julia” for the first time in their search boxes?

Regarding this specific example with CSV files… I had students who preferred to follow the official Julia documentation and attempted DelimitedFiles instead. They had issues with the stdlib and thought that they were doing something wrong. After all, they thought: “this is a stdlib, I must be doing something wrong…”.

Thinking more broadly, I do believe that many searches on Google will point to outdated packages that are not touched in years. JuliaHub or any other community-driven platform could introduce metrics about the “healthy state” of a package, or at least introduce links to similar efforts to help make the redundancy explicit.

IMHO, as a leader of this community you could approach the discussion differently by asking more questions in the thread to guide the discussion towards improvements. When you try to find evidence that goes against the issues raised, you are taking a position of disbelief, which then triggers a whole set of arguments that are not productive.

Start with the fact that multiple professors at different universities are sharing the same point of view. You may disagree with this point of view, but there is certainly an issue somewhere to be addressed. Now, what questions can you ask to identify the core issues? Trying to prove the opposite is not the best leading strategy.

Thank you for considering the proposal. I will try to implement the initial set of links using “Deep NLP” like I did in this paper: https://arxiv.org/pdf/1712.01476.pdf

The idea is to scan all the README (+ docs/src) files and represent packages as bags of words. An unsupervised algorithm will then learn an embedding of these words, as in the image below:

(figure: 2-D word-embedding visualization, with clusters of related terms)

In this example, the algorithm figures out certain classes of words like “integers”, “issues”, “operations”. This information can then be used to assign a “subject” to each package or a “similarity of subject”.
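Just to make the first step concrete, the bag-of-words representation itself is tiny. In this sketch the README excerpts are invented; the method in the paper then learns embeddings on top of such counts rather than comparing them directly:

```julia
using LinearAlgebra  # stdlib, for dot and norm

# Invented README excerpts standing in for real packages
readmes = Dict(
    "TablesA.jl" => "fast reader for csv and tabular data",
    "PlotsB.jl"  => "plotting and visualization of tabular data",
)

# Shared vocabulary and bag-of-words count vectors
vocab = sort(unique(reduce(vcat, split.(collect(values(readmes))))))
bow(text) = [count(==(w), split(text)) for w in vocab]
vectors = Dict(name => bow(text) for (name, text) in readmes)

# A crude "similarity of subject": cosine similarity of the count vectors
cos_sim(a, b) = dot(a, b) / (norm(a) * norm(b))
cos_sim(vectors["TablesA.jl"], vectors["PlotsB.jl"])
```

The two invented packages share the words “tabular” and “data”, so their similarity lands strictly between 0 and 1.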

I implemented the paper using Python a long time ago, but I will try to find the time in the following months to implement it in Julia using one of the many neural net frameworks available. Alternatively, I am happy to mentor students interested in learning these methods.

In the meantime, nothing blocks the development of a textbox menu in JuliaHub so that users could manually insert links in the similarity graph.

12 Likes

Yes, of course. The results are the same when logged out of Google. It’s easy for anyone to try the exact same thing.

Regarding this specific example with CSV files… I had students who preferred to follow the official Julia documentation and attempted DelimitedFiles instead. They had issues with the stdlib and thought that they were doing something wrong. After all, they thought: “this is a stdlib, I must be doing something wrong…”.

DelimitedFiles should absolutely be removed as a stdlib… and it is: remove DelimitedFiles from being an stdlib by KristofferC · Pull Request #45540 · JuliaLang/julia · GitHub. It should also have some notes added to it indicating that CSV should generally be preferred. If someone wants to take something straightforward and actionable from this, adding those notes is a good action item. Although DelimitedFiles is occasionally good for simple files representing numerical matrices and such.

JuliaHub or any other community-driven platform could introduce metrics about the “healthy state” of a package, or at least introduce links to similar efforts to help make the redundancy explicit.

JuliaHub includes both GitHub star count and recent download counts as indicators of activeness. If you or anyone has ideas for better indicators, please feel free to propose them.

At a higher level, there’s a little disconnect in expectation here: I think it is neither the responsibility of nor an appropriate role for a third party commercial platform like JuliaHub to decide which packages people should prefer in the open source ecosystem. That is one of the reasons I’m advocating for people to create and establish open source, community driven graphs/metrics/whatever. If the community agrees that these are useful, then it’s fine for those to be included on JuliaHub and JuliaPackages and wherever else, but it’s just not a company’s place to dictate what packages “win” in the open source ecosystem.

IMHO, as a leader of this community you could approach the discussion differently by asking more questions in the thread to guide the discussion towards improvements. When you try to find evidence that goes against the issues raised, you are taking a position of disbelief, which then triggers a whole set of arguments that are not productive.

I’m trying to determine what concrete steps can be taken to improve things. Broad, generalized complaints don’t help improve anything. Since the general complaint here is that basic data science packages are hard to find, I tried what a new user would do when trying to figure out how to do things like reading a CSV file or plotting something. It’s unfortunate that you interpret this as me trying to “find evidence that goes against the issues raised” or “taking a position of disbelief”. I’m trying to find specific things that can be improved. My specific findings are:

  • If a new user is trying to figure out how to load a CSV file, they are quite likely to learn the best way to do it by googling.
  • If they limit themselves to stdlibs, they may be misled into trying to use DelimitedFiles, which does work for some kinds of data, but not general heterogeneous tabular data. So it’s a good thing we’re removing that as a stdlib, and notes should be added to the documentation referencing CSV as a probably preferable alternative.
  • If a new user is trying to figure out how to plot something, they are likely to find the official Plots tutorial, which is both a good package choice and a working, current tutorial.
  • Interpolations apparently doesn’t cover some basic use cases and may be a pain point (again, unclear to me as I’ve never really used this functionality). There is already prominent linkage to other interpolation packages, but perhaps that could be improved by someone who knows about this. It may also be possible to add whatever features new users are likely to need to the Interpolations package.

There have been some responses along the lines of “but if I’m using Python then everything I need is in scipy/pandas/matplotlib”. The thing is, I’m not sure what to do with that. Sure, that may be true, but this isn’t Python. (And sometimes it seems that the pandas CSV reader isn’t even the one you should use—you should be using PyArrow instead; this is according to the author of pandas, among others.) Julia doesn’t do monolithic superpackages like scipy. Some people may think that it should, but I’m not one of them. So it’s just unclear what the actionable aspect of this observation is, aside from trying to make sure that the “google what I want to do” approach works as well as it can.

Start with the fact that multiple professors at different universities are sharing the same point of view. You may disagree with this point of view, but there is certainly an issue somewhere to be addressed. Now, what questions can you ask to identify the core issues? Trying to prove the opposite is not the best leading strategy.

Yes, that’s why I’m here asking exactly those questions.

I will try to implement the initial set of links using “Deep NLP” like I did in this paper: https://arxiv.org/pdf/1712.01476.pdf

Awesome. I look forward to that.

In the meantime, nothing blocks the development of a textbox menu in JuliaHub so that users could manually insert links in the similarity graph.

This is actually the part that I think is a massive product effort and have no appetite for building. Sounds simple… “just a text box”. But any time you have humans giving input, it gets complicated. The text box has a lot of options, so it will need search-filter functionality. I’m sure there’s some fancy JavaScript widget someone has built that already does that well, but someone needs to research it, integrate it, and make sure it works nicely on the site.

And any time there’s human input, there’s danger of gaming and spam, and then you need systems to detect and deal with that. People will be proposing edits to a community knowledge base, and as soon as you have that, you need a system to let people review and approve or reject those proposed edits. That means you need a notion of “community admins” who are allowed to review things. And moderation tools that let community admins block people who keep proposing bad edits. And logging of all actions taken by admins. And features for platform admins to manage the community admins.

It sounds simple, but it’s really a whole can of worms from a product perspective. It’s all doable, but it’s… complex. (And all that for a feature with zero commercial demand.)

It seems much more plausible to me if the “related/alternative packages” graph is maintained externally as a community resource. For example, it could be a TOML file that’s maintained on GitHub. Then changes would be proposed and reviewed using the same tooling that we already use for code and registries. Or even simpler: people propose edits to READMEs linking to other related packages.
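For example, such a community-maintained file could be as simple as the following. The schema and the package pairings here are purely illustrative:

```toml
# related-packages.toml — hypothetical schema, maintained via ordinary PRs

[CSV]
related = ["DelimitedFiles", "Arrow"]
note = "General heterogeneous tabular data; see DelimitedFiles for plain numeric matrices"

[DelimitedFiles]
related = ["CSV"]
note = "Simple delimited numeric data; see CSV for general tables"
```

Any site (JuliaHub, JuliaPackages, or others) could then consume the same file, while edits go through ordinary review on GitHub.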

16 Likes

From the constructive side: at Introduction · DataFrames.jl we try to maintain a community-driven list of relevant packages for DataFrames.jl users (as this package is likely to be at least quickly looked at by any new Julia user who wants to do data related work in Julia).

The intention is for this list to include maintained and mature packages (of course this is subjective, as the whole discussion in this thread shows).

Therefore:

  • if someone reading this feels something should be added to this list please open a PR/issue.
  • similarly if you find some information provided there outdated please open a PR/issue.

Every such PR/issue will go through a standard review process by at least two members of the JuliaData organization.

15 Likes

I’ve always found looking at the dependents of a package (ie. the packages using it) to be interesting.
This is already being maintained by JuliaHub. Perhaps a summary could be promoted to the search results page. Of course this is mostly relevant for “middleware” packages.

1 Like

It isn’t immediately clear that JuliaHub is a “third party commercial platform”, as on the Julia homepage under Packages there’s a link “over 7,400 Julia packages” which directly takes you to the package search page of JuliaHub. There is no mention of anything remotely commercial on that page, other than the “Powered by Julia Computing” in the footer, which is all the way down and easy to miss.

And if you happen to end up on JuliaHub, it reads “JuliaHub is the entry point for all things Julia: explore the ecosystem, contribute packages, and easily run code in the cloud on big machines and on-demand clusters.”, again not giving any clue that this is a commercial service.

This really blurs the lines between community-driven and commercial, and so it’s not really surprising that there’s a disconnect in expectations.

11 Likes

I agree that the distinction should be clarified. We’re in the middle of a redesign/revamp of JuliaHub as a landing page and will make sure it’s clearer what the role is.

8 Likes

Hmm, well just to give a countervailing anecdote, I’ve had students who want to read a super-simple csv into a plain Julia array, they google and hit CSV.jl, and then run into trouble because that seems to be roping them into the whole Tables.jl / DataFrames.jl ecosystem and they just want a darn Array. Whereas DelimitedFiles does exactly what they wanted.

(Nothing against CSV.jl at all, it’s a great package, but just in the context of DelimitedFiles getting maybe un-stdlib’d)

6 Likes

I think that might be a bit overstated. I wrote something for someone in Python a few weeks ago. They sent me a csv file. The first thing I did was google “python csv reader”, and I found the builtin module. I don’t see how that makes me an advanced user… Maybe because I knew to type “csv”. So, now I google “read file in python”. Then I get a mix of stuff. A few mention pandas. One mentions Excel. Some cover just any text file. A few include csv and the builtin package. I just don’t see this giant funnel showing me the anointed way.

1 Like

To anyone considering a reply to this issue, please consider starting a separate thread to discuss the un’stdlibfication of DelimitedFiles :pray:

This present thread has a more general topic of discussion.

3 Likes

Definitely. I have a distinct memory of when I was first learning using a combination of f.eachline() and line.split(',') to parse a CSV… Took me forever to figure out how to treat the header line differently…

Since the original post came from the perspective of an educator, let me comment in that context. (There are plenty of other important issues raised in this thread, but I don’t have the expertise to say anything useful.)

I’ve taught a few mathematics courses at the college level with a programming component. For these I used Python (though I hope to use Julia in the future).

Based on my own experience, and talking to other instructors, it seems like the primary factor guiding language adoption (in an educational context) is ease of use, both for the students and the instructor. My goal is to teach math, and to the extent I feel that assigning programming problems helps with this, I require programming/simulation/etc. But I’m always doing a cost–benefit analysis in the back of my head, wondering whether the benefits from programming assignments outweigh all the non-mathematical problems that students have to overcome to complete them (learning syntax, debugging packages, debugging language installations, etc.).

If a student feels they spend more time wrestling with the language (or package choice, or poorly supported/broken packages) than doing math, then that’s a bad experience for them. And if I have to answer a ton of incidental language/package questions, which really have nothing to do with the main content I’m trying to teach, that’s a bad experience for me.

So, at the beginning of class, I give a handout on Python with a 30-minute quick start guide to the language and a few package suggestions for basic tasks (e.g. seaborn for plotting). This way they don’t have to Google how to do basic things like plot, or attempt to judge the merits of various libraries; they can just focus on the mathematical content of the course.

I think this is a pretty common approach. For example, in the companion site to the book “Fundamentals of Numerical Computation with Julia,” the authors give a few package suggestions for students (along with installation instructions, etc.). See here: GitHub - fncbook/FundamentalsNumericalComputation.jl: Core functions for the Julia (2nd) edition of the text Fundamentals of Numerical Computation, by Driscoll and Braun. They also standardize on the Plots package in the book.

In light of these considerations, and to respond more directly to the original post: I feel it would be good if people teaching undergraduates could standardize on a few simple packages that are easy to use and bug free, just for the purpose of teaching. They don’t need to be the fastest or most sophisticated, they just need to minimize the number of headaches for the students and instructor.

For more advanced users with more specialized needs, of course other packages may be more useful. But my students and I are not advanced users; we’re just coding up basic simulations/examples to illustrate lecture content.

14 Likes

I have been wondering if there is some value in a data loading library which brings together a bunch of other packages under one API. For example you could have

load_table(DataFrame, "some/file.csv")

which loads the given file as a DataFrame (or whatever table type you like) and it understands many formats (CSV, TSV, various JSON formats, Parquet, ARFF, etc, etc, could even uncompress things too for you).

Similarly load_matrix (CSV, binary, numpy, etc) and load_unstructured (JSON, Serialisation, YAML, etc).

It would have some limited options, but not much, instead pointing you at other libraries.

This targets the “I just want to get a smallish table loaded without worrying about the format” niche.
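A minimal sketch of what I have in mind: pick the reader from the file extension, then convert to whatever sink type the caller asked for. All names here are hypothetical, and the toy reader stands in for real packages like CSV.jl:

```julia
# Hypothetical load_table: format inferred from the extension, return type
# fixed by the caller-supplied sink.
const TABLE_READERS = Dict{String,Function}()
register_reader!(ext, f) = (TABLE_READERS[ext] = f)

function load_table(sink, path::AbstractString)
    ext = lowercase(splitext(path)[2])
    haskey(TABLE_READERS, ext) || error("no reader registered for $ext")
    sink(TABLE_READERS[ext](path))
end

# Toy CSV reader for demonstration; a real implementation would delegate
# to CSV.jl, JSON3.jl, Parquet.jl, etc.
register_reader!(".csv", path -> [split(line, ',') for line in eachline(path)])

# Usage: load a file as a plain vector of rows (identity sink)
path = joinpath(mktempdir(), "demo.csv")
write(path, "a,b\n1,2\n")
rows = load_table(identity, path)   # → [["a", "b"], ["1", "2"]]
```

The point of the sink argument is exactly the restriction mentioned above: the return type is chosen by the caller, independent of the file format.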

Would this add value, or just be the N+1th package for data loading?

4 Likes

FileIO does that I think

1 Like

I’ve never found FileIO to be very useful in the past because it is too general and you don’t know what will be returned. Even for the same file format, the return type of load depends on what packages you have installed.

What I proposed was more restrictive: the return type is independent of the file format.

5 Likes

Yes, I think this is a good idea. Potentially this could just be a wrapper for FileIO that does the necessary conversions?

Perhaps we should discuss in another thread, or Slack, since this thread is already quite long.

1 Like

I don’t think that is such a bad thing. Unless proposals are motivated by specific use cases, it is easy to get meandering discussions that go nowhere, because expectations diverge.

If I understand correctly, your specific problem that prompted this discussion is the following:

  1. you are teaching a course, which uses some programming,
  2. at the same time, you don’t want to spend excessive course time on programming,
  3. and you especially don’t want students to get distracted by the specifics of loading data

In this particular case, IMO the best solution would be to

  1. specify which packages to use for details that are incidental for the course,
  2. share a small example (eg a Jupyter notebook) that demonstrates them

Generally, I think that the idea of selecting the “best” general package for some purpose is an illusion: when maintained alternatives exist, they are usually there to address trade-offs.

The problem is not unlike asking for a list of “good books”: you can find such lists on the internet, but for most people such lists of course miss a lot of books that they really enjoyed, or found transformative.

A very important trade-off is maintainability: when “recommended” packages are selected by a third party, it is very easy for them to become a large monolithic mess that is technically maintained, but only for small patches; innovation no longer happens there because at some point breaking things becomes more and more difficult. This has already happened to quite a few packages in the Julia ecosystem (and no, I will not name them).

So I would rather answer the original question: how do I know if a package is good? Here are the heuristics I use to decide whether to invest in a package:

  1. recent activity. Of course this is a fuzzy term, and may not be applicable to small packages that do one thing and rarely need updates. But medium-size packages usually need some dusting off, and for Julia in particular CI and tooling require occasional minor changes to the repo. If these are missing, and there has been no activity for years, that’s a bad sign.

  2. open issues and recent issue activity. Seeing issues that have been fixed recently is a good sign. Long-term outstanding issues exist for large projects, so in itself that’s not a problem, but major outstanding issues without activity suggest that the package is dormant.

  3. open PRs without discussion. Someone made a contribution and got no reply for months or years = the project is dormant or dead.

  4. functioning test suite, CI and coverage. It is hard to give a general rule, but quality usually starts around at least 70% coverage for me.

  5. documentation. including an explanation of what the package does and how it is different from other, similar packages. If, in addition, functions have well-written docstrings, and code is well organized, that means that the package will get contributions.

But all of these are heuristics, and I can name exceptions to every point above. So I find combining them into any kind of “package goodness metric” pretty misleading; it would do more harm than good, because it would give the illusion of having obtained some meaningful information (and we already have ML for that purpose :wink:)

11 Likes