How to know if a package is good?

It’s all part of the same site and pulling it apart would be an effort. Moreover, search of public and private code is a commercially compelling feature that we are not interested in giving away. There’s also no need to open source any of it: if someone implements the proposed package similarity as a service or a package then it can be integrated easily.

2 Likes

You always have, and thank you for that.

Also, I think it is important to acknowledge the massive progress on some of these fronts in the Julia ecosystem. Plots and DataFrames are pretty stable and have a critical mass of features, and the TTFP is good enough. VS Code is stable and has a sufficient number of features at this point.

I don’t think basic data science tasks are the real issue here (and I’m not sure how many of those R/Python users you will attract anyway, but that isn’t my field).

For open-source competitors, the main issue is the numpy/scipy stack. The standard scientific computing requirements (interpolation, linear optimization, nonlinear optimization, quadrature, solving systems of equations, finding roots of univariate functions, numerical linear algebra, and a few others) aren’t the “basics” for everyone, but for those having the most problems with the package ecosystem’s quality/feature coverage/discovery this seems to be a common theme. We have talked about discovery, but as Stephen pointed out, part of this is identifying packages with the most resources, and being in “numpy/scipy/pandas” is an easy way to know a package has a lot of resources and is actively maintained.

I brought up interpolation as just one example, but with some obvious exceptions (FFT, anything to do with solving differential equations, SparseArrays, etc.) that whole stack of features I mentioned is tough for users, and even identifying the most-starred package isn’t enough. I know the people maintaining a lot of the most-starred packages and respect them greatly, but they are spread thin and can’t possibly keep these packages maintained without coordinated resources. Matlab/NAG are commercial, but what about scipy? NEP 48 — Spending NumPy project funds — probably tells you a lot of what you need to know… I could be wrong, but I suspect coordination on package/org funding made it happen.

Of course, there is a tradeoff for focus. Maybe the resources/energy spent trying to marshal the ecosystem to make it more accessible for scientific computing users with less programming experience, who would otherwise use matlab or python+scipy, would be better spent on other things.

I think it is the wrong test, but since you don’t use Python day-to-day I understand why it made sense to try. A Python or matlab scientific computing user wouldn’t even consider googling which csv package to use, because everything is built in to their basic environment, where they import scipy, pandas, numpy, and matplotlib. (Also, while CSV.jl is a good top choice, let’s not forget that CSV.jl broke for weeks earlier this year, during which time it was terrible advice!)

Anyways, I am going to stop posting here. I understand where you are coming from completely and why you are puzzled. Thank you for everything you guys have done to make the best language for scientific computing the world has ever seen, and enabling some of the best packages in history for those topics. Hopefully this helps give some context about one part of the package ecosystem. Whether something mitigating these issues is feasible is a separate question.

7 Likes

That is exactly what I do before I choose a package, and perhaps one more addition is
6. Look for tutorials about all candidate packages and see which one people are most passionate about.

1 Like

I spent an hour or two back in 2018 listing the packages required for parity with Matlab’s standard library. Some of it’s out of date, but it might be a useful starting point for a beginner package recommendation reference document: Julia for MATLAB users club - #8 by stillyslalom

7 Likes

One small and seemingly simple change would be to remove packages that have never worked with Julia 1.x. Requiring that they have a Project.toml and/or are registered in the General registry should be close enough.

3 Likes

I don’t think that’s enough, because then Pkg just throws a weird error. It would instead need to throw an error like “Package exists on GitHub but was never registered for Julia v1.x and is thus not installable, meaning the package was likely abandoned as unmaintained” or something like that.

1 Like

Since scipy was mentioned quite often, there is GitHub - AtsushiSakai/SciPy.jl: Julia interface for SciPy which technically makes Julia as good as Python when using scipy :troll:

As for courses, creating a curated list of packages useful for a course, in Project.toml and Manifest.toml files, is not difficult. If you are teaching the course and you choose Julia as the programming language to recommend, the burden of finding which tools to use in Julia lies mostly on you as the instructor, not on the students. So if students are spending too much time finding the right package, I think a cheat sheet of useful packages might be a good addition to the course material. Perhaps some of the course preparation time can also go into improving the documentation of the packages to be used in the course, or adding more examples and tutorials.
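For concreteness, a course environment could be pinned with a Project.toml roughly like this (the course name and compat bounds are made up for illustration, and the UUIDs should be checked against the General registry):

```toml
# Project.toml — hypothetical curated course environment
name = "DataScience101"

[deps]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"

[compat]
CSV = "0.10"
DataFrames = "1"
Plots = "1"
julia = "1.6"
```

Students then only need to `] activate .` and `instantiate` in the course directory to get the exact versions the instructor tested (the Manifest.toml records the full resolved dependency tree).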

As for finding good packages, I think test coverage and documentation coverage are two things to look for. Maybe we need a tool for quantifying how much of the exported API is covered in the docs.
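As a very rough illustration of that doc-coverage idea, here is a sketch for a Python module (a Julia version would walk `names(M)` and check `Docs.doc` instead; the function name here is my own invention):

```python
# Hypothetical doc-coverage metric: what fraction of a module's public
# functions and classes carry a docstring?
import inspect
import json  # an arbitrary stdlib module to measure

def doc_coverage(mod):
    """Return the fraction of public functions/classes with a docstring."""
    public = [getattr(mod, n) for n in dir(mod) if not n.startswith("_")]
    api = [o for o in public if inspect.isfunction(o) or inspect.isclass(o)]
    documented = [o for o in api if inspect.getdoc(o)]
    return len(documented) / len(api) if api else 1.0
```

A real tool would need to handle re-exports, macros/methods, and docs pages that live outside docstrings, but the ratio itself is cheap to compute.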

7 Likes

In general, the problem of identifying if a package is good is similar to the problem of identifying if a research paper is good. If a paper (package) has many citations (stars), is it necessarily good? If a paper (package) has authors from a big university, is it necessarily good? If a paper (package) is in a reputable journal (organisation), is it necessarily good?

On the other hand, if a paper (package) is on arxiv (a personal GitHub account), the author is not from a big university, and it doesn’t have many citations (stars), is it a bad paper (package)? I think most of us would agree that this would be an unfair evaluation in general. But at least we know that this paper (package) went through a less rigorous review process to get to its current state. People’s trust is a funny thing, and ultimately non-autonomous students will probably trust their professor more than any metric. So if a professor recommends a random paper on arxiv, they will regard it highly and trust it. If the professor recommends a random package on GitHub, they will do the same. Autonomous students will read the paper (the package’s src, tests, and documentation) and decide for themselves whether this paper (package) is good for them and what needs improvement.

10 Likes

The context was packages shown on a website. I don’t think removing anything from those views will affect Pkg.

1 Like

Is Julia Packages being phased out? It seems a lot of the stats are outdated, e.g. the “Updated last” field is typically incorrect. For example, the website says that Flux was updated 1 year ago, but on GitHub the latest release tag is from 9 days ago. Also, I think the stars are outdated, e.g. the website says that Pluto has 2874 stars, but on GitHub it says 3.9K stars. However, JuliaHub has the correct stats. What is the difference between these two sites anyway? They seem very similar, but both are linked from the Julia language website landing page.

Other than that, my 2 cents:

  1. I think the Julia ecosystem requires a different mindset than, e.g. Python’s numpy/pandas/scipy/matplotlib batteries-included-in-a-monolithic-package approach. The idea that you import only what you need is really quite cool, but for people coming from these other environments it is different enough to be challenging. Currently, I think the Julia language website already points people to the “big” packages in the Ecosystem portion of the landing page. I think this is actually enough to get one started, but maybe this new mindset should be emphasized there?
  2. For teaching students at university, I typically do give the students a cheat sheet, and have found that it works. On a side note: I also found it is super important to explain carefully to students the advantages and disadvantages of JIT compilation - newcomers or people used to Python tend to be skeptical when the load times are long, but if you explain why and how to mitigate it, things are smoother.
  3. Maybe it is worth it to be even more forceful with the messaging around exposing newcomers to the community. I really think the Slack or Zulip channels are great media for people to engage and ask quick questions like “what package should I use for xyz” and get immediate feedback. Also, since most newcomers are in academia (fact check?), I think it is a great selling point because you can meet people and see cutting edge ideas being discussed in real time. Somehow the initial shyness just needs to be overcome.

Someone else mentioned the idea that anyone can develop a package in Julia and have it compete head-to-head with any of the big packages with 1000s of developers in other languages, because the two-language problem is absent in Julia. I think this is such an amazing trait and deserves to be emphasized to justify point (1) above even more!

12 Likes

I am sorry for that. I will be more careful next time exposing my opinion.

Have you considered that maybe your Google search results are already fine-tuned to your profile? And that other students may be typing “Julia” for the first time in their search boxes?

Regarding this specific example with CSV files… I had students who preferred to follow the official Julia documentation and attempted DelimitedFiles instead. They had issues with the stdlib and thought that they were doing something wrong. After all, they thought: “this is a stdlib, I must be doing something wrong…”.

Thinking more broadly, I do believe that many searches on Google will point to outdated packages that are not touched in years. JuliaHub or any other community-driven platform could introduce metrics about the “healthy state” of a package, or at least introduce links to similar efforts to help make the redundancy explicit.

IMHO, as a leader of this community you could approach the discussion differently by asking more questions in the thread to guide the discussion towards improvements. When you try to find evidence that goes against the issues raised, you are taking a position of disbelief, which then triggers a whole set of arguments that are not productive.

Start with the fact that multiple professors at different universities are sharing the same point of view. You may disagree with this point of view, but there is certainly an issue somewhere to be addressed. Now, what questions can you ask to identify the core issues? Trying to prove the opposite is not the best leading strategy.

Thank you for considering the proposal. I will try to implement the initial set of links using “Deep NLP” like I did in this paper: https://arxiv.org/pdf/1712.01476.pdf

The idea is to scan all the README (+ docs/src) files and represent packages as bags of words. An unsupervised algorithm then learns embeddings for these words, as in the image below:

(image: visualization of the learned word embeddings)

In this example, the algorithm figures out certain classes of words like “integers”, “issues”, “operations”. This information can then be used to assign a “subject” to each package or a “similarity of subject”.

I implemented the paper using Python a long time ago, but I will try to find the time in the following months to implement it in Julia using one of the many neural net frameworks available. Alternatively, I am happy to mentor students interested in learning these methods.
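To make the pipeline concrete, here is a toy Python sketch of the first stage described above: represent each package by word counts from its README, then compare packages by cosine similarity. (The actual proposal uses learned embeddings rather than raw counts; the package names and README strings below are invented.)

```python
# Bag-of-words package similarity, in its simplest form.
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, tokenize, and count the words in a README-like string."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

readmes = {
    "PkgA": "fast CSV reading and writing of tabular data",
    "PkgB": "read and write CSV files with tabular data support",
    "PkgC": "interpolation of gridded data on uniform grids",
}

# Pairwise similarities; the two CSV-related packages should score
# higher against each other than against the interpolation package.
sims = {(p, q): cosine_similarity(bag_of_words(readmes[p]),
                                  bag_of_words(readmes[q]))
        for p in readmes for q in readmes if p < q}
```

An embedding model improves on this by also linking packages whose READMEs use different words for the same subject.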

In the meantime, nothing blocks the development of a textbox menu in JuliaHub so that users could manually insert links in the similarity graph.

13 Likes

Yes, of course. The results are the same when logged out of Google. It’s easy for anyone to try the exact same thing.

Regarding this specific example with CSV files… I had students who preferred to follow the official Julia documentation and attempted DelimitedFiles instead. They had issues with the stdlib and thought that they were doing something wrong. After all, they thought: “this is a stdlib, I must be doing something wrong…”.

DelimitedFiles should absolutely be removed as a stdlib… and it is: remove DelimitedFiles from being an stdlib by KristofferC · Pull Request #45540 · JuliaLang/julia · GitHub. It should also have some notes added to it indicating that CSV should generally be preferred. If someone wants to take something straightforward and actionable from this, adding those notes is a good action item. Although DelimitedFiles is occasionally good for simple files representing numerical matrices and such.

JuliaHub or any other community-driven platform could introduce metrics about the “healthy state” of a package, or at least introduce links to similar efforts to help make the redundancy explicit.

JuliaHub includes both GitHub star count and recent download counts as indicators of activeness. If you or anyone has ideas for better indicators, please feel free to propose them.

At a higher level, there’s a little disconnect in expectation here: I think it is neither the responsibility of nor an appropriate role for a third party commercial platform like JuliaHub to decide which packages people should prefer in the open source ecosystem. That is one of the reasons I’m advocating for people to create and establish open source, community driven graphs/metrics/whatever. If the community agrees that these are useful, then it’s fine for those to be included on JuliaHub and JuliaPackages and wherever else, but it’s just not a company’s place to dictate what packages “win” in the open source ecosystem.

IMHO, as a leader of this community you could approach the discussion differently by asking more questions in the thread to guide the discussion towards improvements. When you try to find evidence that goes against the issues raised, you are taking a position of disbelief, which then triggers a whole set of arguments that are not productive.

I’m trying to determine what concrete steps can be taken to improve things. Broad, generalized complaints don’t help improve anything. Since the general complaint here is that basic data science packages are hard to find, I tried what a new user would do when trying to figure out how to do things like reading a CSV file or plotting something. It’s unfortunate that you interpret this as me trying to “find evidence that goes against the issues raised” or “taking a position of disbelief”. I’m trying to find specific things that can be improved. My specific findings are:

  • If a new user is trying to figure out how to load a CSV file, they are quite likely to learn the best way to do it by googling.
  • If they limit themselves to stdlibs, they may be misled into trying DelimitedFiles, which does work for some kinds of data, but not for general heterogeneous tabular data. So it’s a good thing we’re removing it as a stdlib, and notes should be added to its documentation referencing CSV as the generally preferable alternative.
  • If a new user is trying to figure out how to plot something, they are likely to find the official Plots tutorial, which is both a good package choice and a working, current tutorial.
  • Interpolations apparently doesn’t cover some basic use cases and may be a pain point (again, unclear to me as I’ve never really used this functionality). There is already prominent linkage to other interpolation packages, but perhaps that could be improved by someone who knows about this. It may also be possible to add whatever features new users are likely to need to the Interpolations package.

There have been some responses along the lines of “but if I’m using Python then everything I need is in scipy/pandas/matplotlib”. The thing is, I’m not sure what to do with that. Sure, that may be true, but this isn’t Python. (And sometimes it seems that the pandas CSV reader isn’t even the one you should use—you should be using PyArrow instead; this is according to the author of pandas, among others.) Julia doesn’t do monolithic superpackages like scipy. Some people may think that it should, but I’m not one of them. So it’s just unclear what the actionable aspect of this observation is, aside from trying to make sure that the “google what I want to do” approach works as well as it can.

Start with the fact that multiple professors at different universities are sharing the same point of view. You may disagree with this point of view, but there is certainly an issue somewhere to be addressed. Now, what questions can you ask to identify the core issues? Trying to prove the opposite is not the best leading strategy.

Yes, that’s why I’m here asking exactly those questions.

I will try to implement the initial set of links using “Deep NLP” like I did in this paper: https://arxiv.org/pdf/1712.01476.pdf

Awesome. I look forward to that.

In the meantime, nothing blocks the development of a textbox menu in JuliaHub so that users could manually insert links in the similarity graph.

This is actually the part that I think is a massive product effort and have no appetite for building. Sounds simple… “just a text box”. But any time you have humans giving input, it gets complicated. The text box has a lot of options, so it will need search-filter functionality. I’m sure there’s some fancy JavaScript widget someone has built that already does that well, but someone needs to research that and integrate it and make sure that it works nicely on the site. And any time there’s human input, there’s danger of gaming and spam, and then you need systems to detect and deal with that. People will be proposing edits to a community knowledge base, and as soon as you have that, you need a system to let people review and approve or reject those proposed edits. That means you need a notion of “community admins” who are allowed to review things. And moderation tools like allowing community admins to block people who keep proposing bad edits. And logging of all actions taken by admins. And features for platform admins to manage the community admins. It sounds simple, but it’s really a whole can of worms from a product perspective. It’s all doable, but it’s… complex. (And all that for a feature with zero commercial demand.)

It seems much more plausible to me if the “related/alternative packages” graph is maintained externally as a community resource. For example, it could be a TOML file that’s maintained on GitHub. Then changes would be proposed and reviewed using the same tooling that we already use for code and registries. Or even simpler: people propose edits to READMEs linking to other related packages.
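Such a file could be quite small; a sketch of what it might look like (the schema and notes here are entirely hypothetical, not an existing format):

```toml
# related-packages.toml — hypothetical community-maintained graph,
# edited via ordinary pull requests.

[CSV]
related = ["DelimitedFiles", "DataFrames", "Arrow"]
notes = "Preferred for general heterogeneous tabular data"

[DelimitedFiles]
related = ["CSV"]
notes = "Good for simple numeric matrices read into plain arrays"
```

Review of changes would then reuse the same PR workflow the General registry already relies on.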

16 Likes

From the constructive side: at Introduction · DataFrames.jl we try to maintain a community-driven list of relevant packages for DataFrames.jl users (as this package is likely to be at least quickly looked at by any new Julia user who wants to do data related work in Julia).

The intention is for this list to include maintained and mature packages (of course this is subjective, as the whole discussion in this thread shows).

Therefore:

  • if someone reading this feels something should be added to this list please open a PR/issue.
  • similarly if you find some information provided there outdated please open a PR/issue.

Every such PR/issue will go through a standard review process by at least two members of the JuliaData organization.

15 Likes

I’ve always found looking at the dependents of a package (i.e., the packages using it) to be interesting.
This is already maintained by JuliaHub. Perhaps a summary could be promoted to the search results page. Of course, this is mostly relevant for “middleware” packages.

1 Like

It isn’t immediately clear that JuliaHub is a “third party commercial platform”, as on the Julia homepage under Packages there’s a link “over 7,400 Julia packages” which directly takes you to the package search page of JuliaHub. There is no mention of anything remotely commercial on that page, other than the “Powered by Julia Computing” in the footer, which is all the way down and easy to miss.

And if you happen to end up on JuliaHub, it reads “JuliaHub is the entry point for all things Julia: explore the ecosystem, contribute packages, and easily run code in the cloud on big machines and on-demand clusters.”, again giving no clue that this is a commercial service.

This really blurs the lines between community-driven and commercial, and so it’s not really surprising that there’s a disconnect in expectations.

12 Likes

I agree that the distinction should be clarified. We’re in the middle of a redesign/revamp of JuliaHub as a landing page and will make sure it’s clearer what the role is.

8 Likes

Hmm, well just to give a countervailing anecdote, I’ve had students who want to read a super-simple csv into a plain Julia array, they google and hit CSV.jl, and then run into trouble because that seems to be roping them into the whole Tables.jl / DataFrames.jl ecosystem and they just want a darn Array. Whereas DelimitedFiles does exactly what they wanted.

(Nothing against CSV.jl at all, it’s a great package, but just in the context of DelimitedFiles getting maybe un-stdlib’d)

6 Likes

I think that might be a bit overstated. I wrote something for someone in Python a few weeks ago. They sent me a csv file. The first thing I did was google “python csv reader”, and I found the builtin thing. I don’t see how that makes me an advanced user… Maybe because I knew to type “csv”. So, then I googled “read file in python”. That gets a mix of stuff. A few results mention pandas. One mentions Excel. Some cover any text file. A few include csv and the builtin package. I just don’t see this giant funnel showing me the anointed way.
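For reference, what that search turns up is Python’s built-in `csv` module; a minimal self-contained version (using an in-memory string in place of a real file):

```python
# Reading a small CSV with the stdlib csv module.
import csv
import io

data = io.StringIO("name,score\nalice,3\nbob,5\n")
reader = csv.DictReader(data)  # the first row is treated as the header
rows = list(reader)
# rows == [{"name": "alice", "score": "3"}, {"name": "bob", "score": "5"}]
```

Note that all values come back as strings; anything numeric has to be converted by hand, which is usually the point at which people reach for pandas instead.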

1 Like

To anyone considering a reply to this issue, please consider starting a separate thread to discuss the un’stdlibfication of DelimitedFiles :pray:

This present thread has a more general topic of discussion.

3 Likes

Definitely. I have a distinct memory, from when I was first learning, of using a combination of f.eachline() and line.split(',') to parse a CSV… It took me forever to figure out how to treat the header line differently…