How to know if a package is good?

Regarding official recommendations about packages, https://julialang.org/ mentions packages in its “Ecosystem” section in a way that points people to well-established packages without taking an exclusionary tone. In my opinion, fragmentation is not a big deal in Julia. As a professor, I try to provide working examples to students that also point them to packages that work well (which is not to say that alternatives might not also work well). Blogs, examples, and answers to questions on Discourse and other platforms seem to work pretty well to orient people, I think. The user community is helpful and friendly, and someone always seems to be able to answer informatively.

Probably, many of us can be more active in raising issues and making PRs. That’s what I’m planning to do after having read this thread over the last week or so.

3 Likes

I agree, but go-to packages are important: they create community. We are not enforcing them, we are just suggesting them to end-users. Any experienced user is free to pick a more experimental package or a package that meets their advanced needs regardless of how well-maintained it is.

Examples:

  • JuMP.jl is a go-to package. See the size of its community, the extensive documentation, and all related benefits. For an end-user this is super helpful. They will find more resources, more people to ask questions, etc.

  • Turing.jl is a go-to package. Are there some cool alternatives? Yes. Should we start recommending the cool advanced alternatives to end-users? No.

  • DifferentialEquations.jl is a go-to package. You can try to work with other packages for differential equations if you feel that your research requires something special, but end-users will be much more productive using a well-maintained suite that has tons of contributors instead of experimental packages maintained by a few.

  • Etc.

11 Likes

Just to add to this discussion. Googling “julia csv” gives me

  1. Link to the CSV.jl docs
  2. Link to the CSV.jl GitHub repo
  3. Blog tutorial on using CSV.jl with dataframes.
  4. Another blog tutorial on using CSV.jl with dataframes.

Of course my results could be personalized by Google, but it does seem like Google has made a decision on the “standard” CSV package in Julia.

I would think that what might be worth putting attention towards is cases where the top results are very outdated. Understanding why those are being recommended and what can be done to get the “standard” packages ranked higher would probably be more useful to new users. I’d have to imagine that many/most new users do their searching via Google and not via searches on Discourse or Stack Overflow.

4 Likes

I do not disagree, I disagree on where that information should be put, and by whom (or, better, representing whom).

2 Likes

Who are ‘we’ in this scenario? Of course, any individual is free to recommend JuMP or Turing as their go-to package. But it seems you are advocating that this should be made somehow ‘official’. That’s a different kettle of fish.

2 Likes

Well it’s one of the fastest, not one of the quickest :joy:, but yeah it’s a pretty annoying delay.

1 Like

In principle yes, but depends on the discipline, the students, and what types of bushes they have to push.

100%. I want economics students spending almost all their time on learning economics. The coding is the tool where we want them to struggle a little to learn (especially if the struggle is related to the economics), but it is not the primary goal.

The other thing that experienced coders who spend all day on computers forget is that when things go wrong, students have no idea whether it is because (1) they are using the language wrong; (2) they are using the package wrong; (3) the package itself has bugs; or (4) a particular combination of package versions has bugs because the [compat] entries of both were imperfect. It takes a lot of experience to be able to triage those, and matlab/fortran (and, largely, python in practice) don’t have this issue to the same extent.

Sure, but CSV.jl was also broken for weeks with a precompile bug earlier this year. It wasn’t even really practical to pin the version to an old one because the dependency graph of CSV overlapped with other key dependencies and held them up. I ended up getting so frustrated that I changed all my code to use DelimitedFiles instead, but I can’t imagine a student figuring that out on their own.

As always, I am not blaming individual developers here, since I know how hard they are working without pay, but either something is holding the community back from coordinating on which essential packages must keep working flawlessly, or the language itself makes fixing bugs especially difficult (e.g., maybe dependencies are especially fragile). That said, things are much less fragile than they used to be.

And while CSV is a somewhat obvious one to google, try something like splines or interpolation in julia. Interpolations.jl seems like a good first choice via google, but it doesn’t support irregular grids for cubic splines, doesn’t support AD, and may not have consistent benchmarking to ensure it remains high performance. After that, there are a huge number of other interpolation packages that users are forced to choose between. And they have no idea which ones are maintained, which ones were really just a side project for someone learning julia, which ones were written for no good reason because alternatives already existed, etc.

This doesn’t happen in python because you just go to scipy.interpolate (the “Interpolation (scipy.interpolate)” page of the SciPy v1.8.1 Manual) as your default. Since having a monolithic package just doesn’t work for julia, I tend to agree with Chris that the best solution is the SciML approach, where you have a consistent wrapper (which can be more easily coordinated on) that then calls out to other algorithms, e.g. SciML/Optimization.jl (“Local, global, and beyond optimization for scientific machine learning (SciML)”) and others. But for this to work, people need to contribute to and maintain those wrappers, since the SciML developers can’t do it all on their own.

11 Likes

And thus google makes it hard to find DelimitedFiles.jl, which is probably what many people googling this on day 1 actually need.

I made a pull request, “Add links to alternatives to the readme” (by mcabbott, Pull Request #1006 on JuliaData/CSV.jl), to add some links to CSV.jl. Better suggestions from people who actually know this part of the ecosystem would be helpful.

6 Likes

And this is exactly the sort of information I want to see on a blog with a very visible Package Ecosystem Reviews section. I will disagree on one thing: while I think it’s important to highlight the go-to tools, there are also benefits to introducing the “lesser” alternatives to users:

  1. Sometimes the now-lesser alternative is more actively maintained and developed so it will very soon become the go-to. Users would want to have a heads-up, even try it out and give feedback to help development.

  2. Sometimes the alternatives are more usable in some situations e.g. a plotting package with very low time-to-first-plot but relatively fewer features.

  3. Focusing only on the go-tos makes it easier to overlook drawbacks. Pros and cons of the go-tos and the alternatives are good information for users and good ideas for developers, no matter which packages they choose.

Of course, it’d be good to vet for the most actively maintained packages and dedicated, communicative developers; there are only so many packages a blog post should summarize. If the most prominent alternative is AbandonedDabble.jl, I would be in full favor of the blog post only discussing the go-to.

5 Likes

But these are not orthogonal, and there is no need to choose one.

(1) is essential for a thriving, innovative ecosystem.

In an educational environment, (2) is IMO the job of the professor. You already select the textbook, the set of problems to solve in assignments and exams, even the language students should use (Julia). Why not provide guidelines on packages? Just tailor the guidelines to your expectations, style, and students’ previous experience.

11 Likes

My point is that students are struggling with problems that they wouldn’t have in other languages. We can close our eyes and pretend that Julia is doing fine on that aspect or try to improve. I am raising a real issue I experienced (multiple times now) in courses where Julia was adopted.

If I want to teach subject X and I choose the language Y, then students should spend a fraction fx of their time on X and a fraction fy on Y. Whenever Y == Julia I observe fy > fx, and whenever Y != Julia I observe the expected fx > fy.

9 Likes

On what subject? Perhaps a tutorial focused on the packages you are using is missing?

Perhaps the students know “the other language” better? (I cannot possibly believe that the sentence applies to every other language.) Having some experience in Julia and Fortran, I take more time in other languages to do anything, but that says more about me than about the languages.

3 Likes

I do think that the multiplicity of choices can be an issue. Some contexts in which I’ve seen this happen:

  • when beginners (in Julia) do not know (yet) the ecosystem => they try to find an easy-to-learn, easy-to-use, least-surprise package which will suit their needs now
  • when professionals need to use a 3rd party library in some part of a bigger tool they intend to use and maintain over the course of several years => they want to pick the most stable, well-maintained one

These are two very different needs, at (I think) two ends of a wide spectrum. I’m not sure there could always be a unique “community choice” to recommend, but JuliaHub does help. And maybe there could be some way to get a better, more comprehensive list of alternatives when looking for packages doing a given task.

As for the specific issue of ensuring that, in an educational environment, students spend more time learning the course topic, and not too much time struggling with the language, I tend to agree with @mbaz’s view.

I don’t have any experience using Python for teaching. But I have taught applied math courses with assignments in Matlab, C++ and Julia. My experience is that:

  • with Matlab, I have to ensure that everything needed for the assignment is available in a plain Matlab setup. Simply making sure Matlab is properly set up on every student’s system can in itself be quite challenging; if using a 3rd party tool is really necessary, I give it to them alongside the assignment
  • with C++, I have to give the students a short list of libraries they can use (I’m thinking of things like Eigen or Boost). Otherwise, they tend to re-invent the wheel by themselves
  • with Julia, the situation is more or less the same: I tend to orient the students to a set of well-tested packages that I know can be used for the task at hand. The difference is that if they want to try something else, they can (and they often do!)
10 Likes

I’m a little confused about this issue. Figuring out which of several possible packages one should use for a task is a problem that I have encountered in every programming language I’ve ever used. I’m unclear why or how this problem is worse in Julia than in other languages. If anything there are fewer options to cull through, which makes it easier to find the package to use.

13 Likes

The first step is for us to agree that it is different, before diagnosing what can be done about it.

As someone who works with many different languages, I can tell you that it is far worse for basic packages and functionality and probably similar for fancier stuff. The places where people are burned are not where they need to shop around for a package, but where other languages have a single mono-package that just works and covers 90% of the needs.

  • In matlab you have a large number of built-in functions that work and cover almost all of your needs. If matlab’s stuff doesn’t cover a particular feature, you write it yourself or purchase a commercial product (e.g. an optimizer), which has very few bugs and excellent testing.
  • In python you choose scipy/pandas/matplotlib (and in many cases, just the whole of pytorch as they get good coverage) and it all just works with very few bugs. For AD you only really choose between pytorch and tensorflow (or, if you are really advanced, JAX).
  • For fortran you either write everything yourself without packages, or you pay for something like NAG which works great and has few bugs.

So counting the number of packages in each language, and the fact that most python packages are garbage, is a red herring. If you want to do something a little crazy, you will have to shop around for good packages in both languages. But if you are doing really basic stuff, with python/matlab there is no shopping for packages because there is a monolithic baseline. No thinking is required, and it works and covers most of your basic needs. You never have things like the premier CSV package just not working for weeks.

Julia is not a monolithic package sort of language, but it could have monolithic packages for users to access (e.g. either in wrappers like SciML does, or just making sure that there is a single interpolation package which everyone uses as the baseline). In the meantime you would have to give them a decision tree (e.g., if you want to do cubic splines with a regular grid or linear interpolation then use package A; if you want to do cubic splines with an irregular grid, use package B; if you want to solve a linear system of equations and want preconditioner type A with an iterative method then use package C; if you want a different preconditioner type B then use package D; etc.).
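To make the idea concrete, here is a minimal sketch of such a decision tree written down as a lookup function. Python is used purely for illustration; `recommend_package` and its parameters are hypothetical, and "package A" through "package D" are the placeholders from the paragraph above, not real package names.

```python
def recommend_package(task: str, *, grid: str = "regular", preconditioner: str = "A") -> str:
    """Walk a hand-written decision tree and return a placeholder package label."""
    if task == "linear interpolation":
        return "package A"
    if task == "cubic splines":
        if grid not in ("regular", "irregular"):
            raise ValueError(f"unknown grid type {grid!r}")
        # Regular grids share a package with linear interpolation.
        return "package A" if grid == "regular" else "package B"
    if task == "iterative linear solve":
        if preconditioner not in ("A", "B"):
            raise ValueError(f"unknown preconditioner type {preconditioner!r}")
        return "package C" if preconditioner == "A" else "package D"
    # Anything the author of the tree did not anticipate ends up here.
    raise ValueError(f"no recommendation recorded for task {task!r}")
```

For example, `recommend_package("cubic splines", grid="irregular")` returns `"package B"`. Someone still has to keep a table like this current as packages change.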

But let’s say that existed and could be navigated. The broader issue here is that for people who don’t code for a living (and especially students), picking packages is just the tip of the iceberg unless the code quality and integration testing of the “baseline” options is superb.

Otherwise, the real problem is triage if things go wrong. Did the student use the language wrong? Did they use the package wrong? Does the package have a bug? Does the package have particular incompatibilities with other packages in the dependency graph, because [compat] entries are tough to write, so that it would work fine if they downgraded some dependency manually? Has the package become more incompatible with its dependencies since the suggested decision tree was written? Etc. It is hard enough for Julia experts to navigate that triage, and it is impossible for students just learning the language.

15 Likes

In Julia it is hard to find the alternatives; they are hidden very deep in popular search engines. If you have been in the community for years, you know what works and is well-maintained; if you are just joining today, you have a high chance of hitting a poorly maintained package.

Action items that could improve the situation:

  1. Recommend JuliaHub to every newcomer as the default place to search for packages.
  2. Provide a feature in JuliaHub that shows links to similar packages in a graph-like structure.
  3. Provide a feature in JuliaHub where common criteria such as number of stars, number of maintainers, last date of commit, etc. are shown as visuals to help beginners understand the “healthy state” of a package.
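
As a rough sketch of item 3, the criteria could be reduced to a few flags that a package page then turns into visuals. This is Python for illustration only and not anything JuliaHub provides; `PackageStats`, `health_summary`, the one-year staleness threshold, and the example numbers are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PackageStats:
    name: str
    stars: int
    maintainers: int
    last_commit: date


def health_summary(stats: PackageStats, today: date, stale_after_days: int = 365) -> dict:
    """Condense raw repository numbers into beginner-readable flags."""
    days_since_commit = (today - stats.last_commit).days
    return {
        "name": stats.name,
        "stars": stats.stars,
        "maintainers": stats.maintainers,
        "days_since_last_commit": days_since_commit,
        "single_maintainer": stats.maintainers <= 1,
        # Assumed threshold; a quiet repository can also mean a finished, stable package.
        "possibly_stale": days_since_commit > stale_after_days,
    }


# Hypothetical package with made-up numbers.
example = PackageStats("ExamplePkg", stars=120, maintainers=1, last_commit=date(2021, 1, 1))
summary = health_summary(example, today=date(2022, 6, 1))
```

The flags are deliberately kept separate rather than merged into one score, so a reader can see which criterion raised the concern.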

Regarding the graph-like structure, a user reading the page of CSV.jl should be alerted of the alternatives and the community could suggest new links if they are missing.

8 Likes

Kind of like what it is now?

(minus point 2, which isn’t easily done without lots of people coming up with these links, and the last commit date, which can be a bit of a red herring for well established packages)

2 Likes

Regarding the “similar packages” graph structure, it should be easy to build an initial version that just scans the README files and docs and tries to build edges of similarity.
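
A minimal sketch of what such a first version might look like, in Python for illustration: extract keywords from each README and connect packages whose keyword sets overlap enough. The function names, the stopword list, the 0.2 threshold, and the toy README strings are assumptions; `PkgA`, `PkgB`, and `PkgC` are hypothetical packages.

```python
import re
from itertools import combinations

STOPWORDS = {"and", "the", "for", "with", "into", "from", "that", "this"}


def keywords(readme: str) -> set[str]:
    """Lowercase words of three or more letters, minus a few stopwords."""
    return {w for w in re.findall(r"[a-z]{3,}", readme.lower()) if w not in STOPWORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(readmes: dict[str, str], threshold: float = 0.2) -> list[tuple[str, str, float]]:
    """Return (package, package, score) edges for every pair at or above the threshold."""
    words = {name: keywords(text) for name, text in readmes.items()}
    edges = []
    for a, b in combinations(sorted(words), 2):
        score = jaccard(words[a], words[b])
        if score >= threshold:
            edges.append((a, b, score))
    return edges


# Toy README texts for hypothetical packages.
READMES = {
    "PkgA": "Fast reading and writing of delimited text files and tables.",
    "PkgB": "Reading delimited text files into tables, with type inference.",
    "PkgC": "Differential equation solvers for stiff systems.",
}
edges = similarity_edges(READMES)
```

With these toy inputs only `PkgA` and `PkgB` end up connected. Real READMEs are noisier, so the threshold and keyword extraction would need tuning, and community-suggested links could be merged in as extra edges.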

Back in the old days of Julia I created this interactive D3 visualization of packages using LightGraphs.jl (now Graphs.jl), GitHub.jl, etc.:

It would be easy to adapt and show a similar DAG for each package page.

I am happy to provide any script that can help with the task.

7 Likes

Regarding point 2, it shouldn’t be too hard to scrape publicly-visible Project.toml files (excluding those in forks) for some sort of similarity search algorithm.

I just googled “python csv reading”. The top result is for Python’s built-in csv package, but that’s probably not what I want since it returns each row as a list of strings. It also doesn’t mention pandas or data frames anywhere. No hint that there might be something else I should consider using. The second result is a non-official blog post that does cover both the built-in csv package and using pandas to parse a csv file into a data frame. Most of the links on the first page of results do mention both methods. When we did a CSV parsing comparison between Julia, Python and R, the Python folks said “Well, of course the pandas CSV reader isn’t fast, you should be using PyArrow for fast CSV parsing”. There is no mention of PyArrow in any of the first page of Google results on python csv reading.
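
For reference, this is the behaviour of the built-in module being described: `csv.reader` yields each row as a list of strings, so any numeric conversion is left to the caller. The sample data below is made up.

```python
import csv
import io

# Made-up sample data standing in for a file on disk.
raw = "name,age,score\nAda,36,9.5\nLin,41,8.0\n"

rows = list(csv.reader(io.StringIO(raw)))
header, *records = rows

# Every field comes back as str, including the numeric-looking ones.
all_strings = all(isinstance(field, str) for row in rows for field in row)

# Type conversion has to be done by hand, column by column.
ages = [int(record[1]) for record in records]
scores = [float(record[2]) for record in records]
```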

When I google “julia csv reading” the top result is this article, which demonstrates how to use CSV and DataFrames to load a CSV file. I didn’t make it all the way through, but I did make it up to reading dates and booleans encoded with strings and everything worked. The only hiccup was that the tutorial didn’t explicitly tell you to load the StringEncodings package, but googling julia enc"windows-1250" got me to that package. (I’m not cheating here, I did not know a priori where enc"windows-1250" came from.) Overall, it’s an excellent tutorial, I have to say—thorough and correct. The remaining google results for CSV reading in Julia are all results for the CSV package. There are no results for any other ways to read a CSV in Julia besides the “official” CSV.jl package.

All of this is not to argue that we cannot try to do better—being “no worse” than other languages has never been good enough for us—but my point is I’m still not seeing an issue here that seems worse than other languages.

9 Likes