How to know if a package is good?

juliohm · June 1, 2022, 11:41am

For end-users, could JuliaHub provide links to similar packages whenever someone is reading a specific package page? Or maybe a feature where members could suggest links manually for review?

For example:

If, as an end-user, I find InMemoryDatasets.jl in JuliaHub, I would like to be alerted that DataFrames.jl exists. Similarly, if I come across DLMReader.jl I would like to be alerted that CSV.jl exists.

To provide context:

A student asked me yesterday “professor, how do I know if a Julia package is good?”. She had installed a package that was not well-maintained, was having issues with basic examples and didn’t know if that package was the go-to package to solve the problem (guess what, it was not). I feel really bad when that happens, and wish students didn’t have to waste their time trying to sort out these things by themselves.

sylvaticus · June 1, 2022, 12:58pm

I think you could just teach students what GitHub stars and forks are, and how to look at the contributors’ page and commit history…

juliohm · June 1, 2022, 1:03pm

That is what I did. I don’t think it is enough though.

mihalybaci · June 1, 2022, 1:04pm

When you search JuliaHub for packages, the site already displays GitHub stars, version number, and the number of new users along with a graph (not sure how “new” is defined). So if you search for “CSV”, it’s pretty clear that JuliaData/CSV.jl is probably a better choice to start with than tk3369/CSVReader.jl. Filtered results don’t appear to be sorted automatically by number of stars the way the full list is, so maybe there is something that could be done there to help users find active packages.

juliohm · June 1, 2022, 1:08pm

Yes, we need some way to make the redundancy explicit. Beginners tend to think that CSVReader.jl is a better choice just because of the name. Those coming from MATLAB tend to think that DLMReader.jl is a better choice just because of the name. These users should be aware that CSV.jl is the one most widely used. So links connecting all these packages would help. It is a community detection problem. One can easily cook up an algorithm to cluster these packages based on their README and docs contents, as a first version for review.
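To make the clustering idea above concrete, here is a minimal sketch. It is not a real community detection algorithm, just pairwise word overlap (Jaccard similarity) between README texts. The README snippets, the `similar_packages` name, and the `0.3` threshold are all made up for illustration; a real version would fetch the actual READMEs and would need better text processing.

```julia
# Hypothetical README snippets; in practice these would be fetched from each repository.
readmes = Dict(
    "CSV.jl"       => "fast flexible reader for delimited text files csv",
    "CSVReader.jl" => "simple reader for csv files",
    "DLMReader.jl" => "reader for delimited text files",
    "Plots.jl"     => "plotting and data visualization with many backends",
)

# Lowercased set of whitespace-separated words in a text.
tokens(text) = Set(split(lowercase(text)))

# Jaccard similarity: number of shared words divided by number of distinct words.
jaccard(a, b) = length(intersect(a, b)) / length(union(a, b))

# For each package, list the other packages whose README overlap is at least `threshold`.
# The threshold value is an arbitrary assumption.
function similar_packages(readmes; threshold=0.3)
    toks = Dict(name => tokens(text) for (name, text) in readmes)
    links = Dict{String,Vector{String}}()
    for a in keys(toks)
        links[a] = sort(String[b for b in keys(toks)
                               if b != a && jaccard(toks[a], toks[b]) >= threshold])
    end
    return links
end

links = similar_packages(readmes)
for (pkg, others) in links
    println(pkg, " => ", others)
end
```

With these toy snippets the three readers end up linked to each other and Plots.jl gets no links, which is the kind of "see also" list that could then go to human review.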

mcabbott · June 1, 2022, 4:09pm

The low-tech solution is just to start making PRs. I wish it was standard for every package’s Readme to have a section linking to other similar packages, and explain a little how they differ. Just a sentence or two. Is this the new complicated high-performance package, or the old simple one? Do they aim at different sub-ecosystems?

CSV.jl should really say how it differs from the packages you mention, and for what purposes you might want to stick with the standard library’s DelimitedFiles.

Linking to comparable MATLAB/Python packages is also probably a good idea. If nothing else, Google will notice that their names appear together.

oheil · June 1, 2022, 4:53pm

We could bring telemetry back into discussion.

alejandromerchan · June 1, 2022, 5:03pm

[quote="juliohm, post:1, topic:82133"]

A student asked me yesterday “professor, how do I know if a Julia package is good?”. She had installed a package that was not well-maintained, was having issues with basic examples and didn’t know if that package was the go-to package to solve the problem (guess what, it was not). I feel really bad when that happens, and wish students didn’t have to waste their time trying to sort out these things by themselves.
[/quote]

I dealt with this problem a lot in R, back in the day. I have no idea what to do about it, and while CRAN does list some suggested packages for different fields, that doesn’t cover every possible scenario. When I moved to Julia, for whatever reason, I felt that the barrier to becoming a developer got lowered. This may be totally subjective, and maybe I just became a braver user. Regardless, I think that Julia opened my eyes to open source, and testing, and GitHub, and other tools, in ways that years in R never did. So, while I agree that the proliferation of packages can be a problem, we can also solve it through education. Look at maintainers, activity on GitHub (or wherever), stars, last commits, and other things that show that a project is active. I like that Julia makes it easy to create a package and facilitates experimentation. That’s much better, in the long term, than becoming trapped in some ecosystem because it’s “blessed”.

juliohm · June 1, 2022, 5:52pm

I agree with everything you said, except for the statement above. I don’t think that we can improve the current duplication of efforts with education: the authors of these packages are well-educated, experienced programmers, they often have PhDs.

Let me emphasize one more time the two types of arguments in this thread:

1. There are arguments in favor of a creative, exciting, experimental, free-to-do-whatever-you-want environment where developers can try out their ideas.
2. There are arguments in favor of an organized, revised, stable, free-of-bugs environment where end-users can safely try the language and learn how to program.

The proposal I wrote above regarding links on JuliaHub and most of my comments in this thread fall in (2). The counter-arguments fall in (1) and we will never converge to a concrete set of actionable items that way. I suggest we keep these two goals in mind and try to improve the ecosystem in a Pareto front.

lmiq · June 1, 2022, 6:37pm

IMHO that problem of “suggesting” packages and explaining possible differences can be easily solved by good blog posts, written by developers or experienced users.

As a community (in formal terms) I don’t think it would be good to have “official” tools for searching selected packages; that works against innovation.

Also, sometimes the most popular alternatives aren’t even the best ones. They keep their popularity through inertia and by being good enough for most cases, or they become good enough through tuning. That is what happens with the monumental number of non-linear solvers and global optimization solvers.

liuyxpp · June 2, 2022, 1:35am

It is a good skill to learn. Don’t spoil your students. Let them grow by pushing them into bushes.

juliohm · June 2, 2022, 1:50am

I have to disagree. Students have a lot on their shoulders already and need to master the actual subject (linear algebra, real analysis, …) in different courses. They shouldn’t waste time with these issues that we are discussing.

Example:

If the homework consists of loading a CSV file and doing linear algebra, they shouldn’t be wasting most of their time trying to figure out why the file is not loading properly.

liuyxpp · June 2, 2022, 2:06am

I don’t think learning the specific knowledge is the most important thing during their studies. The most useful things they will learn at university should be problem-solving skills and knowing how to find resources to actually solve a problem. If you want to solve a problem using a tool, you should know your tool well. The time spent on this process is the most valuable experience, and it can be applied to their future career.

juliohm · June 2, 2022, 2:15am

Correct if the problem-solving skill is related to the subject of the class: linear algebra. I am not teaching a programming class. Any issue with basic CSV loading is a problem specific to Julia that doesn’t pay off in the future of this student.

The actual net result:

Students take the class, learn that Julia has tons of packages that are poorly maintained, and after the class is over, they switch to Python and R where the CSV loading part just works.

I think Yuri’s post is intimately connected to this experience, and I totally understand why he stopped recommending Julia.

liuyxpp · June 2, 2022, 2:23am

Well, if you think the CSV reading part is not critical for your course, please give clear instructions on which package to use. Or better, write a loading function for them so that they do not have to bother with this part anymore. I’d say it is the instructor’s responsibility to determine which part is important for your class and to prepare the non-critical but time-consuming part for your students.
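A minimal sketch of what such a helper could look like, assuming the course files are comma-separated with one header row and purely numeric columns. The name `load_course_data` is made up, and I use the DelimitedFiles library that ships with Julia only to avoid extra dependencies; a CSV.jl-based version would work just as well.

```julia
using DelimitedFiles  # ships with Julia

# Hypothetical course helper: read a comma-separated file with one header row
# and numeric columns into a Float64 matrix plus a vector of column names.
function load_course_data(path::AbstractString)
    data, header = readdlm(path, ',', Float64; header=true)
    return data, vec(String.(header))
end

# Usage with a small temporary file standing in for the homework data.
path = tempname() * ".csv"
write(path, "x,y\n1,2\n3,4\n")
A, colnames = load_course_data(path)
```

Students then get a plain matrix `A` to do linear algebra on, and never have to choose a CSV package for the homework.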

juliohm · June 2, 2022, 2:32am

I could give a clear instruction on the CSV part, but that is not ideal. They should be able to navigate the Julia ecosystem by themselves and easily find out that CSV.jl is the go-to package. Right now they can’t do that because we don’t provide a mechanism that makes this redundancy explicit (CSVReader.jl, DLMReader.jl, DelimitedFiles, …).

Also, we cannot anticipate all the paths the students will take to solve a problem. For instance, they may decide to use a class of polynomials for which a package exists. They may write the algorithm themselves. They may try a different class of polynomials… There is too much room and instructors shouldn’t be pointing to specific packages.

Benny · June 2, 2022, 3:50am

Fully anecdotal (former student), but in an academic setting where software is a tool but not the focus, letting students learn programming with no guidance was a disaster. To be fair, the department admins required us to take introductory programming courses…in a different language from what was expected by the major’s instructors. Only a couple students had prior programming knowledge to handle that language transition smoothly, the rest of us stumbled through desperate spaghetti code and tight deadlines, never given the time to actually learn properly. And why would the instructors give us that time? After all, it wasn’t their job to teach us programming.

I don’t presume the students you had in mind were in my position, but if they’re really stuck at outdated packages, some programming instructor failed (or maybe wasn’t there) to teach these students the basic savvy to look up or ask “best way to load a CSV file?” on stackoverflow or discourse. I don’t think it’s your job to teach them programming, but it’s a lot less effort to document quick tips and tools you already know to be usable for the course’s purposes.

That said, I do think there is merit to having some centralized resource to gauge the package ecosystem. Forum responses at best have waiting periods, which people with deadlines may not be able to afford. Maybe 1 in 100 packages will document comparisons with similar packages, and even fewer will be updated. I agree with @lmiq that language-official suggestions would stifle independent development, but I would love an independent blog of regularly updated reports on the active packages by topic, open to pitches by independent developers. In one place (perhaps this new Forem thing), readers could get a lot of the information they need to make their decisions and perhaps learn how to find more information.

This isn’t a problem unique to Julia, btw. Python’s scientific computing ecosystem has gotten a lot of credit in this thread for being easy to navigate, but in less maintained/used/coordinated ecosystems, I’ve run into library choice confusion, lack of updated comparisons, less-good package inertia.

lmiq · June 2, 2022, 8:28am

Well, that is your go-to package. I never used it, and I use the included delimited files for everything I need. For the developers of other CSV readers, it can be quite frustrating that an “official” alternative is suggested above others.

Making the redundancy explicit is fine: someone can start a wiki page “CSV readers in Julia” and anyone can add whatever alternative he/she wants. But “officially” (in any sense that may have) suggesting one of them as the default go-to package is much more complicated.

This discussion resonates with some other discussions about the quality of the doc pages. I still think that as a community we should have a wiki-style documentation page, which could include the above lists of packages. The point is that this wiki cannot be “official”, because the main developers of Julia or of any package cannot be responsible for the absolute correctness of everything there, and sometimes it is actually the need to be absolutely precise that makes docs hard to follow.

Elrod · June 2, 2022, 8:39am

I’d like a CSV reader that focuses on low latency. CSV.jl advertises itself as one of the fastest, but in my use cases, it is far slower than alternatives as I wait 10+s to load files that R’s read.csv would load instantly.

But this will hopefully be addressed with future improvements in Julia’s precompilation.

mcreel · June 2, 2022, 10:00am

I think that this is already completely taken care of, if you make a system image with the appropriate packages included. Here’s an example:

```
❯ je
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.2 (2022-02-06)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using CSV, DataFrames

julia> @time data=CSV.read("card.csv", DataFrame)
  0.127135 seconds (224.55 k allocations: 14.726 MiB, 9.13% gc time, 87.87% compilation time)
3010×8 DataFrame
  Row │ wage   nearc4  educ   age    black  smsa   south  exper 
      │ Int64  Int64   Int64  Int64  Int64  Int64  Int64  Int64 
──────┼─────────────────────────────────────────────────────────
    1 │   548       0      7     29      1      1      0     16
```
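
For anyone who has not built a system image before, a rough sketch of one way to do it with PackageCompiler.jl follows. The file names `csv_sysimage.so` and `precompile_csv.jl` are placeholders I made up, and this is not necessarily how the session above was set up. CSV and DataFrames must already be installed in the active project.

```julia
using PackageCompiler

# Bake CSV and DataFrames into a custom system image.
# `precompile_csv.jl` is a hypothetical script that exercises the calls you care about,
# e.g. one that runs `CSV.read("card.csv", DataFrame)`, so they get compiled into the image.
create_sysimage([:CSV, :DataFrames];
                sysimage_path="csv_sysimage.so",  # use .dylib on macOS, .dll on Windows
                precompile_execution_file="precompile_csv.jl")
```

Julia is then started with `julia --sysimage csv_sysimage.so`, and the first `CSV.read` call no longer pays most of the compilation cost.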
