How to know if a package is good?

And that is where we get lost in the specifics of the example, @StefanKarpinski. We are not singling out CSV.jl as the problem. It is a general problem with the ecosystem: lots of duplication, and lots of packages developed in parallel because of a lack of transparency regarding similar efforts.

We can do better to (1) help developers identify a common cause and (2) help users find the most stable and maintained option.

5 Likes

Those are articles benchmarking performance. Not basic usage.

The difference is that people only look around for alternatives if things aren’t fast enough, and most of the time it doesn’t matter. If you are just loading something into pandas and learning the language, you don’t even think about it or care. Only advanced users would even need to google csv loading, since every basic tutorial just says import pandas and nobody thinks twice about looking for other packages.

5 Likes

Let me emphasize this in a separate reply :top:

But it is for basic stuff! Try to do interpolation, plotting, loading files, manipulating data (let alone things like neural networks), etc. in matlab or python vs. julia. In the other languages you just don’t have to think or worry about compatibility. You don’t even shop around for packages at all. You import numpy, scipy, pandas, matplotlib and you have almost everything you need. Maybe you import json or something like that, but the “baseline” package is obvious. In matlab you don’t even need to figure out the numpy/scipy/pandas/matplotlib combo because it is built in.

After you get past the basics, then I agree that the languages are not all that different for searching for packages.

4 Likes

Figuring out how to automatically determine which packages are related to each other would certainly be cool and useful. It seems like a moderately challenging data mining/NLP problem though. An initial approach would be to do some clustering on tf-idf between README documents. As someone who used to do this kind of data science to help people find things on marketplaces (Etsy, specifically), I do suspect that this is harder to get right than one might naively guess.
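To make the tf-idf idea concrete, here is a minimal sketch in plain Python. It only computes pairwise cosine similarity between tf-idf vectors rather than doing a full clustering, and everything in it is an assumption for illustration: the package names and README snippets are made up, and a real pipeline would need stop-word handling, stemming, and far more text per package.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only (no stop-word removal here).
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Map each document name to a sparse {term: weight} tf-idf vector."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))  # count each term once per document
    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        vectors[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(name, vectors):
    """Return (other_name, similarity) for the closest other document."""
    scores = [
        (other, cosine(vectors[name], vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return max(scores, key=lambda pair: pair[1])


# Hypothetical README snippets, invented for this example.
readmes = {
    "CsvReaderA": "fast reader and writer for delimited csv files",
    "CsvReaderB": "load and save csv files through a common file interface",
    "PlotLib": "plotting and visualization with many backends",
}
vectors = tfidf_vectors(readmes)
closest, score = most_similar("CsvReaderA", vectors)
print(closest, round(score, 3))
```

On this toy input the two CSV-flavored snippets land closest to each other because they share the terms "csv" and "files", while a word that appears in every README ("and") gets zero weight. That is also a hint at why this is harder than it looks: with real READMEs, shared boilerplate such as installation instructions can dominate the signal.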

7 Likes

Yes, this would just be used as a warmup. The real feature we need in JuliaHub is the ability to update these links as a community. If someone goes to the CSV.jl package on JuliaHub, we could show a dropdown menu for logged-in users to select similar efforts. These would then be reviewed and accepted/rejected.

2 Likes

I think part of the issue here is that in an extremely well-established language like Python, there are large organizations with deep pockets that pour resources into a few high-profile packages in key problem areas. It’s not that there is less duplication (how many re-implementations of numpy are we up to by now?), it’s simply that it is easier to identify a well-resourced solution.

And yes, “Julia is not as well established as Python, does not have as much corporate and institutional support, and does not have as many developers” is a perfectly valid criticism! … but this observation does not lead to much productive discussion.

Are there ways we could improve things like search tools? Yes! Are there areas where it would be useful for people to write more documentation? Definitely!

But “be more like Python” == “get more resources” is not really actionable. (Nor is the monolithic Matlab model of “bundle ‘official’ versions of everything” a productive approach for decentralized free/open-source development.)

27 Likes

I apologize if the following is a little ranty, but I really dislike discussions about “someone should do a thing” when it’s not at all obvious who that should be.


Maybe I’m a different kind of programmer due to not being “formally trained” as a data scientist, but none of those are obvious to me. I literally only know they’re the go-to package because someone told me to use them in a course I once took. They’re also not easy to use - I remember being very frustrated with having to use (to me at the time) arcane syntax and having to fit my code to the @ sprinkled paradigm, just so I could have my code finish in time. It’s not obvious AT ALL that I have to do this to make it work.

Like, I get that it’s easy to say “look! python has this nice ecosystem that everyone is using!”, but did it occur to you that this has been a deliberate effort by the python community to make it so? And that this has not happened overnight and has had development behind it for YEARS by now? I’m not joking with this, the original python philosophy was in part inspired by UNIX philosophy after all, to have one obvious way of doing things.

I do understand that you want something like that in the julia ecosystem, but going on and on about how you want it to be a thing and complaining that it isn’t does not help you in achieving that goal of yours.

Who, exactly, is supposed to review that? JuliaHub is a commercial offering. Do you want to have unpaid volunteers do that review work? Why should I volunteer for this?

This, to me, is the core problem with all this discussion about “something should be done about X!” - it completely ignores that someone actually has to do that thing and this may not be something that people are willing to volunteer for, because it’s hard work that people would very much like to get paid for. You’re free to donate your free time to do that, instead of arguing that someone else should do it for you. I for one won’t - I’m much happier fixing little bugs & doc issues I notice in the julialang repo.

13 Likes

You’re presuming that someone already knows pandas and how to load a CSV file in Python. If they already know that, then they don’t need to search for how to load a CSV file. My example shows that for that simple example that keeps getting mentioned here, (1) the search results for Python actively lead you down the wrong path (csv library) and (2) the Python ecosystem is fractured and confusing (csv vs pandas vs PyArrow), whereas the search results for Julia lead you directly to the right approach with a clear and working tutorial of very high quality. There are no benchmarks in any of the google search results in either language, so I’m not sure what that’s in reference to.

Perhaps CSV parsing is not a good example and there are other things that are hard to find the right way to do in Julia. If so, some concrete examples would be helpful, since otherwise this discussion is kind of abstract. What, specifically, have people had a hard time finding? As another common example, I just googled “julia plotting” and the top result was the Plots tutorial, which seems like a great result. The tutorial code works flawlessly, and installing all of the Xorg graphics stack took less than a minute and worked perfectly. I was able to install matplotlib using pip3, but I have not figured out how to get it to actually show a window with a plot in it.

5 Likes

I would happily review as someone who is part of the community and wants it to succeed. This is usually a quick scan of the README followed by pressing an accept/reject button.

There is no problem with updating a commercial offering. We are not coding anything; we are just providing feedback to an already implemented system, fixing the database when it is missing something.

1 Like

To clarify, we are asking for a simple feature: a dropdown menu where logged-in users could select similar packages whenever they are reading a package page.

The work afterwards: logged-in users could propose links with the menu and other users could review the links to accept/reject. This is not that much work if you imagine that the community will be willing to help.
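For what it’s worth, the propose/review flow is small enough to sketch as a data structure. This is only an illustration under assumptions: none of it is an actual JuliaHub API, the class and method names are hypothetical, the user names are made up, and a single reviewer other than the proposer is assumed to be enough to accept or reject a link.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkProposal:
    package: str
    similar_to: str
    proposed_by: str
    status: str = "pending"  # "pending", "accepted", or "rejected"
    reviewed_by: Optional[str] = None


class SimilarPackages:
    """Hypothetical store for community-proposed 'similar package' links."""

    def __init__(self):
        self.proposals = []

    def propose(self, user, package, similar_to):
        proposal = LinkProposal(package, similar_to, user)
        self.proposals.append(proposal)
        return proposal

    def review(self, reviewer, proposal, accept):
        # Assumption: the proposer may not review their own proposal.
        if reviewer == proposal.proposed_by:
            raise ValueError("proposals must be reviewed by a different user")
        if proposal.status != "pending":
            raise ValueError("proposal was already reviewed")
        proposal.status = "accepted" if accept else "rejected"
        proposal.reviewed_by = reviewer

    def similar(self, package):
        # Accepted links are treated as symmetric.
        found = set()
        for p in self.proposals:
            if p.status != "accepted":
                continue
            if p.package == package:
                found.add(p.similar_to)
            elif p.similar_to == package:
                found.add(p.package)
        return sorted(found)


registry = SimilarPackages()
link = registry.propose("alice", "CSV.jl", "CSVFiles.jl")
registry.review("bob", link, accept=True)
print(registry.similar("CSVFiles.jl"))
```

The code is the easy part. The sketch says nothing about who the reviewers are or how accepted links stay up to date over time.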

100% agree. The scipy/numpy/matplotlib/pandas combination emerged as a low-bug default through massive investment, and those packages didn’t start out as the “good enough” standards.

This entirely explains why the experience is so different, but not what to do about it.

I agree, but at this point there are still a lot of people who don’t seem to recognize that it is even different at all - which means it is hard to fix. I assume it is because they are using Julia mostly on its own, or they use both languages in different ways where they don’t run into the same basic usage issues. A Julia-specific solution to the problem needs to start by agreeing there is a problem in the first place, and then whether it is worth addressing.

Exactly. Julia cannot forge the same path (you can’t magically make organizations with deep pockets appear), but if people agree that there is an issue then they can decide whether it is worth addressing.

I don’t think that package discovery is the entirety of the problem. Even if you can discover the packages, their features often don’t overlap enough, so you end up having to choose not just on the goal (e.g. interpolation) but also on the specific features (e.g., regular vs. irregular grid, dimensionality, etc.).

The best approach I have seen so far for dealing with this in a decentralized language is the SciML approach. Get everyone using wrapper packages as the default “no thinking” baseline, then those packages can do integration testing with downstream packages and decide when they are sufficiently solid to wrap. If one downstream dependency is buggy or has incomplete feature coverage they can swap them. Otherwise, people can use the direct packages as they evolve and experiment as they wish, but intro users can keep things simple.
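Here is a rough sketch of the wrapper idea, using interpolation as the running example. It is written in Python purely for illustration and does not reflect how SciML packages are actually structured: the class, the two toy backends, and all names are hypothetical. The point it shows is that users call one entry point, and the wrapper maintainers can swap the default backend without changing user code.

```python
class InterpolationWrapper:
    """Hypothetical 'no thinking' entry point over swappable backends."""

    def __init__(self):
        self._backends = {}
        self._default = None

    def register(self, name, impl):
        self._backends[name] = impl
        if self._default is None:
            self._default = name  # first registered backend is the default

    def set_default(self, name):
        # Maintainers would call this once a backend passes integration tests.
        if name not in self._backends:
            raise KeyError(name)
        self._default = name

    def interpolate(self, xs, ys, x, backend=None):
        impl = self._backends[backend or self._default]
        return impl(xs, ys, x)


def linear_backend(xs, ys, x):
    # Piecewise-linear interpolation; assumes xs is strictly increasing.
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x is outside the interpolation range")


def nearest_backend(xs, ys, x):
    # Value at the closest grid point.
    i = min(range(len(xs)), key=lambda k: abs(xs[k] - x))
    return ys[i]


wrapper = InterpolationWrapper()
wrapper.register("linear", linear_backend)
wrapper.register("nearest", nearest_backend)

xs, ys = [0.0, 1.0, 2.0], [0.0, 10.0, 20.0]
first = wrapper.interpolate(xs, ys, 0.25)   # uses the "linear" default
wrapper.set_default("nearest")
second = wrapper.interpolate(xs, ys, 0.25)  # same call, different backend
print(first, second)
```

Advanced users can still pass `backend="linear"` explicitly, or skip the wrapper and call a backend directly, which is the "use the direct packages as they evolve" half of the approach.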

5 Likes

That is how everyone learns it. But in python the information conveyed is: import pandas, numpy, matplotlib, scipy and you have everything you need.

It isn’t possible to get to that point for julia, nor is it necessarily a goal, but it is important to recognize it is different.

I have dedicated an enormous amount of personal effort in education, evangelizing the language and packages, and have funded open-source projects and summer of code students for years to try to contribute to this issue.

Rant away, but do so understanding what others have tried to do here.

4 Likes

Wrappers have downsides as well, because they tend to enforce a “lowest-common denominator” API which can be limiting, especially as the problem domain becomes more complex. It also kind of puts the cart before the horse — it is much easier to develop a good wrapper API after you have multiple high-quality competing implementations.

For example, if we had somehow decreed 5 years ago that all finite-element packages in Julia should follow a common “FEMwrapper” API (e.g. based on JuliaFEM), that would have prohibited innovations like Gridap.jl.

7 Likes

FWIW, it’s not ready for a big announcement, but we’re almost at having centralized documentation as well.

https://docs.sciml.ai/dev/

There are a lot more packages we need to add to it for it to really be all of SciML (Add more of the packages to the docs · Issue #2 · SciML/SciMLDocs · GitHub)

50% agreed. It’s more so that there are some domains that are better for wrapping and other domains where it’s much harder. Linear solvers, optimization, differential equations, etc.: getting all of those uniform, using the same keyword arguments, and all being efficient on the same interface isn’t bad at all. PDEs in general… yeah that’s hard to put a single interface to, so we welcome the fact that people will build all sorts of PDE solvers and we’ll try to incorporate them all into one really smart symbolic system.

6 Likes

I’m surprised nobody mentioned Julia.jl yet. It includes precisely a section listing the two main CSV readers (CSV.jl and CSVFiles.jl): Julia.jl/FileIO.md at master · svaksha/Julia.jl · GitHub

I think the ecosystem would benefit from making this kind of resource more visible, e.g. by hosting it or linking it on JuliaHub. R’s CRAN has a link to their task views on the home page.

2 Likes

Julia.jl is too uncurated and full of junk IMO.

3 Likes

It’s not a problem with a single package, no, but it sure doesn’t scale. What if the package only has a sparse README? What if you are not a domain expert and can’t decide whether a package is “good” for its field? Would you feel comfortable recommending a package about something outside of your field of expertise or outside of what you personally use? As a concrete example, would you feel comfortable recommending & linking packages related to cryptography? I think we’re going to run into “I’ve reviewed all packages I care/feel knowledgeable about” much sooner, rather than later.

The pond is much bigger than you think.

Who is to update that? Who is to maintain the quality of recommendations? The links between packages? What happens when a package is no longer maintained, but still highly recommended? How much work do you estimate it would be to periodically scan the registered list of packages for packages that were once, but are no longer, recommended? You’re speaking in very abstract terms about something neither you nor I have internal knowledge of how to actually pull off and support long term. All I can bring to the table in terms of relevant experience is moderating a few small scale forums (think fewer than 50 people) and an accompanying wiki, but that was already a support nightmare and just sucked up waaay more time & resources than was practical.
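To put a shape on the "periodic scan" part: the mechanical half could look roughly like the sketch below. All of it is assumed for illustration, including the package names, the dates, the record layout, and the one-year staleness threshold. It only flags candidates; deciding whether a flagged package should lose its recommendation is still the human work being discussed.

```python
from datetime import date, timedelta


def stale_recommendations(packages, today, max_age_days=365):
    """Names of recommended packages with no release within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, info in packages.items()
        if info["recommended"] and info["last_release"] < cutoff
    )


# Hypothetical registry snapshot, invented for this example.
packages = {
    "PkgA": {"last_release": date(2022, 3, 1), "recommended": True},
    "PkgB": {"last_release": date(2019, 5, 1), "recommended": True},
    "PkgC": {"last_release": date(2018, 1, 1), "recommended": False},
}
flagged = stale_recommendations(packages, today=date(2022, 6, 1))
print(flagged)
```

Note that "no recent release" is a crude proxy: a small, finished package can be perfectly healthy without releases, which is exactly the kind of call that needs a reviewer.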

I don’t have any stats on that, so admittedly I’m talking out of my subjective POV, but I’d wager there aren’t actually that many people logged into juliahub, compared to the number of general search “users”. Even if that’s the case, the vast majority probably do not want to spend their spare time reviewing random packages. So we’re now speculating about whether the community would be willing to help - with what concrete action? Who from the community is to do that work? Are we all expected to donate, say, 2 hours each week to this effort? What if I don’t want to do that - am I now shunned because obviously I’m not participating in this particular effort to curate a commercial offering made available to the community by a company? I do find my time well spent actually improving documentation & fixing issues, thank you very much.

You do realize that from my (and lots of other python users’) POV, that ecosystem is a niche? Further, what’s to stop us from doing the exact same thing, by writing blog posts about how people should use X in julia? In fact, that is exactly what is happening!

Why not? To a lay person in that field like me, the SciML ecosystem seems to be exactly what I’d want for that, and that’s just from what I picked up through osmosis on this forum and slack.

I do understand it, I really do. I myself am known to be the “julia evangelizer” in my circle. I recognize that you have done the funding and evangelizing as well, which is great! But please also recognize that we’re not all able to do it to the same level, be it due to financial or other reasons. I’m in no position to fund GSoC students - just two years ago, I applied myself after all!

However, none of that takes away from the fact that talking about a thing does not make that thing happen. Saying that someone should do X does not bring X closer to reality. Saying that X should be changed to do Y does not change X to do Y, nor does it say anything about how it can be sustained long term. Encouraging people to step up and do work for free is not a sustainable model - there are enough case studies in open source to prove that.


I’m really trying to understand what exactly should be done and most importantly, WHO should do that. The general vibe I’ve been getting from this and the other recent discussions has never been a concrete “I want X to be a thing, so I’m making it a thing”. It’s always been getting someone else to do a thing, seemingly without considering what it would require from them to do that thing and continue to support it (which is actually the hard part, as I’m sure you can attest to as well). Maybe I’m wrong about that, but that’s the impression I got.

2 Likes

For sure. For some you can slowly build up coverage where people deviate to use the raw interface as necessary. The big concern is that you need to think ahead in the interface design if you want to slow-roll features (e.g. if you start with unconstrained optimization and slowly add in box constraints, nonlinear constraints, complementarity, will it break existing interfaces). But breaking interfaces is probably better than the alternative.

I agree. I think some are more amenable than others, and anything DSL-y will end up a failure.

But part of this is a community thing. I worked with quantecon to fund some of the work on the consolidation of wrappers (and sciml people did the vast majority afterwards) but it hasn’t yet become a rallying cry in the community. If it did, and people saw it as part of the solution they could contribute to, then it could make progress very quickly.

I should be clear here in that I’m not suggesting that we shouldn’t ask for JuliaHub to support something like this. I just don’t think it’s sustainable, AT ALL, to ask community members to freely donate their time to make a commercial offering by a company better, even if that company is founded solely for furthering julia & selling julia-based products. Hence, to me at least, the idea of a “community curated set of packages” immediately falls flat on its face, especially once you start looking beyond the “numpy, scipy, etc” equivalent-in-julia bubble and start moving into (to you) niche topics.

2 Likes