Yes, of course. The results are the same when logged out of Google. It’s easy for anyone to try the exact same thing.
Regarding this specific example with CSV files… I had students who preferred to follow the official Julia documentation and attempted DelimitedFiles instead. They had issues with the stdlib and thought that they were doing something wrong. After all, they thought: “this is a stdlib, I must be doing something wrong…”.
DelimitedFiles should absolutely be removed as a stdlib… and it is: remove DelimitedFiles from being an stdlib by KristofferC · Pull Request #45540 · JuliaLang/julia · GitHub. It should also have some notes added to its documentation indicating that CSV should generally be preferred. If someone wants to take something straightforward and actionable from this, adding those notes is a good action item. That said, DelimitedFiles is occasionally good for simple files, such as those representing numerical matrices.
JuliaHub or any other community-driven platform could introduce metrics about the “healthy state” of a package, or at least introduce links to similar efforts to help make the redundancy explicit.
JuliaHub includes both GitHub star count and recent download counts as indicators of activeness. If you or anyone has ideas for better indicators, please feel free to propose them.
At a higher level, there’s a little disconnect in expectation here: I think it is neither the responsibility of nor an appropriate role for a third party commercial platform like JuliaHub to decide which packages people should prefer in the open source ecosystem. That is one of the reasons I’m advocating for people to create and establish open source, community driven graphs/metrics/whatever. If the community agrees that these are useful, then it’s fine for those to be included on JuliaHub and JuliaPackages and wherever else, but it’s just not a company’s place to dictate what packages “win” in the open source ecosystem.
IMHO, as a leader of this community you could approach the discussion differently by asking more questions in the thread to guide the discussion towards improvements. When you try to find evidence that go against the issues raised, you are taking a position of disbelief, which then triggers a whole set of arguments that are not productive.
I’m trying to determine what concrete steps can be taken to improve things. Broad, generalized complaints don’t help improve anything. Since the general complaint here is that basic data science packages are hard to find, I tried what a new user would do when trying to figure out how to do things like reading a CSV file or plotting something. It’s unfortunate that you interpret this as me trying to “find evidence that go against the issues raised” or “taking a position of disbelief”. I’m trying to find specific things that can be improved. My specific findings are:
- If a new user is trying to figure out how to load a CSV file, they are quite likely to learn the best way to do it by googling.
- If they limit themselves to stdlibs, they may be misled into trying to use DelimitedFiles, which does work for some kinds of data, but not for general heterogeneous tabular data. So it’s a good thing we’re removing that as a stdlib, and notes should be added to the documentation referencing CSV as a probably preferable alternative.
- If a new user is trying to figure out how to plot something, they are likely to find the official Plots tutorial, which is both a good package choice and a working, current tutorial.
- Interpolations apparently doesn’t cover some basic use cases and may be a pain point (again, unclear to me as I’ve never really used this functionality). There is already prominent linkage to other interpolation packages, but perhaps that could be improved by someone who knows about this. It may also be possible to add whatever features new users are likely to need to the Interpolations package.
There have been some responses along the lines of “but if I’m using Python then everything I need is in scipy/pandas/matplotlib”. The thing is, I’m not sure what to do with that. Sure, that may be true, but this isn’t Python. (And sometimes it seems that the pandas CSV reader isn’t even the one you should use; you should be using PyArrow instead, according to the author of pandas, among others.) Julia doesn’t do monolithic superpackages like scipy. Some people may think that it should, but I’m not one of them. So it’s just unclear what the actionable aspect of this observation is, aside from trying to make sure that the “google what I want to do” approach works as well as it can.
Start with the fact that multiple professors at different universities are sharing the same point of view. You may disagree with this point of view, but there is certainly an issue somewhere to be addressed. Now, what questions can you ask to identify the core issues? Trying to prove the opposite is not the best leading strategy.
Yes, that’s why I’m here asking exactly those questions.
I will try to implement the initial set of links using “Deep NLP” like I did in this paper: https://arxiv.org/pdf/1712.01476.pdf
Awesome. I look forward to that.
In the meantime, nothing blocks the development of a textbox menu in JuliaHub so that users could manually insert links in the similarity graph.
This is actually the part that I think is a massive product effort and have no appetite for building. Sounds simple… “just a text box”. But any time you have humans giving input, it gets complicated. The text box has a lot of options, so it will need a search filter functionality. I’m sure there’s some fancy JavaScript widget someone has built that already does that well, but someone needs to research that and integrate it and make sure that it works nicely on the site. And any time there’s human input, there’s danger of gaming and spam and then you need systems to detect and deal with that. People will be proposing edits to a community knowledge base and as soon as you have that, you need a system to let people review and approve or reject those proposed edits. That means you need a notion of “community admins” who are allowed to review things. And moderation tools like allowing community admins to block people who keep proposing bad edits. And logging of all actions taken by admins. And features for platform admins to manage the community admins. It sounds simple, but it’s really a whole can of worms from a product perspective. It’s all doable, but it’s… complex. (And all that for a feature with zero commercial demand.)
It seems much more plausible to me if the “related/alternative packages” graph is maintained externally as a community resource. For example, it could be a TOML file that’s maintained on GitHub. Then changes would be proposed and reviewed using the same tooling that we already use for code and registries. Or even simpler: people propose edits to READMEs linking to other related packages.