Yes, of course. The results are the same when logged out of Google. It's easy for anyone to try the exact same thing.
Regarding this specific example with CSV files... I had students who preferred to follow the official Julia documentation and attempted DelimitedFiles instead. They had issues with the stdlib and thought that they were doing something wrong. After all, they thought: "this is a stdlib, I must be doing something wrong...".
DelimitedFiles should absolutely be removed as a stdlib... and it is: remove DelimitedFiles from being an stdlib by KristofferC · Pull Request #45540 · JuliaLang/julia · GitHub. It should also have some notes added to it indicating that CSV should generally be preferred. If someone wants to take something straightforward and actionable from this, adding those notes is a good action item. That said, DelimitedFiles is occasionally good for simple files representing numerical matrices and such.
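To make the distinction concrete, here is a minimal sketch of where DelimitedFiles fits and where it starts to hurt. It assumes DelimitedFiles is still available in your Julia installation, and the sample data is made up for illustration.

```julia
using DelimitedFiles  # assumed to be available without installing anything

# A purely numeric file: passing an element type gives a typed matrix.
numeric = readdlm(IOBuffer("1.0,2.0\n3.0,4.0\n"), ',', Float64)

# A file with a header row and mixed columns (made-up sample data).
# Without an element type, every cell, header included, lands in one
# untyped matrix, so per-column types are lost.
mixed = readdlm(IOBuffer("name,age\nalice,30\nbob,25\n"), ',')

println(typeof(numeric))
println(eltype(mixed))
println(size(mixed))
```

For the second kind of file, something like `CSV.read(path, DataFrame)` from the CSV and DataFrames packages is probably the better fit, since it keeps a type per column instead of one untyped matrix.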
JuliaHub or any other community-driven platform could introduce metrics about the "healthy state" of a package, or at least introduce links to similar efforts to help make the redundancy explicit.
JuliaHub includes both GitHub star count and recent download counts as indicators of activeness. If you or anyone has ideas for better indicators, please feel free to propose them.
At a higher level, there's a little disconnect in expectation here: I think it is neither the responsibility of nor an appropriate role for a third-party commercial platform like JuliaHub to decide which packages people should prefer in the open source ecosystem. That is one of the reasons I'm advocating for people to create and establish open source, community-driven graphs/metrics/whatever. If the community agrees that these are useful, then it's fine for those to be included on JuliaHub and JuliaPackages and wherever else, but it's just not a company's place to dictate what packages "win" in the open source ecosystem.
IMHO, as a leader of this community you could approach the discussion differently by asking more questions in the thread to guide the discussion towards improvements. When you try to find evidence that go against the issues raised, you are taking a position of disbelief, which then triggers a whole set of arguments that are not productive.
I'm trying to determine what concrete steps can be taken to improve things. Broad, generalized complaints don't help improve anything. Since the general complaint here is that basic data science packages are hard to find, I tried what a new user would do when trying to figure out how to do things like reading a CSV file or plotting something. It's unfortunate that you interpret this as me trying to "find evidence that go against the issues raised" or "taking a position of disbelief". I'm trying to find specific things that can be improved. My specific findings are:
- If a new user is trying to figure out how to load a CSV file, they are quite likely to learn the best way to do it by googling.
- If they limit themselves to stdlibs, they may be misled into trying to use DelimitedFiles, which does work for some kinds of data, but not general heterogeneous tabular data. So it's a good thing we're removing that as a stdlib and notes should be added to the documentation referencing CSV as a probably preferable alternative.
- If a new user is trying to figure out how to plot something, they are likely to find the official Plots tutorial, which is both a good package choice and a working, current tutorial.
- Interpolations apparently doesn't cover some basic use cases and may be a pain point (again, unclear to me as I've never really used this functionality). There is already prominent linkage to other interpolation packages, but perhaps that could be improved by someone who knows about this. It may also be possible to add whatever features new users are likely to need to the Interpolations package.
There have been some responses along the lines of "but if I'm using Python then everything I need is in scipy/pandas/matplotlib". The thing is I'm not sure what to do with that. Sure, that may be true, but this isn't Python. (And sometimes it seems that the pandas CSV reader isn't even the one you should use; you should be using PyArrow instead, according to the author of pandas, among others.) Julia doesn't do monolithic superpackages like scipy. Some people may think that it should, but I'm not one of them. So it's just unclear what the actionable aspect of this observation is, aside from trying to make sure that the "google what I want to do" approach works as well as it can.
Start with the fact that multiple professors at different universities are sharing the same point of view. You may disagree with this point of view, but there is certainly an issue somewhere to be addressed. Now, what questions can you ask to identify the core issues? Trying to prove the opposite is not the best leading strategy.
Yes, that's why I'm here asking exactly those questions.
I will try to implement the initial set of links using "Deep NLP" like I did in this paper: https://arxiv.org/pdf/1712.01476.pdf
Awesome. I look forward to that.
In the meantime, nothing blocks the development of a textbox menu in JuliaHub so that users could manually insert links in the similarity graph.
This is actually the part that I think is a massive product effort and have no appetite for building. Sounds simple... "just a text box". But any time you have humans giving input, it gets complicated. The text box has a lot of options, so it will need a search filter functionality. I'm sure there's some fancy JavaScript widget someone has built that already does that well, but someone needs to research that and integrate it and make sure that it works nicely on the site. And any time there's human input, there's danger of gaming and spam and then you need systems to detect and deal with that. People will be proposing edits to a community knowledge base and as soon as you have that, you need a system to let people review and approve or reject those proposed edits. That means you need a notion of "community admins" who are allowed to review things. And moderation tools like allowing community admins to block people who keep proposing bad edits. And logging of all actions taken by admins. And features for platform admins to manage the community admins. It sounds simple, but it's really a whole can of worms from a product perspective. It's all doable, but it's... complex. (And all that for a feature with zero commercial demand.)
It seems much more plausible to me if the "related/alternative packages" graph is maintained externally as a community resource. For example, it could be a TOML file that's maintained on GitHub. Then changes would be proposed and reviewed using the same tooling that we already use for code and registries. Or even simpler: people propose edits to READMEs linking to other related packages.
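As a rough sketch of what such a TOML file could look like and how a site might consume it: the schema here (one table per package with `related` and `note` keys) and the `related` helper are hypothetical, not an existing format, and the two entries only restate what has been said in this thread. Parsing uses the `TOML` standard library.

```julia
using TOML  # standard library in recent Julia versions

# Hypothetical schema: one table per package, listing related packages
# plus a free-form note. A real file would live in a GitHub repository
# and be edited through pull requests.
const RELATED_TOML = """
[DelimitedFiles]
related = ["CSV"]
note = "Fine for simple numerical matrices; CSV is generally preferable for tabular data."

[CSV]
related = ["DelimitedFiles"]
note = "Reader for general heterogeneous tabular data."
"""

graph = TOML.parse(RELATED_TOML)

# Hypothetical helper: look up the related packages for a name,
# returning an empty list when the package has no entry.
related(graph, pkg) = get(get(graph, pkg, Dict{String,Any}()), "related", String[])

println(related(graph, "DelimitedFiles"))
println(related(graph, "NoSuchPackage"))
```

Since the file is plain text, proposing a new link is just a one-line pull request, and review, history, and revert come for free from the existing tooling.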