This post is an early proposal to tackle some discoverability and fragmentation issues in the Julia package ecosystem with information tools. The idea is to complement calls for consolidating efforts and standardization with discoverability strategies and tools that take advantage of the textual information already present in packages.
The intuition is that such tools would let people find an existing package that meets their needs and thus avoid unnecessary package multiplication. I know that JuliaHub already has some of these capabilities, but the idea here is to access them directly from the REPL.
- propose adding some optional metadata to Project.toml. The idea is to add valuable data to categorize the package (topics, description), state its current status (e.g., experimental, user-ready), and so on.
- build full-text indexes over package names, descriptions, topics, READMEs, and perhaps other fields. Make all of these searchable from the command line, perhaps from a dedicated REPL mode, so it is easy to browse topics or search for packages that do something instead of just remembering names (addressing the discoverability issue).
- index the functions and documentation of all packages so that any developer can search them in the same way as in (2).
- provide access to (automatic and manual) categorization of packages from the REPL.
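As a rough sketch, the optional metadata from the first item could look like this in Project.toml. The section and field names below are hypothetical; the actual names would have to be agreed upon with the registry tooling:

```toml
name = "MyPackage"
version = "0.1.0"

# Hypothetical optional section; not currently recognized by Pkg.
[metadata]
description = "Short one-line description of what the package does"
topics = ["text-search", "information-retrieval"]
status = "experimental"   # e.g., experimental, maintained, user-ready
```

Since Project.toml is already parsed by Pkg, an extra table like this would be cheap to read when building the indexes.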
It is essential that this be pure Julia, so anyone can run it on their own computer without needing a dedicated server or extra setup. Of course, keeping indexes and data updated, and keeping services running (or loading the data to answer each query), can be challenging. So this task could be centralized, with consumers accessing it through a JSON API client, but in principle anyone should be able to run these tools locally. An alternative is to produce binary index bundles that are easily updated by fetching new ones.
To exemplify the idea, I created full-text indexes and two notebooks showing how to search, visualize, and cluster these text-package indexes.
- https://github.com/sadit/Search-Julia-Packages/blob/main/search.ipynb shows an example of performing searches against the different indexes; note that it simply prints a JSON dump of the matching records.
- https://github.com/sadit/Search-Julia-Packages/blob/main/clustering.ipynb shows a UMAP visualization and DBSCAN clustering. The text is represented as a bag of words after removing some stopwords. Using cosine similarity, I computed the all-k-nearest-neighbor graph and fed it to a non-linear dimensionality reduction (UMAP, 2D and 3D) to visualize clusters. The notebook then runs DBSCAN (estimating a reasonable epsilon on the 2D embedding) to produce a list of clusters, now using the L2 distance on the 2D embedded dataset. I saved the intermediate data since it can be used for other interesting things, like graph analysis and visualizations.
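The bag-of-words plus cosine-similarity neighbor step above can be sketched in plain Julia. This is a toy illustration with made-up documents, using only the standard library; the notebook itself uses dedicated packages for the UMAP and DBSCAN stages:

```julia
using LinearAlgebra

# Toy stand-ins for package descriptions (hypothetical, not real metadata).
docs = [
    "dataframes and tabular data",
    "tabular data manipulation",
    "gpu kernels and arrays",
]

tokenize(s) = split(lowercase(s))
vocab = sort(unique(vcat(tokenize.(docs)...)))

# Bag-of-words vector of a document over the shared vocabulary.
bow(d) = Float64[count(==(t), tokenize(d)) for t in vocab]
X = [bow(d) for d in docs]

cosine(u, v) = dot(u, v) / (norm(u) * norm(v))

# k nearest neighbors of each document by cosine similarity
# (self excluded); this graph is what feeds UMAP's construction.
k = 1
n = length(docs)
neighbors = [sortperm([j == i ? -Inf : cosine(X[i], X[j]) for j in 1:n];
                      rev = true)[1:k] for i in 1:n]
println(neighbors)
```

The two "tabular data" documents end up as each other's nearest neighbor, while the GPU document attaches only through the shared stopword "and", which is why stopword removal matters in the real pipeline.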
These examples use BM25 indexes and a bag-of-words representation since they are fast, explainable, and very scalable. It is also possible to use dense retrieval, at the cost of running expensive language models.
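To make the BM25 part concrete, here is a minimal self-contained scoring sketch over toy "package descriptions" (hypothetical data; the prototype uses proper index packages, and `k1`/`b` below are the usual textbook defaults):

```julia
# Toy corpus standing in for package descriptions.
docs = [
    "fast json parsing and serialization",
    "plotting and visualization of data",
    "text search and indexing utilities",
]

tokenize(s) = split(lowercase(s))

# Term frequencies per document and document frequencies per term.
tfs = [Dict{String,Int}() for _ in docs]
df = Dict{String,Int}()
for (i, d) in enumerate(docs)
    for t in tokenize(d)
        tfs[i][t] = get(tfs[i], t, 0) + 1
    end
    for t in keys(tfs[i])
        df[t] = get(df, t, 0) + 1
    end
end

N = length(docs)
avgdl = sum(length(tokenize(d)) for d in docs) / N

# BM25 score of a query against document i.
function bm25(query, i; k1 = 1.5, b = 0.75)
    dl = sum(values(tfs[i]))
    score = 0.0
    for t in tokenize(query)
        haskey(tfs[i], t) || continue
        idf = log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        tf = tfs[i][t]
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    end
    return score
end

# Rank documents for a query; the JSON document scores highest here.
query = "json parsing"
ranked = sort(1:N; by = i -> -bm25(query, i))
println(docs[ranked[1]])
```

Every quantity involved (term counts, document frequencies, lengths) is directly inspectable, which is what makes this kind of index explainable compared to dense embeddings.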
Of course, this is just a proof of concept, or even less. Still, it shows that several existing packages can be combined to tackle, or at least help mitigate, the discoverability and fragmentation issues by providing search and categorization tools. It could be working in a short time, it can evolve and improve over time, and it is orthogonal to other solutions.
I would be happy to discuss and collaborate on an information ecosystem for packages.