Searching and clustering Julia packages by name, description, topics, and README

Hi,

This post is an early proposal to tackle some of the discoverability and fragmentation issues in the Julia package ecosystem with information retrieval tools. The idea is to complement the calls for joining efforts and standardization with discoverability strategies and tools that take advantage of that work and of the text information already available in packages.

The intuition is that such tools help people find the package they need and avoid unnecessary duplication of packages. I know that JuliaHub has some of these capabilities, but the idea is to access them directly from the REPL.

In more detail:

  1. Propose adding some optional metadata in README.md or Project.toml. The idea is to add valuable data to categorize the package (topics, descriptions), state its current status (e.g., experimental, user ready, etc.), and so on.
  2. Build full-text indexes for package names, descriptions, topics, READMEs, and perhaps other fields. Make all of these searchable from the command line, perhaps from a REPL mode, so that it is easy to select topics or search for packages that do something instead of having to remember names (solving the Cthulhu name dilemma…).
  3. Index the functions of all packages and their documentation, so that any developer can search them in the same way as in (2).
  4. Provide access to the (automatic and non-automatic) categorization of packages from the REPL.
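To make point (2) a bit more concrete, here is a minimal sketch of a per-field inverted index with a simple "all query words must match" search. It is written in Python only to keep it dependency-free; it is not the code from the notebooks, and the package records and the names `build_index` and `search` are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical package records; names and texts are invented for this example.
PACKAGES = [
    {"name": "FastPlots", "description": "fast plotting of time series",
     "topics": ["plotting", "visualization"]},
    {"name": "GraphTools", "description": "algorithms for graph analysis",
     "topics": ["graphs", "networks"]},
    {"name": "SeriesStats", "description": "statistics for time series",
     "topics": ["statistics", "time-series"]},
]


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(packages, fields):
    """Map (field, token) to the set of package names containing that token in that field."""
    index = defaultdict(set)
    for pkg in packages:
        for field in fields:
            value = pkg[field]
            text = " ".join(value) if isinstance(value, list) else value
            for token in tokenize(text):
                index[(field, token)].add(pkg["name"])
    return index


def search(index, query, fields):
    """Return packages where every query token appears in at least one of the given fields."""
    result = None
    for token in tokenize(query):
        hits = set()
        for field in fields:
            hits |= index.get((field, token), set())
        result = hits if result is None else result & hits
    return sorted(result or [])


INDEX = build_index(PACKAGES, ["name", "description", "topics"])
print(search(INDEX, "time series", ["description"]))
print(search(INDEX, "plotting", ["topics"]))
```

Restricting `fields` is what would let a REPL mode search only topics, only descriptions, or everything at once.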

It is essential that the tools are written in pure Julia, so that anyone can run them on their own computer without a dedicated server or extra setup. Of course, keeping indexes and data up to date and keeping services running (or loading data to answer each query) can be challenging. So this task can be centralized, with consumers accessing it through a JSON API client, but in principle anyone should be able to run these tools on their own computer. An alternative is to produce binary bundles that are easily updated by fetching new ones.

I created full-text indexes to exemplify the idea.

I created two notebooks showing examples of working with text-package indexes, visualizing, and clustering them.

  • https://github.com/sadit/Search-Julia-Packages/blob/main/search.ipynb shows an example of performing searches in different indexes; note that it just makes a JSON dump of registers.
  • https://github.com/sadit/Search-Julia-Packages/blob/main/clustering.ipynb shows a UMAP visualization and DBSCAN clustering. The text is represented as a bag of words with some stopwords removed. Using cosine similarity, I computed the all-kNN graph, which is then used to compute a non-linear dimensionality reduction (UMAP in 2D and 3D) to visualize clusters. The notebook then runs DBSCAN on the 2D embedding, now using L2 distance and a reasonable epsilon estimated on that embedding, to produce a list of clusters. I saved the intermediate data since it can be used to create other interesting things, like graph analyses and visualizations.
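The first steps of the clustering notebook (bag of words, cosine similarity, all-kNN graph) can be sketched roughly as follows. This is a brute-force Python illustration under my own assumptions, not the notebook's code: the stopword list, the README texts, and the function names are invented, and the UMAP and DBSCAN steps are left out.

```python
import math
import re
from collections import Counter

# Illustrative stopword list only; the notebook uses its own list.
STOPWORDS = {"a", "for", "in", "of", "the", "julia", "package"}


def bag_of_words(text):
    """Count the tokens of a text, skipping stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def all_knn(docs, k):
    """For every document, return its k most similar other documents, best first."""
    vectors = {name: bag_of_words(text) for name, text in docs.items()}
    graph = {}
    for name, vec in vectors.items():
        sims = [(other, cosine(vec, ov)) for other, ov in vectors.items() if other != name]
        sims.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by name
        graph[name] = sims[:k]
    return graph


# Hypothetical README snippets.
READMES = {
    "PlotA": "A Julia package for plotting time series",
    "PlotB": "Plotting and visualization of series data",
    "GraphC": "Graph algorithms for network analysis",
}

GRAPH = all_knn(READMES, k=1)
for name, neighbors in GRAPH.items():
    print(name, neighbors)
```

The brute-force loop is quadratic in the number of packages; for the whole registry a proper nearest-neighbor index would be the more reasonable choice.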

These examples use BM25 indexes and a bag of words since they are pretty fast, explainable, and very scalable. It is also possible to use dense retrieval, at the cost of running expensive language models.
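For readers who have not used BM25, here is a small self-contained sketch of the scoring. It uses the common Okapi BM25 formula with a smoothed IDF and the usual defaults `k1=1.5` and `b=0.75`; those defaults, the `BM25` class, and the package descriptions are assumptions for illustration and not what the notebooks use internally.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Tiny in-memory BM25 scorer over a dict of name -> text."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.names = list(docs)
        self.tfs = [Counter(tokenize(docs[n])) for n in self.names]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        self.k1, self.b = k1, b
        # Document frequency: in how many documents each term occurs.
        self.df = Counter(t for tf in self.tfs for t in tf)

    def idf(self, term):
        n = len(self.names)
        return math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))

    def score(self, query, i):
        tf, total = self.tfs[i], 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Length normalization: long documents need more matches to score well.
            norm = f + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def search(self, query, top=3):
        scored = [(n, self.score(query, i)) for i, n in enumerate(self.names)]
        scored = [(n, s) for n, s in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:top]


# Hypothetical package descriptions.
DOCS = {
    "OptimA": "nonlinear optimization solvers",
    "OptimB": "optimization modeling language for linear and nonlinear optimization problems",
    "PlotC": "plotting library",
}

ENGINE = BM25(DOCS)
print(ENGINE.search("optimization"))
print(ENGINE.search("plotting"))
```

Every score is a sum of per-term contributions, which is what makes the ranking easy to explain: you can show which query words matched and how much each one contributed.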

Summary

Of course, this is just a proof of concept, or even less. Still, it shows that several existing packages can be combined to tackle, or at least help overcome, the discoverability and fragmentation issues by providing search and categorization tools. It could be up and running in a short time, and it can evolve and improve over time while remaining orthogonal to other solutions.

I will be happy to discuss and collaborate on an information ecosystem for packages.


Should we merge this with Fixing Package Fragmentation - #47 by jlapeyre ?

That is fine with me.

Is this close to providing a list of the packages it considers to be “in the same cluster”? Can we see which packages it identified?

JuliaHub has a clustering feature that found some useful packages for me. The problem is not so much the quality of the matches, but whether the tool is integrated closely enough into the community that people will use it.

It is possible to compute a cluster, as shown in

https://github.com/sadit/Search-Julia-Packages/blob/main/clustering.ipynb

This clustering example is based on the README content and the package’s names. It is also possible to use descriptions and topics easily. I tried to remove stopwords and other Julia language-specific common words to reduce the similarity induced by them.

I used DBSCAN on a UMAP projection, which in the end is more or less a kind of spectral clustering.

We still need to find out which notion of similarity is most beneficial for most people, so that the clustering can try to capture it.

If you are looking for similar packages, the solution could be

https://github.com/sadit/Search-Julia-Packages/blob/main/search-pkgs.ipynb

You can change the weights and the number of retrieved neighbors (see search_pkgs at src/packages.jl).
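As a rough illustration of what changing weights and the number of neighbors means, the sketch below ranks packages by a weighted sum of per-field similarities. It is not the `search_pkgs` function from the repository: `similar_packages`, the Jaccard similarity, and the package records are hypothetical stand-ins written in Python.

```python
import re


def tokens(text):
    """Set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_packages(packages, name, weights, k):
    """Return the k packages most similar to `name` under a weighted sum of field similarities."""
    query = packages[name]
    scored = []
    for other, fields in packages.items():
        if other == name:
            continue
        score = sum(
            w * jaccard(tokens(query[f]), tokens(fields[f]))
            for f, w in weights.items()
        )
        scored.append((other, score))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by name
    return scored[:k]


# Hypothetical package metadata.
PACKAGES = {
    "KNNSearch": {"description": "nearest neighbor search", "topics": "search indexing"},
    "TextIndex": {"description": "full text search engine", "topics": "search indexing text"},
    "NeighborGraphs": {"description": "nearest neighbor graphs", "topics": "graphs"},
}

BY_DESCRIPTION = similar_packages(PACKAGES, "KNNSearch", {"description": 1.0, "topics": 0.0}, k=1)
BY_TOPICS = similar_packages(PACKAGES, "KNNSearch", {"description": 0.0, "topics": 1.0}, k=1)
print(BY_DESCRIPTION)
print(BY_TOPICS)
```

With these made-up records, weighting only the description and weighting only the topics give different nearest packages, which is the kind of trade-off the weights are meant to expose.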

Of course, this solution requires improvements in parsing and how the collected metadata is used.