Package size and scope?

Hi All –

Recently started developing packages. Was wondering how everyone feels about package size and scope.

Do you feel like there is a minimum or maximum size/scope a package should be?

As a Developer:

Do you include just a few things in a package (and do them well)? Or do you put in everything but the kitchen sink?

As a User:

Do you include a package with 10 functions to just use 1 of them? Or do you not want to be bothered with a la carte packages?


For context, (and a tad less biased)

Leftpad comes to mind as an example of a package that probably went too small:

NPM & left-pad: Have We Forgotten How To Program?

// haven’t seen this explicitly discussed anywhere. so i’d just like to hear some people’s thoughts

1 Like

This was discussed previously on another thread, but I can’t locate where the discussion took place.

The current package manager was getting very slow due to the large number of packages. So if you have a bunch of related features, it would make sense to make a single package for that, so as not to make the number of packages unnecessarily huge.

If you have a bunch of coherent functionality that depends on each other to provide an API, then make a single package out of that.

You probably don’t need to split it up into multiple packages unless you have a particularly huge set of functionality. If you really have that much related functionality, like DifferentialEquations.jl, then it makes sense to split it into separate packages: many of those components offer a full set of related functionality that belongs together, yet can live independently of the rest of the DifferentialEquations ecosystem.

But if you have, say 10 related functions that help accomplish some task in a specific area, then it would make sense to keep them together. 10 functions is not a critical mass of functionality. However, if your 10 functions are completely unrelated to each other, then it doesn’t make sense to package them together.

If your functionality consists of a single function with only, say, 10 lines of code, it might not make sense to make it a registered package at all, since it is not really an API. It might make more sense to just write a blog post or a discussion post about it, or make a Jupyter notebook.

There probably isn’t a maximum size for a package, but maybe a good guideline is the Unix philosophy (do one thing and do it well).

However, I would prefer not seeing lots of tiny packages with single functions in them.

This whole NPM disaster thing is not properly following the UNIX philosophy.

Consider this quote from Einstein:

Make everything as simple as possible, but not simpler.

Not everything is meant to be a package; some snippets of code are better treated as examples for blogs, discussion threads, Jupyter notebooks, or gists.

So it’s a fine balance point. Use fine judgement and wisdom.

As I said, this has been discussed before, but it’s buried in some other thread somewhere.

2 Likes

Another note on Einstein’s quote:

Making everything as simple as possible, but not simpler: for example, if you have a bunch of related functions that work together, then it is simpler to make them into a single package. If you have a bunch of related functionality that is very complex and some of it can live independently, then it is simpler to split it up.

I’ll add some perspective as an active maintainer of the HTTP.jl package. In the Julia early days (circa 2012), there was a Hacker School project to put basic web functionality together in the form of the HttpServer, HttpCommon, Requests, and URIParser packages. Due to the transient/short-lived nature of Hacker School, these packages came out w/ an initial “bang” of functionality and usefulness, and then were hardly touched for years. Functionality was slowly duplicated across these packages as one-off contributors tried to fix a certain issue. Duplicate issues were also filed across these repos as users had a hard time knowing exactly which package was the exact cause of their issue.

HTTP.jl was born w/ the goal of modernizing the foundational webstack code in Julia and providing a cleaner/easier path forward in terms of maintenance. It began literally by merging the git histories/repos of the mentioned packages above and consolidation/enhancements began. In this case, merging the packages has led to an overall cleaner code organization, great reduction in duplicate “utils” functions, and an easier “one-stop-shop” for users when they need web functionality or have web-related issues. It’s also much easier to maintain as there is a single package’s tests to be run w/ enhancements/improvements, as well as a single package to tag/release.

Now, there are obviously pieces of HTTP.jl that would be safe/nice to split off into dedicated packages: the HTTP.URIs module, for example, has fairly mature code, straightforward interface, and you would expect very little in terms of needed enhancements or issues. Also w/ the HTTP.Nitrogen module, which provides all the server functionality; it’s not quite as tightly coupled w/ the rest of the package and there are plenty of user use-cases that involve making requests, but not needing server functionality.

Anyway, for the moment, this has been a great solution that has kept basic web functionality active and maintained for Julia, even if it goes against traditional “unix” philosophy.

5 Likes

It is impossible to draw a clear line, but I think “abstraction level” is a key factor in determining whether code should be a package or not. Abstraction is arguably the most important concept in any programming language; it makes concrete procedures abstract and frees users from details.

If your package abstracts some procedure at a high level, I think it is worth packaging and registering it as a public package. For instance, let’s consider an imaginary package, Sorting.jl, which offers a sort function to sort elements in an array (of course, we know the sort function in Base, but here we assume there was no such function in Base). I think this is a kind of high-level abstraction because there are so many sorting algorithms and there are various ways to implement an algorithm. However, once we abstract it as the Sorting.jl package, we don’t need to care about its internals and we can leverage our productivity. On the opposite extreme, if we create SumOfSecondAndThirdElementsOfAnArray.jl, the implementation would be straightforward and there is no abstraction at all.

1 Like

In practice, my lower bound for packaging something is that

  1. I use it in multiple places, and
  2. it benefits from unit testing and CI.

So besides code reuse, a major benefit of packages for me is that I can set up CI for them. The upper bound (breaking up code into smaller packages) is even more fuzzy; it should make sense conceptually and provide for clean APIs.

Julia packages are really lightweight. It takes a few minutes to create one (with CI and code coverage tools set up automatically), which makes the fixed cost trivial. So a lot of small packages are to be expected. I think this is good, even for one-liners that one could replace with equivalent code.

As an example, consider ArgCheck.jl, which allows

@argcheck x < 0

instead of something like

x < 0 || throw(ArgumentError("x < 0 must hold"))

It is not that I am saving about 20 characters, but that the intent is communicated more clearly. Totally fine as a small package which does one thing and does it well. If it was buried in SomeCollectionofUtilities.jl I may not bother importing all of those. I like modularity.
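To make the comparison concrete, here is a minimal sketch of how a macro in this spirit could be written. The names `@simple_argcheck` and `negative_sqrt` are hypothetical and used only for illustration; this is not ArgCheck.jl's actual implementation, which may differ in its error messages and features.

```julia
# Hypothetical stand-in for an @argcheck-style macro (not ArgCheck.jl's code).
macro simple_argcheck(cond)
    # Build the error message from the source text of the condition.
    msg = string(cond, " must hold")
    # Expand to: cond || throw(ArgumentError(msg))
    return :($(esc(cond)) || throw(ArgumentError($msg)))
end

# Hypothetical function that only accepts negative inputs.
function negative_sqrt(x)
    @simple_argcheck x < 0
    return sqrt(-x)
end
```

With this sketch, a call with a positive argument throws an `ArgumentError` whose message repeats the condition, which is the same behavior as the hand-written one-liner above.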

1 Like

<offtopic>

One of those often cited quotes for which there seems to be no direct evidence. Could be that it was told verbally, though it is also speculated that it’s a paraphrase of

It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.

which can be found in

Einstein, A. (1934). On the Method of Theoretical Physics. Philosophy of Science, 1(2), 163-169. Retrieved from JSTOR.

see this pdf page 4 paragraph 3: https://www.stmarys-ca.edu/sites/default/files/attachments/files/On_The_Method_of_Theoretical_Physics.pdf

</offtopic>

6 Likes

I really think this recent trend of registering one-function packages is harmful for the Julia package ecosystem in the long run. Right now, it seems not too uncommon for a registered package (one that actually provides valuable functionality) to also provide 2-3 tiny spin-off packages.

These spin-off packages often contain only a single utility function (typically a Base function specialized on some arguments) and have almost no use outside the original package they were created in.

This will make it harder to find packages that actually do valuable stuff. It will bloat dependency lists and make it less clear what a package depends on, make diff lists larger when upgrading packages, create more overhead when it comes to reviewing package tags and running CI, and make things harder for new contributors, since they have to get an overview of the whole dependency chain and how everything fits together, etc.

Keep your utility functions inside your packages. Only split out stuff into its own package if this provides a significant value on its own and will independently be developed. Don’t split something out because you think that it will be a large independent thing in the future, wait until it has actually happened from developments inside the main package. That is my opinion.

12 Likes

I think this is the right approach, but do you think that the problem you describe is happening in practice? With the registered packages I use, I did not see this trend, almost the opposite. For example, almost embarrassingly, sometimes I just use Lazy.jl for @forward (and I am fine with that, no need for a separate package).

From reviewing METADATA? Yes, I see quite a few packages on the borderline of too small so I leave someone else to make the decision of what to do. I think this happens a lot. It’s just, these are the packages people don’t tend to use…

I will say that a package which just defines a single type and a bunch of useful overloads is very nice though. That’s a small package that I like.

2 Likes

One example I think highlights this question is the PR for:

  • Pluck.jl – A package for plucking random elements from containers

The package may be worthless, but it shows the trade-off between modularity and intricacy.


From the PR,

@kristoffer.carlsson makes the totally valid claim that:

This is called sampling (with or without replacement) and is provided by http://juliastats.github.io/StatsBase.jl/latest/sampling.html#Sampling-from-Population-1.

but my retort is that it:

Seems a little like bloatware to load 15k+ LOC for a sample function?

// why do you need statistics to pull a random item out of a hat?


The package probably should be left in the dust. But I think it does raise the question,

Would an independent Sample (or Pluck) package develop better in isolation without living in the Statistics ecosystem?

edit: …and would the added benefit be worth the inconvenience it places on others (by storing it in METADATA)?

How about creating another (official) list of tiny packages where these small, dedicated things can be registered? Like gists on GitHub, or npm for Node.

2 Likes

@innerlee, that sounds pretty interesting!

  • Maybe that could work in Pkg3?
  • But how would packages from the two different repo banks work together (what if Foo.jl is in both and they each have a bar method)?

Also, the numbers in the following comment come from the tables below.

Lines of Code in Pluck:

Package    LOC
Total      31
Pluck      31

Lines of Code in StatsBase:

Package             LOC
Total               17.7k
DataStructures      5,924
StatsBase           5,715
SpecialFunctions    1,825
BinDeps             1,816
Compat              1,189
SHA                 758
URIParser           497

// using src dir files only

1 Like

IMO the problem is that Pluck isn’t really a package; it’s a single one-line utility function that you define locally if you need it. If you need more serious sampling you need to resort to more comprehensive packages like StatsBase anyway. Also, loading code shouldn’t be a problem: we all happily load Base every time we start Julia, and most users don’t use everything in there either.
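As a rough illustration of "define it locally", a helper like the following would cover the single-item case. The name `pluck` and the example data are hypothetical, and this is not taken from Pluck.jl's source; it simply relies on `rand` accepting a collection, which is part of Base.

```julia
# Hypothetical local helper: pick one random element from a collection.
pluck(xs) = rand(xs)

hat = ["rabbit", "dove", "scarf"]
item = pluck(hat)  # one of the three strings, chosen uniformly at random
```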

2 Likes

The trade-off here seems to be between computer time (loading that package) and programmer time (single package vs breaking it up to smaller packages, but at the same time keeping them in sync and aiming for a well-designed API). Given that

julia> @time using StatsBase
  0.090097 seconds (36.51 k allocations: 2.486 MiB, 63.90% gc time)

wasting programmer time instead of computer time on this may not be justified.

A single random item is of course not challenging. Multiple random items, with or without replacement, possibly with weights, are trickier.
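To show where the extra difficulty comes from, here is a small sketch of the two harder cases, written for Julia 1.x where `Random` is a standard library (an assumption, since the thread predates that layout). The function names are hypothetical, and this is a naive illustration rather than how StatsBase implements sampling.

```julia
using Random  # standard library in Julia 1.x

# Draw k distinct items (sampling without replacement) by taking the
# first k entries of a random permutation of the indices.
function sample_without_replacement(xs::AbstractVector, k::Integer)
    0 <= k <= length(xs) || throw(ArgumentError("k must be between 0 and length(xs)"))
    return xs[randperm(length(xs))[1:k]]
end

# Draw one item with probability proportional to its weight, by walking
# the cumulative sum of the weights until it passes a random target.
function weighted_pick(xs::AbstractVector, weights::AbstractVector{<:Real})
    length(xs) == length(weights) || throw(ArgumentError("lengths must match"))
    (all(w -> w >= 0, weights) && sum(weights) > 0) ||
        throw(ArgumentError("weights must be nonnegative with a positive sum"))
    target = rand() * sum(weights)
    acc = 0.0
    for (x, w) in zip(xs, weights)
        acc += w
        target < acc && return x
    end
    return last(xs)  # fallback in case floating-point round-off skips the loop exit
end
```

Even this sketch needs argument checks, a round-off fallback, and a decision about which algorithm to use, which is the kind of detail a maintained package is better placed to get right.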

I agree with this.

Basically, if you’re bothered by the loading time of StatsBase (0.16 seconds here) then the burden of searching, cloning, or writing a slimmed version of what you need should fall on you.

I guess it comes down to what’s worse: including one-line packages in METADATA or a 0.16-second loading time for some packages.

Another important metric is:

julia> @time Pkg.add("StatsBase")
INFO: Updating cache of SpecialFunctions...
INFO: Installing BinDeps v0.7.0
INFO: Installing Compat v0.37.0
INFO: Installing DataStructures v0.7.2
INFO: Installing SHA v0.5.2
INFO: Installing SpecialFunctions v0.3.5
INFO: Installing StatsBase v0.19.1
INFO: Installing URIParser v0.2.0
INFO: Building SpecialFunctions
INFO: Package database updated
  9.210089 seconds (5.45 M allocations: 373.096 MiB, 2.71% gc time)

julia> @time Pkg.clone("https://github.com/djsegal/Pluck.jl")
INFO: Cloning Pluck from https://github.com/djsegal/Pluck.jl
INFO: Computing changes...
INFO: No packages to install, update or remove
  3.621915 seconds (8.33 M allocations: 537.656 MiB, 17.28% gc time)

// note: I warmed up Pkg.add and Pkg.clone before doing the timing

I am not so sure about that. Do you wipe your package directory on a regular basis?

1 Like

If we bikeshed this some more, we can see that Pluck downloaded about 200 times slower than StatsBase when comparing speeds in downloaded LOC/sec
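The "about 200 times" figure can be checked from the numbers quoted earlier in the thread (the LOC tables and the two `@time` results). The variable names below are only for illustration, and the 17.7k total is taken as 17,700.

```julia
# Figures quoted earlier in the thread.
statsbase_loc  = 17_700     # "17.7k" total LOC, dependencies included
statsbase_secs = 9.210089   # time for Pkg.add("StatsBase")
pluck_loc      = 31         # total LOC in Pluck
pluck_secs     = 3.621915   # time for Pkg.clone of Pluck.jl

# Download "speed" in lines of code per second.
statsbase_rate = statsbase_loc / statsbase_secs
pluck_rate     = pluck_loc / pluck_secs

# How many times faster StatsBase arrived, per line of code.
ratio = statsbase_rate / pluck_rate
println(ratio)
```

The ratio comes out a little over 220, which matches the rough "200 times" claim.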

1 Like