Package size and scope?

djsegal · November 25, 2017, 7:38pm

Hi All –

Recently started developing packages. Was wondering how everyone feels about package size and scope.

Do you feel like there is a minimum or maximum size/scope a package should be?

As a Developer:

Do you include just a few things in a package (and do them well)? Or do you put in everything but the kitchen sink?

As a User:

Do you include a package with 10 functions to just use 1 of them? Or do you not want to be bothered with a la carte packages?

For context, (and a tad less biased)

Leftpad comes to mind as an example of a package that probably went too small:

NPM & left-pad: Have We Forgotten How To Program?

// haven’t seen this explicitly discussed anywhere. so i’d just like to hear some people’s thoughts

chakravala · November 25, 2017, 10:22pm

This was discussed previously on another thread, but I can’t locate where the discussion took place.

The current package manager was getting very slow due to the large number of packages. So if you have a bunch of related features, it would make sense to make a single package for that, so as not to make the number of packages unnecessarily huge.

If you have a bunch of coherent functionality that depends on each other to provide an API, then make a single package out of that.

You probably don’t need to split it up into multiple packages unless you have a particularly huge set of functionality. If you really have so much related functionality, like DifferentialEquations.jl, then it makes sense to split it into separate packages, since many of those components can live independently and offer a full set of related functionality that belongs together, yet can be independent of the entire DifferentialEquations ecosystem.

But if you have, say 10 related functions that help accomplish some task in a specific area, then it would make sense to keep them together. 10 functions is not a critical mass of functionality. However, if your 10 functions are completely unrelated to each other, then it doesn’t make sense to package them together.

If your functionality consists of a single function with only, say, 10 lines of code, it might not make sense to turn it into a registered package at all, since it is not really an API. It might make more sense to write a blog post or a discussion post about it, or make a Jupyter notebook.

There probably isn’t a maximum size for a package, but maybe a good guideline is the Unix philosophy.

However, I would prefer not seeing lots of tiny packages with single functions in them.

This whole NPM disaster thing is not properly following the UNIX philosophy.

Consider this quote from Einstein:

Make everything as simple as possible, but not simpler.

Not everything is meant to be a package, some snippets of code are better treated as examples for blogs, discussion threads, or Jupyter notebooks, or gists.

So it’s a fine balance point. Use fine judgement and wisdom.

As I said, this has been discussed before, but it’s buried in some other thread somewhere.

chakravala · November 25, 2017, 10:38pm

Another note on Einstein’s quote:

Making everything as simple as possible, but not simpler: for example, if you have a bunch of related functions that work together, then it is simpler to make them into a single package. If you have a bunch of related functionality that is very complex and some of it can live independently, then it is simpler to split it up.

quinnj · November 26, 2017, 3:58am

I’ll add some perspective as an active maintainer of the HTTP.jl package. In the Julia early days (circa 2012), there was a Hacker School project to put basic web functionality together in the form of the HttpServer, HttpCommon, Requests, and URIParser packages. Due to the transient/short-lived nature of Hacker School, these packages came out w/ an initial “bang” of functionality and usefulness, and then were hardly touched for years. Functionality was slowly duplicated across these packages as one-off contributors tried to fix a certain issue. Duplicate issues were also filed across these repos as users had a hard time knowing exactly which package was the exact cause of their issue.

HTTP.jl was born w/ the goal of modernizing the foundational webstack code in Julia and providing a cleaner/easier path forward in terms of maintenance. It began literally by merging the git histories/repos of the mentioned packages above and consolidation/enhancements began. In this case, merging the packages has led to an overall cleaner code organization, great reduction in duplicate “utils” functions, and an easier “one-stop-shop” for users when they need web functionality or have web-related issues. It’s also much easier to maintain as there is a single package’s tests to be run w/ enhancements/improvements, as well as a single package to tag/release.

Now, there are obviously pieces of HTTP.jl that would be safe/nice to split off into dedicated packages: the HTTP.URIs module, for example, has fairly mature code, straightforward interface, and you would expect very little in terms of needed enhancements or issues. Also w/ the HTTP.Nitrogen module, which provides all the server functionality; it’s not quite as tightly coupled w/ the rest of the package and there are plenty of user use-cases that involve making requests, but not needing server functionality.

Anyway, for the moment, this has been a great solution that has kept basic web functionality active and maintained for Julia, even if it goes against traditional “unix” philosophy.

bicycle1885 · November 26, 2017, 6:12am

It is impossible to draw a clear line, but I think “abstraction level” is a key factor in determining whether code should be a package or not. Abstraction is arguably the most important concept in any programming language; it makes concrete procedures abstract and frees users from details.

If your package abstracts some procedure at a high level, I think it is worth packaging and registering it as a public package. For instance, let’s consider an imaginary package, Sorting.jl, which offers a sort function to sort elements in an array (of course, we know the sort function in Base, but here we assume there was no such function in Base). I think this is a kind of high-level abstraction because there are so many sorting algorithms and there are various ways to implement an algorithm. However, once we abstract it as the Sorting.jl package, we don’t need to care about its internals and we can improve our productivity. On the opposite extreme, if we create SumOfSecondAndThirdElementsOfAnArray.jl, the implementation would be straightforward and there is no abstraction at all.

Tamas_Papp · November 26, 2017, 7:30am

In practice, my lower bound for packaging something is that

I use it in multiple places, and
it benefits from unit testing and CI.

So besides code reuse, a major benefit of packages for me is that I can set up CI for them. The upper bound (breaking up code into smaller packages) is even more fuzzy; it should make sense conceptually and provide for clean APIs.

Julia packages are really lightweight. It takes a few minutes to create one (with CI and code coverage tools set up automatically), which makes fixed cost trivial. So a lot of small packages are expected. I think this is good, even for one-liners that one could replace with equivalent code.

As an example, consider ArgCheck.jl, which allows

@argcheck x < 0

instead of something like

x < 0 || throw(ArgumentError("x < 0 must hold"))

It is not that I am saving about 20 characters, but that the intent is communicated more clearly. Totally fine as a small package which does one thing and does it well. If it was buried in SomeCollectionofUtilities.jl I may not bother importing all of those. I like modularity.
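To make the comparison concrete, here is a minimal sketch of how a check macro of this kind could be written. The macro name `@mycheck` and the function `neg_sqrt` are hypothetical, and this is not ArgCheck.jl's actual implementation; it only illustrates how little code sits behind such a one-liner.

```julia
# Hypothetical macro for illustration only; not ArgCheck.jl's implementation.
macro mycheck(cond)
    # Build the error message from the source text of the condition,
    # e.g. "x < 0 must hold".
    msg = string(cond, " must hold")
    return :( $(esc(cond)) || throw(ArgumentError($msg)) )
end

# Hypothetical function that only accepts negative input.
function neg_sqrt(x)
    @mycheck x < 0
    return sqrt(-x)
end
```

Calling `neg_sqrt(1.0)` would throw an `ArgumentError` whose message is derived from the condition, which is the "intent is communicated more clearly" benefit described above.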

Evizero · November 26, 2017, 9:45am

<offtopic>

One of those often cited quotes for which there seems to be no direct evidence. Could be that it was told verbally, though it is also speculated that it’s a paraphrase of

It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.

which can be found in

Einstein, A. (1934). On the Method of Theoretical Physics. Philosophy of Science, 1(2), 163-169. Available on JSTOR.

see this pdf page 4 paragraph 3: https://www.stmarys-ca.edu/sites/default/files/attachments/files/On_The_Method_of_Theoretical_Physics.pdf

</offtopic>

kristoffer.carlsson · November 26, 2017, 12:21pm

I really think this recent trend of registering one-function packages is harmful for the Julia package ecosystem in the long run. Right now, it seems not too uncommon for a registered package (one that actually provides valuable functionality) to also provide 2-3 tiny spin-off packages.

These spin-off packages often contain only a single utility function (typically a Base function specialized on some arguments) and have almost no use outside the original package they were created in.

This will make it harder to find packages that actually do valuable stuff. It will bloat dependency lists and make it less clear what a package depends on, make diff lists larger when upgrading packages, create more overhead for reviewing, tagging, and CI, and make things harder for new contributors, since they have to get an overview of the whole dependency chain and how everything fits together, etc.

Keep your utility functions inside your packages. Only split out stuff into its own package if this provides a significant value on its own and will independently be developed. Don’t split something out because you think that it will be a large independent thing in the future, wait until it has actually happened from developments inside the main package. That is my opinion.
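As a rough sketch of what "keep your utility functions inside your packages" can look like in practice, the helper below lives in an unexported submodule instead of a separately registered package. All names (`MyPackage`, `Utils`, `clamp01`, `process`) are hypothetical and chosen only for illustration.

```julia
module MyPackage   # hypothetical package name

export process

# Internal helpers stay in an unexported submodule rather than
# becoming a separate registered package.
module Utils
    # Hypothetical helper: clamp a number to the interval [0, 1].
    clamp01(x) = clamp(x, zero(x), one(x))
end

using .Utils: clamp01

# Public API that happens to rely on the internal helper.
process(xs) = map(clamp01, xs)

end # module
```

If the helper later grows into something that is developed and used independently, it can still be split out at that point.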

Tamas_Papp · November 26, 2017, 12:35pm

I think this is the right approach, but do you think that the problem you describe is happening in practice? With the registered packages I use, I did not see this trend, almost the opposite. For example, almost embarrassingly, sometimes I just use Lazy.jl for @forward (and I am fine with that, no need for a separate package).

ChrisRackauckas · November 26, 2017, 1:13pm

From reviewing METADATA? Yes, I see quite a few packages on the borderline of too small so I leave someone else to make the decision of what to do. I think this happens a lot. It’s just, these are the packages people don’t tend to use…

ChrisRackauckas · November 26, 2017, 1:17pm

I will say that a package which just defines a single type and a bunch of useful overloads is very nice though. That’s a small package that I like.

djsegal · November 28, 2017, 12:25am

One example I think highlights this question is the PR for:

Pluck.jl – A package for plucking random elements from containers

The package may be worthless, but it shows the trade-off between modularity and intricacy

From the PR,

@kristoffer.carlsson makes the totally valid claim that:

This is called sampling (with or without replacement) and is provided by http://juliastats.github.io/StatsBase.jl/latest/sampling.html#Sampling-from-Population-1.

but my retort is that it:

Seems a little like bloatware to load 15k+ LOC for a sample function?

// why do you need statistics to pull a random item out of a hat?

The package probably should be left in the dust. But I think it does beg the question,

Would an independent Sample (or Pluck) package develop better in isolation without living in the Statistics ecosystem?

edit: …and would the added benefit be worth the inconvenience it places on others (by storing it in METADATA)?

innerlee · November 28, 2017, 1:15am

How about creating another (official) list of tiny packages where that kind of dedicated small stuff could be registered? Like gists on GitHub, or npm for Node.

djsegal · November 28, 2017, 5:30am

@innerlee, that sounds pretty interesting!

Maybe that could work in Pkg3?
But how would packages from the two different repo banks work together (what if Foo.jl is in both and they each have a bar method)?

Also, the numbers from the following comment come from the tables below

Lines of Code in Pluck:

Package	LOC
Total	31
Pluck	31

Lines of Code in StatsBase:

Package	LOC
Total	17.7k
DataStructures	5,924
StatsBase	5,715
SpecialFunctions	1,825
BinDeps	1,816
Compat	1,189
SHA	758
URIParser	497

// using src dir files only

fredrikekre · November 28, 2017, 8:09am

IMO the problem is that Pluck isn’t really a package; it’s a single one-line utility function that you define locally if you need it. If you need more serious sampling, you need to resort to more comprehensive packages like StatsBase anyway. Also, loading code shouldn’t be a problem: we all happily load Base every time we start julia, and most users don’t use everything in there either.
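For reference, a local definition along these lines might look like the sketch below. The names `pluck` and `pluck!` are hypothetical and not taken from Pluck.jl's source; the sketch assumes "plucking" means either picking or removing one random element.

```julia
# Hypothetical local helpers; names are illustrative, not Pluck.jl's API.

# Pick one element at random and leave the collection unchanged.
pluck(xs) = rand(xs)

# Remove one random element from a vector and return it.
pluck!(xs::AbstractVector) = splice!(xs, rand(eachindex(xs)))
```

Both rely only on `rand` and `splice!` from Base, so nothing needs to be installed or loaded to use them.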

Tamas_Papp · November 28, 2017, 8:49am

The trade-off here seems to be between computer time (loading that package) and programmer time (single package vs breaking it up to smaller packages, but at the same time keeping them in sync and aiming for a well-designed API). Given that

julia> @time using StatsBase
  0.090097 seconds (36.51 k allocations: 2.486 MiB, 63.90% gc time)

wasting programmer time instead of computer time on this may not be justified.

A single random item is of course not challenging. Multiple random items, with or without replacement, possibly with weights, are trickier.
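To illustrate why the multi-item and weighted cases take more care, here is a minimal sketch using only the `Random` standard library. The function names are hypothetical, and this is not how StatsBase implements sampling; it uses a partial Fisher-Yates shuffle for sampling without replacement and a cumulative-sum scan for a single weighted draw.

```julia
using Random

# Hypothetical helper: draw k distinct elements via a partial Fisher-Yates shuffle.
function sample_without_replacement(rng::AbstractRNG, xs::AbstractVector, k::Integer)
    n = length(xs)
    0 <= k <= n || throw(ArgumentError("need 0 <= k <= length(xs)"))
    pool = collect(xs)              # copy, so the caller's vector is untouched
    for i in 1:k
        j = rand(rng, i:n)          # choose from the not-yet-selected tail
        pool[i], pool[j] = pool[j], pool[i]
    end
    return pool[1:k]
end

# Hypothetical helper: draw one element with probability proportional to its weight.
function weighted_pick(rng::AbstractRNG, xs::AbstractVector, w::AbstractVector{<:Real})
    length(xs) == length(w) || throw(ArgumentError("xs and w must have equal length"))
    total = sum(w)
    total > 0 || throw(ArgumentError("weights must sum to a positive value"))
    t = rand(rng) * total           # uniform point in [0, total)
    acc = zero(total)
    for (x, wi) in zip(xs, w)
        acc += wi
        t < acc && return x
    end
    return last(xs)                 # guard against floating-point round-off
end
```

Even this sketch skips things a real sampling library has to handle, such as negative weights, weighted sampling without replacement, and efficient algorithms for large inputs.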

yakir12 · November 28, 2017, 8:53am

I agree with this.

Basically, if you’re bothered by the loading time of StatsBase (0.16 seconds here) then the burden of searching, cloning, or writing a slimmed version of what you need should fall on you.

I guess it comes down to which is worse: including one-line packages in METADATA or a 0.16-second loading time for some packages.

djsegal · November 28, 2017, 10:49am

Another important metric is:

julia> @time Pkg.add("StatsBase")
INFO: Updating cache of SpecialFunctions...
INFO: Installing BinDeps v0.7.0
INFO: Installing Compat v0.37.0
INFO: Installing DataStructures v0.7.2
INFO: Installing SHA v0.5.2
INFO: Installing SpecialFunctions v0.3.5
INFO: Installing StatsBase v0.19.1
INFO: Installing URIParser v0.2.0
INFO: Building SpecialFunctions
INFO: Package database updated
  9.210089 seconds (5.45 M allocations: 373.096 MiB, 2.71% gc time)

julia> @time Pkg.clone("https://github.com/djsegal/Pluck.jl")
INFO: Cloning Pluck from https://github.com/djsegal/Pluck.jl
INFO: Computing changes...
INFO: No packages to install, update or remove
  3.621915 seconds (8.33 M allocations: 537.656 MiB, 17.28% gc time)

// note: I warmed up Pkg.add and Pkg.clone before doing the timing

Tamas_Papp · November 28, 2017, 11:37am

I am not so sure about that. Do you wipe your package directory on a regular basis?

yakir12 · November 28, 2017, 11:49am

If we bikeshed this some more, we can see that Pluck downloaded about 200 times slower than StatsBase when comparing speeds in downloaded LOC/sec…
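The arithmetic behind that figure can be reproduced from the numbers quoted earlier in the thread (the src-dir line counts and the two `@time` results). The variable names below are just for illustration.

```julia
# Line counts from the LOC tables (src dir files only).
statsbase_loc = 5_924 + 5_715 + 1_825 + 1_816 + 1_189 + 758 + 497  # StatsBase plus dependencies
pluck_loc     = 31

# Wall-clock times reported by @time for Pkg.add and Pkg.clone.
statsbase_rate = statsbase_loc / 9.210089   # LOC per second for StatsBase
pluck_rate     = pluck_loc / 3.621915       # LOC per second for Pluck

ratio = statsbase_rate / pluck_rate
println(round(ratio; digits = 1))
```

The ratio comes out a bit above 200, consistent with the "about 200 times" estimate.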
