"Names" packages?

The issue in this post has come up a few times in different contexts:

Julia’s multiple dispatch is very convenient for users, but the “who owns this name” issue can be awkward for package developers.

This is common enough that I wonder if we should have collections of “namespace packages”. For example, JuliaStats could have a StatsNames.jl that would “own” lots of stats-related names, and have no functionality beyond this. This could be a common and very lightweight dependency.

The potential benefits are pretty clear, but I can see a few potential problems:

  1. Methods on Base types would constitute type piracy
  2. We’d have to come to an agreement about the high-level semantics of various functions and reasonable return types. It would be important for this to be sensible without being overly restrictive.
  3. Even a small degree of bureaucracy has the potential to discourage new developers and slow growth of the ecosystem
  4. Many function names are already tied to existing packages. If these have methods for types those packages don’t own, it could be difficult to extricate them.

Related to this last point, many names correspond to structs, and I think it’s safe to assume we’d need to leave those alone.

Anyway… If this worked out, the end result could be very nice: an ecosystem with a consistent “look and feel”. I can see some tradeoffs, but not all of them are clear to me yet.

What do you think? Is this a good approach? Is there a better way to approach this problem?

5 Likes

We already have one for CommonSolve.jl. I think some others could be needed. Making sure to avoid ambiguities is the real issue.

3 Likes

I think JuliaData/DataAPI.jl (“a data-focused namespace for packages to share functions”) is a good example of this, with some good patterns to borrow: every function has a single “owner” who is allowed to define generic fallbacks and methods for Base types; everyone else can extend it but cannot pirate (to avoid clashes).
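As a rough illustration of that ownership pattern, here is a minimal sketch. The names TableNames, rowcount, PkgA and ColumnTable are all made up for this example and are not taken from DataAPI.jl; the owner module declares the function and is the only place that adds a method for a Base type, while the downstream module extends it only for a type it defines itself.

```julia
# Hypothetical names-only package: it declares the function and owns the fallbacks.
module TableNames
    export rowcount

    # Zero-method declaration; downstream packages add methods to this function.
    function rowcount end

    # Only the owner defines methods for types it does not own (here a Base type).
    rowcount(v::AbstractVector) = length(v)
end

# Hypothetical downstream package: extends rowcount only for its own type.
module PkgA
    import ..TableNames: rowcount

    struct ColumnTable
        columns::Dict{Symbol,Vector{Int}}
    end

    rowcount(t::ColumnTable) =
        isempty(t.columns) ? 0 : length(first(values(t.columns)))
end

using .TableNames

n_vec = rowcount([1, 2, 3])
n_tab = rowcount(PkgA.ColumnTable(Dict(:x => [10, 20])))
```

Because PkgA never touches AbstractVector, loading it cannot change what rowcount does for anyone else's types.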

6 Likes

I am not sure if an undisciplined collection of names, or even a collection of names and associated “general idea” labels, is a good idea. Such a collection risks becoming a source of type puns that confuse users and developers about what contracts the functions are supposed to have.

Two methods whose parameters are different abstract types and have no traits in common are, in my opinion, different functions. For example, in one post @Tamas_Papp said

Even within a single package, two versions of StatsBase.sample have parameters n_chains and nchains. Once multiple packages and dispatch get involved, it’ll be even more confusing.

I think it would be helpful to have declared bounds on what shared parameter names and types or traits a function accepts and returns.

6 Likes

Such packages already exist, and are conventionally named …Base.jl or similar.

Owning and providing a symbol is not sufficient; what is important is to have some definition for an API that uses those names, with changes according to SemVer. This allows dependencies to track them in a clean way.

This also means that it is better to have a small, lightweight package for each API, rather than a kitchen sink with a lot of symbols, which would require bumping versions for each unrelated change.

For statistical models, StatsBase.jl provides this API in a rather nice way. It also has other functionality, so splitting that part could make sense. But other “statistics related” APIs that are really generic should go in similar, small packages.

6 Likes

What do you mean by “some definition for an API”? I do not know StatsBase.jl, but it seems like (and its name suggests that) it is not just an interface, but also a partial / default / fallback (?) implementation, which also provides a type hierarchy.

My concern is that an implementation is usually more opinionated, and its opinions may diffuse into the interface undetected, hindering other implementations. On the other hand I see that providing types and basic implementations can make life easier, and that it is common practice in the Julia world.

As a concrete example, in ActorInterfaces.Classic we already eliminated the Actor type, but we still have Addr. I like it, for me it feels like the Addr type glues together the whole interface into a coherent unit, but Actors.jl uses a different notion of links, so it has the awkward definition: Link{C} <: ActorInterfaces.Classic.Addr. The main point here is that addresses and links have slightly different semantics.

I would like to better understand why/when providing an implementation and root types together with the interface is a good way to go in Julia. I feel that either multiple dispatch or the nature of the community (e.g. open source, communication heavy) makes a difference compared to other languages, but I have not yet been able to grasp it.

1 Like

I can understand the value in what you suggest, but it also imposes some constraints on future packages. For example, StatsBase has

loglikelihood(model::StatisticalModel)

This doesn’t make sense to me; I would have chosen

loglikelihood(model, observation)

returning a function params -> Real.

Similarly, there are problems in Distributions.jl with types being too constrained, so they’re not usable with symbolic values. Changing this becomes increasingly difficult as more packages come to depend on it.

The problem, I think, is that pressure to commit early to this sort of thing leads to a very small window of discussion, followed by an API that’s practically set in stone.

I’ve seen cases where the “name ownership” question causes problems (the OP), and where an overly-strict API causes problems (Distributions.jl). Are there cases where a more permissive approach causes problems, in the presence of multiple dispatch?

4 Likes

I’m not sure what loglikelihood(::StatisticalModel) means either; maybe it’s supposed to return a function.

If StatsBase had just taken the name loglikelihood, then different implementing packages would interpret the intent in different ways. Some might think it’s supposed to take only a model parameter; others would have it take a model and observation; still others, a model with collection of observations. Those are all different mathematical functions (loglikelihood_function, loglikelihood, joint_loglikelihood) and having them all use the same name is confusing. Having them all use the same Julia function is even more confusing, since then even tooling can’t help distinguish them.
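To make the distinction concrete, here is a small sketch of the three functions kept under separate names. The UnitNormal model (a normal distribution with unit variance and unknown mean μ) and the exact argument order are my assumptions for illustration only; none of this is StatsBase's API.

```julia
# Hypothetical toy model: normal distribution with unit variance, unknown mean μ.
struct UnitNormal end

# Log-density of a single observation x given the parameter μ.
loglikelihood(::UnitNormal, μ::Real, x::Real) = -0.5 * log(2π) - 0.5 * (x - μ)^2

# Sum of the per-observation log-likelihoods over a collection of observations.
joint_loglikelihood(m::UnitNormal, μ::Real, xs) =
    sum(x -> loglikelihood(m, μ, x), xs)

# Fix the data and return a function of the parameter alone.
loglikelihood_function(m::UnitNormal, xs) = μ -> joint_loglikelihood(m, μ, xs)

data = [0.0, 1.0]
ℓ = loglikelihood_function(UnitNormal(), data)
```

With distinct names, a call like ℓ(0.5) cannot be mistaken for a single-observation evaluation, and tooling such as methods or the help system lists each concept separately.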

1 Like

Did you take a look at DataAPI? There they specify a precise signature and API for their functions, e.g. https://github.com/JuliaData/DataAPI.jl/blob/c46688cce0727cbf6912a03998675db422bb85fa/src/DataAPI.jl#L72-L96. So while I agree that just having an assorted collection of names and nothing more is not a good idea, I also don’t think that’s what is done in practice anyway.

2 Likes

That’s what I’m advocating. :slight_smile:

My point is that when you decide your version of f() takes/returns different abstract types/traits than the original, it’s time to use a different function rather than repurposing the original.

1 Like

A consistent definition of an interface. Eg the ones in Base.

Possibly, but this does not have to be the case. Julia has a lot of well-designed interfaces, both in Base and the package ecosystem. Whether a fallback/default implementation makes sense depends on circumstances.

Sure, all interfaces do. Designing APIs is very tricky in general, and in Julia this is complicated by performance concerns (both runtime and compile time). This requires a lot of iteration and participation from stakeholders, and occasionally a breaking change when it is warranted — this is what SemVer is for.

Eg in a future major version of StatsBase, loglikelihood could be fixed, even though for this particular package that is less likely because there is no single maintainer with a unifying vision. Or packages that need the concept of a loglikelihood along the lines you suggest (which I agree with, incidentally) could start their own lightweight API package.

Not necessarily. A package I consider exemplary both in terms of the process and the end result is

And, again, if an API is unsatisfactory, that’s what breaking changes are for. Nothing is fixed forever.

3 Likes

I agree APIs are important and useful. The problem they address is related to the one I’m seeing, but not entirely the same.

Suppose an API exists but doesn’t work well for your approach. Maybe there are some philosophical differences, or maybe the API was designed with assumptions that constrain the design space in a way that makes the approach you’d like to take awkward or impossible.

The obvious solution is to have a discussion with whoever is maintaining the API, and update it to suit your needs without having too great an impact on other packages. In practice, this often just doesn’t work, for a few reasons.

First, there’s usually some inherent resistance to a new approach. It can sometimes be seen as “different just to be different”. We often see things we understand deeply as “the right way to do it” until there’s strong evidence to the contrary. And even if devs are in principle open to change, it’s sometimes clear there’s a strong preference to just leave things as they are.

Second, APIs are not developed in a vacuum. API designers almost always have particular implementations in mind as they build the abstractions. Once we have an implementation we like, we can have trouble seeing its shortcomings.

Finally, the difficulty of changing an API increases with the number of its users and reverse dependencies. The cost of negotiation can quickly become discouraging.

With all of this, it’s understandable why a new library might go outside an ill-fitting API just to get things working. But then once things work and both the API and the new library gain users, change becomes harder still, and the ecosystem more fragmented.

In any case, it’s hard to imagine a time when every use of every name complies neatly with some widely-used API. The challenge is how to make things easily interoperable even when this isn’t the case.

1 Like

If I understand right, we’re talking about merging two functions with different semantics into the same function object? Could you explain what you see as the benefits of doing that?

1 Like

If they are not describing the same generic concept, then they should not be methods of the same function. That’s why Julia has namespaces: you can write Game.push!(obj::Game.Object, p::Game.Player) to have a player push an object without it having anything to do with Base.push!. Using the same name for different purposes is already easy; the solution is module namespaces.
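A minimal sketch of that Game example, with made-up field names and a made-up notion of what "pushing" does (it just bumps the first coordinate). The point is only that Game.push! is a separate function object from Base.push!.

```julia
module Game
    struct Object
        name::String
        position::Vector{Int}
    end

    struct Player
        name::String
    end

    # A new function local to Game. Since Base.push! is not imported here,
    # this definition does not add a method to Base.push!.
    function push!(obj::Object, p::Player)
        obj.position[1] += 1   # hypothetical effect: shove the object one step
        return obj
    end
end

crate = Game.Object("crate", [0, 0])
Game.push!(crate, Game.Player("ann"))   # the game's own function

items = Int[]
push!(items, 1)                          # still Base.push!
```

The cost is that callers outside the module write the qualified name, which is exactly the tradeoff discussed in this thread.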

2 Likes

Sorry to interfere, but it seems that the discussion has deviated from the original post.

If I am getting it right, then the problem is not in the API or its meaning, but in the fact that

module A
    export foo
    struct A1 end
    foo(x::A1) = "hello"
end

module B
    export foo
    struct A2 end
    foo(x::A2) = "world"
end

using .A
using .B

julia> foo(A.A1())
WARNING: both B and A export "foo"; uses of it in module Main must be qualified
ERROR: UndefVarError: foo not defined

So, if I have a package with some functionality and for some reason implement a function with the same name as one in another package, suddenly everything breaks and the user has to use fully qualified names, which can be really tedious.

As a more concrete example, DataFrames.jl implements an innerjoin function. Now, if I implement a different data structure that also provides this function, then the user workflow would look like

using DataFrames

innerjoin(df1, df2)

ok, let’s add another library

using DataFrames
using SomeOtherLibrary

DataFrames.innerjoin(df1, df2)

It’s really inconvenient, and it would be much better if DataFrames and all other data-related packages just imported innerjoin from some lightweight package that only includes these common names.
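Here is a sketch of how that could look, using stand-in modules rather than the real packages. JoinNames, FramesLike, OtherTables, TableA and TableB are all hypothetical names; the "join" is just an intersection so the example stays short.

```julia
# Hypothetical names-only package: one declaration, no methods.
module JoinNames
    export innerjoin
    function innerjoin end
end

# Stand-in for a DataFrames-like package.
module FramesLike
    import ..JoinNames: innerjoin
    export innerjoin, TableA

    struct TableA
        rows::Vector{Int}
    end

    innerjoin(x::TableA, y::TableA) = TableA(intersect(x.rows, y.rows))
end

# Stand-in for some other table package.
module OtherTables
    import ..JoinNames: innerjoin
    export innerjoin, TableB

    struct TableB
        ids::Set{Int}
    end

    innerjoin(x::TableB, y::TableB) = TableB(intersect(x.ids, y.ids))
end

using .FramesLike
using .OtherTables

# Both modules export the same binding, so the unqualified name works.
joined_a = innerjoin(TableA([1, 2, 3]), TableA([2, 3, 4]))
joined_b = innerjoin(TableB(Set([1, 2])), TableB(Set([2, 5])))
```

Since both exports refer to the one function owned by JoinNames, there is no name clash, and dispatch picks the method by argument type.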

I gather from this discussion that this idea is somehow wrong, but I can’t quite understand why.
Sorry if I am the one who derailed the discussion :slight_smile:

2 Likes

Say we have two packages that export a foo function. using both of them forces the name to be qualified. The problem is similar to type piracy: importing one package changes the behavior of another.

We mostly have two extremes: either qualify all names, or follow a common API so the exact calls are the same.

What I’m suggesting is the possibility of a middle ground. If the methods we define are only on names we “own”, we ought to be able to still share the function and let multiple dispatch handle things for us.

To some extent, the problem can be addressed by expecting devs to export fewer names, or for users to only use using with specific names. I think the problem comes when we really want to export something and know it can in principle be used with a definition from another package, though the uses might be different.

I think you describe the problem well, thank you for the example!

1 Like

This is a good idea in general since it enhances readability. It’s a bit annoying to type, but IDE tooling can help by automatically inserting using Foo: bar, baz statements at the top of a file.
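For reference, a small sketch of what selective imports look like when two modules export the same name. PkgA and PkgB are hypothetical stand-ins mirroring the foo example earlier in the thread.

```julia
module PkgA
    export foo
    struct A1 end
    foo(::A1) = "hello"
end

module PkgB
    export foo
    struct B1 end
    foo(::B1) = "world"
end

# Bring in only the names we ask for; PkgB's names stay qualified.
using .PkgA: foo, A1
import .PkgB

greeting = foo(A1())             # PkgA.foo, unqualified
reply    = PkgB.foo(PkgB.B1())   # PkgB.foo, qualified
```

Nothing here requires the two foo functions to share any semantics, which is why this route avoids the clash without needing a common names package.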

The way I like to see multimethods is something like this diagram, where the Latin-lettered objects are in some sense related to the Greek-lettered objects. For example, incrementing a : \mathbb{R} by 1 is somehow equivalent to incrementing \alpha : \mathbb{Z} by 1, only in a different domain.

That’s exactly on point, not a diversion :slight_smile: .

I’ll take the example of DataFrames.select, which has

select(df::AbstractDataFrame, args...; copycols::Bool=true, renamecols::Bool=true)

This function has the “general idea” of “choose columns from a table”. It copies the selected columns by default, which causes a significant slowdown on large tables and deviates from the common Julia semantics of choosing elements from a collection (which doesn’t usually copy them). (BTW, I find that Bool flag arguments are often a sign of two functions living under the same name, which could be cleaned up by splitting them.)

Now if I define My.select which is another implementation of the “choose columns from a table” idea, I don’t want to use the column-copying implementation, but I do still want it to work on DataFrames. If I extend the hypothetical DataFramesFunctionNames.select(), my implementation is in conflict with DataFrames.select(), and I’m doing type piracy, which causes dispatch ambiguity in two senses: (1) the technical sense of which method should Julia use for DataFramesFunctionNames.select(::DataFrame) and (2) the semantic sense, of whether the columns should be copied. Even if I don’t need it to work on DataFrames, it will still be confusing about whether columns will be copied. Therefore, I will keep My.select separate from DataFrames.select to avoid these problems, even though they both implement the same “general idea”.
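A toy sketch of keeping the two functions separate. Frames here is a made-up stand-in, not DataFrames.jl, and its Table type is just a dictionary of columns; the only point is that the copying and non-copying behaviours live in different function objects, so neither can be mistaken for the other.

```julia
# Hypothetical stand-in for a table package whose select copies columns.
module Frames
    struct Table
        columns::Dict{Symbol,Vector{Int}}
    end

    select(t::Table, names::Symbol...) =
        Table(Dict(n => copy(t.columns[n]) for n in names))
end

# A separate function with the same "general idea" but no copying.
module My
    import ..Frames

    # This is My.select, not a method of Frames.select, so it is not type piracy.
    select(t::Frames.Table, names::Symbol...) =
        Frames.Table(Dict(n => t.columns[n] for n in names))
end

t = Frames.Table(Dict(:a => [1, 2], :b => [3, 4]))
copied = Frames.select(t, :a)   # new column vector
shared = My.select(t, :a)       # same column vector as in t
```

A reader who sees My.select in code knows which contract applies, without having to check which packages happen to be loaded.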

1 Like

This is not a problem, it is a feature that protects the user/programmer. The two names denote different things, and should not be automatically conflated. See this epic thread:

3 Likes

To solve the issue of having to manually import names, there is this julia-vscode issue:

https://github.com/julia-vscode/julia-vscode/issues/1925

I think it depends on the context. Some packages are designed to be used interactively by “end users”. I’d like to make things easy for them: they should be able to say using Foo and have things just work.

Right, but this is not the case I’m addressing. It’s more like the StatsBase.loglikelihood case. StatsBase is a pretty light-weight package, but suppose that weren’t the case. I think it’s natural to want a situation where users can have both StatsBase and my package loaded without a sudden change of behavior, and I wouldn’t have to depend directly on StatsBase.

What I’m suggesting for a case like this is that a new StatsNames.jl could contain loglikelihood. The “contract” would involve roughly what a log-likelihood is, and that there should be no type piracy.

What’s not needed is a spec for a particular sequence of arguments. Thanks to multiple dispatch, we can add whatever methods we like. We already do this all the time within a single package.

A couple of years ago I would have agreed with this. But this is a dynamic language with multiple dispatch. As long as we avoid type piracy, we can add new methods with wild abandon. I see a lot more risk in not allowing this sort of thing. We get locked in to particular argument types, making it hard for new ideas to take hold.

With multiple dispatch, we can have more of a “free market” approach. With multiple methods in a single package, some will become more widely-used than others. Those that take off can be adopted by other packages.

With apologies to @cpfiffer and @devmotion, here’s an example from AbstractMCMC.jl:

StatsBase.sample(
    [rng::Random.AbstractRNG,]
    model::AbstractMCMC.AbstractModel,
    sampler::AbstractMCMC.AbstractSampler,
    nsamples[;
    kwargs...]
)

Among the kwargs is a way to specify what type of result should be returned. To me that’s not very natural; I’d rather dispatch on the type. Luckily, I’m not bound to that API; I can instead define

function sample(rng::AbstractRNG, 
    ::Type{DynamicHMCChain}, 
    m::ConditionalModel,
    nsamples::Int=1000,
    nchains::Int=4)

In this case I used my own sample, but there would be no problem using the one from StatsBase, at least not that I can see.
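To show the "dispatch on the result type" idea in isolation, here is a toy sketch. The function name draw_samples and the two result types are hypothetical and have nothing to do with the real AbstractMCMC or StatsBase code; the draws are just standard normal numbers.

```julia
using Random

# Hypothetical result types.
struct VectorChain
    draws::Vector{Float64}
end

struct SummaryChain
    mean::Float64
    n::Int
end

# The requested result type is a positional argument, so dispatch selects
# the method instead of a keyword argument being inspected at run time.
function draw_samples(rng::AbstractRNG, ::Type{VectorChain}, nsamples::Int=1000)
    return VectorChain(randn(rng, nsamples))
end

function draw_samples(rng::AbstractRNG, ::Type{SummaryChain}, nsamples::Int=1000)
    xs = randn(rng, nsamples)
    return SummaryChain(sum(xs) / nsamples, nsamples)
end

rng = MersenneTwister(1)
full    = draw_samples(rng, VectorChain, 5)
summary = draw_samples(rng, SummaryChain, 10)
```

Each method's return type is visible from its signature, and a new result type can be supported by adding a method rather than editing an existing one.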

And again, the problem I’m trying to solve isn’t really a problem for me, but for the end user. I’d like to avoid the situation where using packages together suddenly changes what’s easily available in help (“?sample”) or methods, which can be especially confusing for beginners.

2 Likes