How can we create a leaner ecosystem for Julia?

Albert_Zevelev · January 3, 2020, 1:35am

Comparisons of Julia w/ other languages often say one disadvantage of Julia is that it has fewer 3rd party packages.
I tend to disagree. Julia doesn’t need more packages, it needs better packages (w/ more functionality).

For example, two recent posts about time-series (here) & (here) highlight the amount of redundancy & lack of cooperation between Julia developers.

Consider some packages listed in the posts & others I found on my own:
TSAnalysis, ARFIMA, StateSpaceModels, ARMAProcesses, TimeSeries, TimeModels (unmaintained), ARCHModels, Cointegration, FactorModels and thesis, VARmodels, VectorAutoregressions, TSML, ForecastingCombinations, ForecastEval, Financial Risk Forecasting (textbook), VAR with SV, Hamilton Filter & Dynamic Factor, RARIMA, SMC, Temporal, Indicators, QuantEcon, Creel Econometrics, Paul Soderlind, and many others.

Question 1: if I’m using/developing a Julia program, what is an easy way to find all that currently exists in my domain?
A simple GitHub search for “time-series” (Language=Julia) would miss most of the packages I found above. For example, it would miss the Master’s thesis, which GitHub classifies as .tex.

Question 2: like most users, I’d prefer a very small number of carefully optimized, regularly maintained time-series libraries w/ minimal redundancy. How can we increase cooperation to develop a leaner Julia package ecosystem?

PS: I’m only using TS & forecasting as one example of a domain that can be cleaned up

xiaodai · January 3, 2020, 1:44am

~~Perhaps invest a few million$ and hire some devs to make your dream come true?~~
Seems like good points.

Sorry, @Albert_Zevelev and to the community for being rude. I messed up.

Roger-luo · January 3, 2020, 1:51am

I’m wondering if we should provide some standard keywords in the Pkg registry, like what they do in pip (pip itself might be bad, but I like the keyword idea). To register a package in General, the author would have to choose at least one keyword, so we could always find all registered packages for a given domain from the project configuration itself instead of a loose GitHub tag (or maybe the package is not hosted on GitHub at all!).
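
The keyword idea can be sketched roughly as follows. This is only an illustration, written in Python for brevity: the `REGISTRY` mapping, the keywords assigned to each package, and the helper names `build_keyword_index` and `find_packages` are all hypothetical, and nothing like this exists in Pkg or the General registry as far as this thread establishes.

```python
from collections import defaultdict

# Hypothetical registry entries: package name -> author-chosen keywords.
# The keywords below are made up for illustration only.
REGISTRY = {
    "TSAnalysis": ["time-series", "forecasting"],
    "ARCHModels": ["time-series", "volatility"],
    "StateSpaceModels": ["time-series", "state-space"],
    "DataStructures": ["data-structures"],
}


def build_keyword_index(registry):
    """Invert package -> keywords into keyword -> sorted package names."""
    index = defaultdict(list)
    for package, keywords in registry.items():
        # Mirror the proposed rule: at least one keyword per registered package.
        if not keywords:
            raise ValueError(f"{package} must declare at least one keyword")
        for keyword in keywords:
            index[keyword.lower()].append(package)
    return {keyword: sorted(packages) for keyword, packages in index.items()}


def find_packages(index, keyword):
    """Return every registered package that declares the given keyword."""
    return index.get(keyword.lower(), [])


INDEX = build_keyword_index(REGISTRY)
print(find_packages(INDEX, "time-series"))
```

Because the lookup runs against registry metadata rather than a hosting site's tags, it would not matter where a package's repository lives.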

cpfiffer · January 3, 2020, 1:52am

For question 1:

Just look up whatever your specific tool is. If you want Kalman filtering or smoothing, just search for that. Even in R, you don’t really just Google “time series” to find the tools you want – time series analysis is a bunch of distinct methods.

You do have a good point, which is that package maintainers could consider adding time-series as a tag as well, which I think is a good search term. It’s a way to get a little more exposure for free.

For question 2:

Some might think this is a joke comment, but in some sense it’s quite true – pandas in Python benefited a lot from corporate support and firm investment. Most of the packages cited above are hobby projects for people who do other things and are not out to make the world’s greatest time series package, because that is really hard to do and it requires a tremendous amount of expertise from people with very high opportunity costs.

There’s not going to be anything like what you have in mind until there’s more active investment from third-parties.

It’s not really all doom-and-gloom though – if you are interested in taking an active role yourself, you should reach out to package maintainers and discuss common time-series-specific design choices that could be consolidated (much like StatsBase or StatsModels). This is a lot of work given the number of packages across different toolkits, but if done correctly it could be quite valuable for the ecosystem.

xiaodai · January 3, 2020, 1:59am

Open source funding is a classic economics problem! It’s similar to the free-rider problem. But it’s amazing that open-source software exists at all!

I wish someone was funding my open source efforts so I can quit my job.

ChrisRackauckas · January 3, 2020, 2:02am

I think the “we” is kind of the issue here. You never really get a large set of packages playing nicely together with just a “we”. Usually it needs one person to really drive such a project and inject some opinion as to the “right” way to do things. This takes a lot of time and effort, but if you look at the larger projects, you see that there’s usually one person who was the full time driver for the first 5+ years. As much as I like them, community efforts aren’t very efficient at building cohesion. Everyone wants to contribute a little piece, but someone has to drive it.

viralbshah · January 3, 2020, 2:28am

@xiaodai This comment is unhelpful and rude. It could literally be the answer to any question about Julia. It is my gentle request to modify it to be positive and helpful.

viralbshah · January 3, 2020, 2:38am

I feel that you need a core of people driving something. I don’t think one is the right number, since it is very easy to get demotivated or go down the wrong path. Usually, 2-3 people working together in my opinion produce high quality packages. But yes, you certainly can’t get high quality packages starting with a crowd.

At the same time, it is not easy to centrally plan this. We do not know which particular person or set of people is going to get it right, stick with it, and build the community. Thus sometimes multiple efforts are valuable in exploring the design space.

In this particular case, I can imagine that someone trying and writing a blog post about the state of various time series packages might be a good start.

-viral

anon92994695 · January 3, 2020, 2:51am

I’m not super experienced as a developer. But, at work when this sort of stuff started happening someone voiced concerns like this, and then someone with some skills/leadership qualities started looking at the “mess” and seeing patterns that could unify the efforts.

I think with some of these more jumbled spaces we could start to see this.

This is the battle of scaling. I have to admit though, @ChrisRackauckas is right. It’s usually one person fearlessly/obsessively cutting a path. It does take a small team, or a single hardheaded person who thinks they can handle it themselves, to make rapid and stable progress. There are downsides to this though: sometimes code bases become illegible, or tainted by confusing design patterns, and then no one can contribute. So solving the problem in 2019 becomes a dead end in 2021 or, more optimistically, 2025, when new things arrive.

I’m planning an experiment with a different type of open source workflow… But, if it falls down to a solodev effort that’s okay too :). I think Julia is the right modality for it, but we’ll see.

I also agree - it doesn’t take money. It takes necessity, interest, and passion. Julia allows for extreme modularity, and we can leverage this by suggesting successful design practices. That’s why Julia has such an amazing backbone: those things alone.

A good start is making documentation available, and the efforts of others as available as humanly possible. The rest will happen naturally over time.

fipelle · January 3, 2020, 3:45am

Thank you for referencing TSAnalysis.jl.

I think you highlighted some interesting points. My view is the following.

As far as the time-series field is concerned, the number of registered packages that are compatible with Julia 1 and have unit tests is rather small (a subset of the above).

I am toward the end of my doctorate (one year and a half left) and my research is mostly on time series. TSAnalysis.jl is still preliminary, but I will consistently add new features (see this link for more details). I plan to cooperate with other developers (co-authors and externals) and I am trying to avoid overlaps whenever possible. However, I also want to have control over the most primitive part of my package. It might be a personal limitation, but I aim to:

have enough security to be confident in using TSAnalysis.jl as a basis to write academic papers;
define a solid and modular layout that allows for regular updates.

I suspect that different maintainers might have similar perspectives. Of course, this might create overlaps. That said, users generally tend to concentrate around the most efficient and friendly packages. Git often pushes the most used.

Tamas_Papp · January 3, 2020, 7:44am

It is understood that users would prefer this. But, at the same time, it is very likely that this would happen gradually and organically, and there is very little we can do to speed up the process, other than contributing.

The primary reason for this may be that Julia is a very new language with an unprecedented combination of features (notably parametric types, multiple dispatch, and AOT compilation). Providing some functionality with a performant and well-designed interface is usually more involved than simply porting equivalent libraries from other languages.

Consequently, a lot of packages are experimental, exploring the design space. They may turn into a polished library, get merged when the time is ripe, or abandoned when the author(s) lose interest.

Navigating this situation is not easy. It is not uncommon that one has to look at multiple packages before finding an ideal solution. This is how I usually do it:

Search these forums, GitHub, and GitLab, possibly with multiple combinations of keywords, then make a shortlist of 1-3 packages, and evaluate them based on

recent activity (especially for issues and pull requests: are the authors responsive?)
documentation quality (ideally, there is some documentation, or at least docstrings)
look at the source code and unit tests: if they are organized, tidy, and well-documented, the package is more likely to be something the authors intend to maintain. Also, well-maintained code makes it easier to contribute, or potentially continue working on the package if the original authors don’t have time.
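
The checklist above can be turned into a rough ranking of a shortlist. This is only a sketch, written in Python for illustration: the `Candidate` fields, the one-point-per-criterion scoring, the 365-day staleness cutoff, and the package names are all assumptions made for this example, not something proposed in the thread. Reading the source and tests still has to be done by hand.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One shortlisted package, described by the criteria listed above."""
    name: str
    days_since_last_activity: int  # most recent commit, issue reply, or PR review
    has_documentation: bool        # docs site, or at least docstrings
    has_unit_tests: bool


def score(candidate, stale_after_days=365):
    """Award one point per satisfied criterion (equal weights are an assumption)."""
    points = 0
    if candidate.days_since_last_activity <= stale_after_days:
        points += 1
    if candidate.has_documentation:
        points += 1
    if candidate.has_unit_tests:
        points += 1
    return points


def rank(candidates):
    """Sort by score, highest first; break ties by more recent activity."""
    return sorted(candidates, key=lambda c: (-score(c), c.days_since_last_activity))


# Hypothetical shortlist; the names and numbers are placeholders.
SHORTLIST = [
    Candidate("PackageA", 20, True, True),
    Candidate("PackageB", 800, True, False),
    Candidate("PackageC", 45, False, True),
]
RANKED = [c.name for c in rank(SHORTLIST)]
print(RANKED)
```

A score like this only narrows the field; it says nothing about whether the design of a package fits the problem at hand.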

Albert_Zevelev · January 3, 2020, 9:59pm

Thank you for your comments! I’ll discuss the first (easier) question:
Q1: if I’m using/developing Julia, what is an easy way to find all code in my domain?

1 From (here) & (here) it is clear that Julia users are having trouble finding code in their domain.
I wouldn’t have found many of the 20+ links I posted searching GitHub or Google. I knew about Financial Risk Forecasting, QuantEcon, Creel, and Soderlind before I heard of Julia.

2 @Roger-luo suggests standard keywords in the Pkg registry like in pip.

3 May I suggest a pinned post @ the top of each domain in Discourse?
For example, in the Statistics category description:
(3.A) we can post links to time-series code (including the 20+ links above), and let users comment which packages are missing. When someone has an announcement about a new package, they can add their package to the list as well.
(3.B) users can post desired functionalities for that domain

Similarly for Data, Finance/Economics, Astro/Space etc.

I’m personally interested in all ML packages for Julia & it’s harder to find things than you’d expect.
This resource would be great for developers working on ML interfaces such as MLJ.

4 Individuals who have attempted to track Julia packages in various domains have incomplete lists & rarely update them (here, here, here).
Hence, this resource is more likely to be maintained in an official location (such as Discourse).
Perhaps put someone in charge of overseeing each list & they can pass the baton to someone else after a year?

ChrisRackauckas · January 3, 2020, 11:03pm

Why not use pkg.julialang.org?

chakravala · January 3, 2020, 11:32pm

This statement makes a fundamental mistake in understanding open source software.

The reason why so many packages exist is not because of a meme like “need more packages”. It is because somebody wanted to do some programming and create something for whatever reason, and they decided to share the result of their work online for free. That person who made that piece of code does not have the responsibility of organizing it into a bigger framework.

The way I see it, there are people who make things available for free, and either you like that code they made or you don’t. By saying that you’d prefer a leaner ecosystem, you are implying that you’d rather not have people share their code for free. It is not the responsibility of people making code available for free to make it available in such a way that makes you happy.

It would certainly be nice to make a more unified and coherent package ecosystem, but it is not anyone’s responsibility to do that unless there is some kind of incentive to motivate it for those people.

There wasn’t necessarily any incentive for the developers of those packages to organize their efforts. They may have been able to accomplish their own goals without satisfying your goal. One way to overcome this would be either do the organizational work yourself or to hire the developers to do it.

People working for free should not have any expected responsibility for package cohesion, unless they have some sort of incentive or motivation based on their own work or interest or funding.

Of course, it would be good to encourage a more cohesive package ecosystem, but one cannot create an expectation that random strangers on the internet (who are working for free) do this for you.

oxinabox · January 3, 2020, 11:59pm

I feel like while this is technically true, this observation doesn’t fundamentally change as much as might first appear.

And the reason it doesn’t change much is as you say in:

People making open source packages do so from some motivation.
And I believe: that in most (but not all) cases that same motivation also favors cohesion.
As a second statement to that: as a rule the majority of people who do not have an interest in promoting cohesion are not reading this thread.

I maintain DataStructures because I feel it’s good for the world, and similarly even today I was working through making changes to help make it more coherent with other packages (in today’s case the standard library). Because I think that is good for the world.
I make PRs to StatsBase and other statistical packages because when they are out of agreement it breaks my things. I also make PRs to the same packages because they are broken and them being broken breaks my things.
I have released tools that follow an API, and made PRs to help follow that API because I want to use that API in many places.
I generally release open source packages, and discuss broader picture ecosystem improvements because I want this community to grow.

So my motivations for open sourcing and for promoting cohesion come from the same place.
And I don’t think I am atypical in this.

There are exceptions to this.
E.g. some people have made it abundantly clear that open sourcing their code is as far as they are willing to go without payment, and will refuse all requests to make it work well with other packages without payment.
Which is reasonable and their right.
Similarly, some people have made it clear that supporting code as part of a cohesion effort is something they are not willing to take into their package (and thus maintenance burden) until it is proven and widely adopted. Again reasonable and their right.
But I think this is not the overall standard.

In general, saying that you can’t ask volunteers to do something, because they are working for free, is a nonstarter of an argument.
I personally, and I believe others also, would be happy to take issues on my packages if someone said there was a way I could change them to allow for a more coherent package ecosystem.
And I absolutely have put time into both creating packages and working out logistics and such to allow for coherence.
It’s hard, and a lot of it is ongoing, but hard things are worth doing sometimes.

We definitely now do have some great meta-packages, like JuMP, MLJ, Plots.jl etc.
And to get there we have to try.

chakravala · January 4, 2020, 12:03am

I am in agreement and would also be interested in a more cohesive environment and I would be open to look at those issues also… provided it aligns with my interests and availability.

The purpose of my post was to point out the flaw in the original statement, and why in general it is a flawed statement, even if there are people who are willing to work on these things for free at times.

Albert_Zevelev · January 4, 2020, 12:12am

@ChrisRackauckas I wouldn’t be able to find many of the links w/ TS code there.

Albert_Zevelev · January 4, 2020, 12:50am

1 I’m writing from the perspective of someone who wants to see Julia thrive. Else, I wouldn’t spend time collecting TS packages & posting this note (which I happily do for free).

2 I’ve noticed community culture is important & contagious. People on this forum have been very helpful to me & consequently I’m motivated to help others if I can.

3 From my links to code above we see there is a huge amount of redundancy for time-series.
From my links to posts on Discourse we see at least some of this redundancy is from a lack of awareness about other packages.

4 As a user, I’d prefer one TS package w/ many of the functions from the other 20+ packages.
That’s my preference, I don’t expect anything from anyone.

5 I’m writing a program to train an ML model.
As a developer, I want to write a program that is easy for others to use & nicely fits into interfaces. I will consider adding my model to a larger existing package, b/c I believe it’s better to have more algorithms in fewer packages.
There are many judgement calls for me to make & I wish there was more centralized organization in the Julia community.
I don’t care if we all drive on the left side or the right side, as long as we drive on the same side.

@ChrisRackauckas hit the nail on the head when he wrote:

It would be nice if other domains, such as ML and TS, were similarly organized.

tbeason · January 4, 2020, 2:45am

I think the one-size-fits-all package idea is generally sub-par. There is always a problem of where you draw the line on what belongs in the package and what does not. In the time series context, what about state space models, Bayesian analysis, bootstrap resampling, etc… All are very frequently used alongside time series methods, yet also frequently used outside of it. I like that these are separate pieces.

A lot of the packages listed in the OP are probably not created with the intent of becoming “THE” time series analysis package in Julia. So the fact that they are perhaps hard to find is not an issue. I don’t mind if some guy’s replication files for his thesis don’t show up on page 1 of my search results for “julia time series”.

I would say one issue is that we do not have an easy way to signal the “quality” of the package. @Tamas_Papp has the right idea but as he says it is not easy. Maybe it could be made easier?

Tamas_Papp · January 4, 2020, 8:30am

I am also a fan of modular, nicely interoperating packages that do one thing well (forming a flexible “toolbox”), as opposed to a single umbrella library (“suite” or “toolkit”) that tries to do everything. People coming from other languages where such packages are the norm often find the Julia ecosystem too confusingly diverse, but it works rather well. Cf
