Suggestion: move DataFrames, plotting into standard distribution

Dear Development Team, I am not an expert user. I am a professor of finance and economics, teaching moderately sophisticated statistics-related analysis. R is our go-to package, and I hate it. I love a lot of what I see in Julia. I am working on a document to help facilitate adoption.

I want to suggest that you consider elevating two packages, one implementing data frames and one implementing graphing, to first-class status, so that they are integral to Julia and ship with it. They should be blessed as official, maintained, and supported, with a path forward. (They do not have to be feature rich.)

To us, especially when it comes to displacing R, these are core functionalities of a statistical programming language, and it works badly when everyone searches for and ends up using a different package. It then requires a lot of reconciling across users who have stumbled onto different packages, some of which break and some of which get abandoned as their developers move on.

Thank you all for the amazing work you are doing. (And I thank all the folks who have been answering my questions on this forum, without whom I could never have undertaken my attempt to switch.)

regards,

/iaw

It’s certainly reasonable to want things like plotting to “just work”, but I personally would not want to see it built into Julia. One of the best features of Julia is that code in packages can be just as performant and powerful as code in Base, which means that we can have a nice modular ecosystem from which users can choose the pieces they actually need.

On the other hand, I totally agree that there’s a reasonable use case for a more batteries-included distribution. Fortunately, we already have https://juliacomputing.com/products/juliapro.html (which includes a curated set of packages for things like plotting and stats). Does that not already accomplish what you’re asking for?

From what I understand, the new package manager will have a “curated” list to indicate that a package is well maintained and has responsive developers. That will certainly help prevent users from choosing obscure and abandoned packages.

If you need a plotting package to plot from a DataFrame, both StatPlots and Gadfly should be quite good. I’m not sure if StatPlots is part of JuliaPro though.

As a very frequent user of DataFrames and plotting I think it would make absolutely no sense to move them to Base. I don’t even see them as being “core functionality” in any sense. At least DataFrames are really rather domain specific.

I do think the overall ecosystem problem you are describing is, fortunately, in the process of getting much better.

  • The current package manager (in 0.6) is so temperamental it gets broken if you look at it the wrong way. I would imagine that a lot of the people you describe as turning away from Julia after winding up with a broken package did so simply because the package manager wasn’t even fetching the most recent version for them, on account of some dependency.
  • The DataFrames package itself is now in great shape (in my opinion, no I’m not a committer, at least as far as I can remember). This is a relatively recent development with the introduction of Missings and some great performance work that has been done. It is also beautifully generic: columns need only be AbstractVector objects, which allows you to do all sorts of wonderful and crazy things. Also DataFramesMeta was resurrected not that long ago, and is a wonderful tool for doing query-like stuff. I think that it will just take some time for things to coalesce around this new DataFrames but that will definitely happen and we are well on our way. Once 0.7 is released the performance of Missings should be improved significantly as well (though I have often found it to be surprisingly good even in 0.6).
  • IndexedTables is a separate effort with a slightly different base use case than DataFrames. It sees a lot of development which I think is mostly focused on 0.7. It might seem a little confusing to have two packages like this, but both are quite nice so I see this as more of a benefit than a problem.
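To make the Missings point a bit more concrete, here is a minimal sketch of how missing values behave in a plain vector. It assumes Julia 0.7 or later, where `missing` and `skipmissing` are available without installing anything (on 0.6 the Missings package provides the equivalent); the vector `x` is just an illustrative name.

```julia
using Statistics   # standard library, provides `mean`

# A column with a missing entry; the element type becomes a small Union.
x = [1.0, missing, 3.0]

@show eltype(x)               # a Union of Missing and Float64
@show sum(x)                  # missing values propagate through reductions
@show mean(skipmissing(x))    # skip them explicitly to get a number back
```

Since a DataFrame column only needs to be an AbstractVector, a vector like `x` can serve as a column as-is.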

The plotting situation is worse I think. The real issue there is that everything is unfathomably slow because of excessive code generation. This won’t be remedied by moving anything to Base (in fact it’s a reason why that code should most definitely not be in Base). Again, I think it will just take people some time to fix the existing problems. I know it’s being worked on, but while DataFrames is in great shape already at this moment, plotting will have to wait a bit.

I also teach finance with R. I am curious about why you hate R. :grinning:

I agree that DataFrames and the basic Plots should be moved into stdlib.

I agree with @ExpandingMan that both DataFrames and IndexedTables are very nice packages and wanted to point out that StatPlots allows you to plot from either with exactly the same syntax.

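using StatPlots, RDatasets            # RDatasets is an assumption here: one common source of `iris`
iris = dataset("datasets", "iris")    # loads the iris data as a DataFrame
@df iris scatter(:SepalLength, :SepalWidth, group = :Species)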

will work regardless of whether iris is a DataFrame or an IndexedTable. Hopefully this will be the case for most packages in the future (it already is to some extent thanks to the IterableTables package).

I hope I am not taking too much time.

  • I do not mean to include this functionality in Base.

  • I do mean bundling some packages with the Julia download, and guaranteeing activity (bug fixes and availability) for at least three more years. If something much better comes along, the bundled package can be deprecated and replaced after a rollover period.

  • A curated list will indeed help, especially if a newbie can see what the Julia team suggests. Someone who wants to learn how to plot does not want to learn about all the possible ways to plot. They want to plot. They want a “go-to” package. (Sort of like the STL with C++, or the set of core modules in Perl (Perl core modules - Perldoc Browser) or … .)

  • I did not even know about JuliaPro. Is JuliaPro blessed? I also do not like the license. As a university, I cannot use “home,” and as a university I cannot pay $1,550 per student per year.

  • I hate R primarily because finding errors is problematic, and secondarily because the language does not attempt to improve on oodles of inconsistencies, many but not all arising from the fact that it does not know whether it wants to be interactive or batch. For example, why does df$ab not give an error “myprogram:32: reading undefined column ab in df”? Why is subset(d, true, select= -c("ab")) not working? My impression is also that the R community has become a fairly unfriendly environment, with slowing improvements. On the plus side, when R is working well, it can be beautiful. If Julia were not on the way, I would probably describe R as the best thing around.

  • I do hope/believe that one of Julia’s long-term goals is to offer a substitute for R. This means offering many of R’s core facilities, and both data frames and plotting are in R’s core systems. Without direct alternatives, that outcome becomes less likely.
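For comparison with the df$ab complaint above, here is a minimal sketch of what asking for a nonexistent column looks like in Julia. It assumes a reasonably recent DataFrames.jl is installed (the `df[!, name]` indexing form is not available in old releases), and the helper `column_or_error` is a hypothetical name introduced only for this illustration. The exact exception type has varied across DataFrames versions, so the sketch only prints it.

```julia
using DataFrames   # assumed to be installed; not part of the standard distribution

df = DataFrame(a = [1, 2, 3])

# Hypothetical helper: return the column, or the exception raised while fetching it.
function column_or_error(df, name::Symbol)
    try
        return df[!, name]
    catch err
        return err
    end
end

result = column_or_error(df, :ab)   # :ab is not a column of df
println(typeof(result))             # an exception type, not a silent empty value
println(:ab in propertynames(df))   # explicit membership check before access
```

The point of the sketch is only that the failed lookup surfaces as an error at the place where it happens, rather than as a value that flows onward.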

/iaw

We do kind of need a place where a newcomer can come in and “get a lay of the land” so to speak. For example, anybody working with data frames should immediately know about the new DataFrames with missings, and right now I’m not so sure that would happen. This is in spite of the fact that a few curated lists already exist (see this great website).

I just noticed this page on the Julia website. I think this is a really great start and it would be awesome if we could greatly expand it. Putting information directly on the main Julia website, which still seems to be the first thing the average person will find when looking for information on Julia, seems the best possible option.

There have been many discussions of which functionality should be moved from Boost to the STL in C++. For Julia, this discussion is also interesting from a programming-language (PL) design perspective.

From an educational perspective, I think it is easier for students to try some basic Julia code (data wrangling and graphing) without having to install any packages. If you want your students to get a feel for Julia without having to download it, JuliaBox could be a good starting point.

JuliaPro has a free version.

R core does have a lot of nasty features. Most people use R for its package ecosystem instead of the basic functions.

The stdlib collection typically includes only libraries that are critical or are dependencies of Base (e.g., LinearAlgebra, Random, Dates, Test). For example, Dates used not to be a stdlib but was added later on (mostly because many packages use it as a dependency and the language added the structs). A good candidate for a new stdlib is Missings, which provides a very general-purpose struct. DataFrames, however, is part of the JuliaData organization, which is doing a good job of maintaining it. The JuliaStats ecosystem has a few packages that also provide a nice toolkit (StatsBase, Distributions, StatsModels, GLM).
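As a rough illustration of what being a stdlib means in practice, the sketch below loads the four libraries named above on a stock install, with no package installation step. It assumes Julia 0.7 or later, where these are separate standard-library modules; the small checks are only examples chosen for this sketch.

```julia
using Dates, LinearAlgebra, Random, Test

# Dates: month arithmetic clamps to the last valid day of the month.
@test Date(2018, 1, 31) + Month(1) == Date(2018, 2, 28)

# LinearAlgebra: determinant of a small diagonal matrix.
@test det([2.0 0.0; 0.0 4.0]) == 8.0

# Random: reseeding the generator makes a draw repeatable.
Random.seed!(42); a = rand(3)
Random.seed!(42); b = rand(3)
@test a == b
```

Anything outside this set, DataFrames and the plotting packages included, has to be added through the package manager first.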

In the case of R, data frames are provided by the language itself, but they really suck (plotting too)! That is why one uses tidyverse and data.table. The good parts of data, stats, and plotting are provided on top of base. Julia needs to improve its plotting critically and keep improving data and stats, but moving those ecosystems into Base will by no means guarantee better development.

Coming from heavy R “data science” usage, I do find Julia to be much more consistent and better curated than R, which has become a bit overrun with weeds at this point. However, R is still a fair bit ahead in several ways, notably dplyr and ggplot2. Yes, we have DataFramesMeta.jl and Query.jl and Gadfly.jl, but they are not at the same level yet.

Until Julia 0.7 arrives, along with version 0.11 of DataFrames, things are going to be unsettled in Julia data land. For a class, it might be wiser to wait rather than frustrate a generation of new users. Sorry to say that…

Is it possible, since you “hate” R, that you might find Julia frustrating as well? Are you looking for a GUI? Other than a curated list of packages, what do you need?

I don’t use JuliaPro, but I know that it’s published by JuliaComputing, the organization which employs most of the Julia core developers, so it’s about as official as it can get. I also don’t think you are correct about your inability to use the free version. From the FAQ:

Is the free version of JuliaPro allowed for enterprise / commercial use?

Yes, you can use JuliaPro to develop enterprise or commercial applications.

Speaking for my own use case:

  • I do not care which package is officially endorsed, nor do I argue that the endorsed package for doing X cannot change over time. I do care that there is one currently endorsed, recommended package (with some rolloff if it gets deprecated). I cannot see Julia being used for data analysis without stable, certified data frame and plotting features.

  • I do care that when newbies want to plot a histogram, they can spend their time researching how to do a histogram, and not researching arguments about different histogram packages (some working, some no longer working, some never working, plus flame wars and opinions). In the forum and in class, do we want to support every graphing package? If a student comes and says “I spent 5 hours learning xyzplot, but how do I do X?”, do I need to know xyzplot, too? How do I know which graphing package is stable, working, etc.? Do I need to research all the options?

  • Data frames and graphs are as basic to student data analysis as the other packages mentioned (StatsBase, Distributions, GLM, etc.).

  • A GUI could be nice but it is not important.

  • A package manager that is clear in endorsements would help, but it’s not enough.

  • Even if many people use R not for core functionality but for packages, I am pretty sure that R’s usefulness would greatly diminish if it did not have a blessed data-set structure and a blessed plotting system. (Maybe the standard could be better. I would not mind deprecating R dataframes in favor of R datatables. Maybe it is not so bad to have a basic system and a better downloadable alternative. My point is that the choice of standard is not important. The presence of a standard is important.)

I don’t want to take more time. I think I made my point, and the choice is up to you guys.

best,

/iaw

I am also a heavy R user, and I have been following the development around DataFrames from the viewpoint of an R programmer for a long time.

I think the main difference is that R is more a domain-specific ecosystem of packages and a programming language while Julia is a general-purpose language and an ecosystem.

In R, data.frame is a first-class citizen, and everyone expects all the relevant packages to support data.frames. Even the new tabular data structures in R (data.table, tibbles) are almost perfectly backward compatible with data.frame.

In Julia, a flexible programming language, DataFrame is just one of several possible solutions to a problem with tabular data.

It’s a tradeoff between flexibility and convenience. I am not sure we can have both at the same time.

Pkg3 will have a list of curated packages, similar to CRAN for R. These packages are supposed to comply with best practices and whatnot. As for recommended or endorsed packages, I don’t think languages usually have these. Julia used to have a list of featured or recommended packages, but I don’t know if they will bring that back. For now, asking other users and consulting forums, tutorials, and other material will probably lead you to those.

Cross-posting: Multivariate OLS - #15 by ChrisRackauckas

“it is a matter of where to draw the line.”

And you draw the line at whatever is needed for “finance, economics, and statistics classes”.

But that’s quite arbitrary. MATLAB and Octave come with pretty terrible statistics support but that’s fine. However, if they pulled out the ODE solvers people would rebel because that’s such core functionality. So why wouldn’t you say it’s a core functionality?

Because what’s core is personal, except for the extreme basics. Julia is trying to make Base be those extreme basics because otherwise everything is core to some segment of the population and you get bloat.

How do you handle it? You could just tell your class about the 5 or so packages to use. Or if it’s just stats, point them to JuliaStats. It really shouldn’t take more than 5 minutes to introduce the stats packages. If you want it pre-installed, as @aaowens suggests, just have them install JuliaPro which was created exactly for this purpose.

But you’re not going to convince anyone that the top 100 packages should go to the Julia Base repository, or even that the top 3 packages you care about should. Julia’s already been there, and what it does is the opposite of what you’re thinking. Packages in Julia Base cannot update regularly because they are tied to Julia releases. They are harder for contributors to jump into since there’s so much other code around them. They are harder to test because they are tested with the rest of Julia. In the end what it does is cause stagnation due to the inertia of larger repositories, while on its own DataFrames.jl is nimble and can release bugfixes almost instantly. Julia had a lot of stuff in Base and is getting leaner for this reason.

Completely agree.

I use Julia as a free Matlab and a modern Fortran. Matlab has very bad data wrangling tools, and in most cases I use arrays instead of data frames. I have found Julia much more powerful than Matlab.

However, people who use Julia as an R alternative will find that Julia is not as mature as R.

So I keep the data cleaning and visualization parts in R and the numerical part in Julia.

This is great feedback (probably not news to everyone). Julia could easily become mainstream if these problems are solved.

Do you want to provide some specifics? What do you find convenient/easy in R when compared to Julia?

I think JuliaPro is indeed what I should consider the basics, not Julia. My mistake. See, even at an old age, one can still learn.

As for me, I use Perl for the data cleaning, not R.

The CSV-reading side of data frames in Julia still seems to be evolving. Maybe Julia will get there.

The plots in R are beautiful. R has just two plotting systems, one of them in the box, and both produce nice output. R also builds plots a lot faster, being interpreted rather than compiled.

For someone who already knows R, the R+Julia strategy makes sense.

First, I still have not figured out how to connect to a remote PostgreSQL server from Julia (I am not sure whether it is doable now). I rely on this step to retrieve my data.

Second, R has many packages that make it easier to get financial and econ data. For example:

https://cran.r-project.org/web/packages/tidyquant/vignettes/TQ01-core-functions-in-tidyquant.html

Third, the hadleyverse ecosystem is very easy to use and teach. For example, I personally think dplyr is easier and less verbose than DataFramesMeta, see

https://www.juliabloggers.com/data-wrangling-in-julia-based-on-dplyr-flights-tutorials/

Fourth, the integration of dplyr and ggplot2 via pipelines is very neat and saves a lot of time.