Suggestion: move DataFrames and plotting into the standard distribution

Coming from heavy R “data science” usage, I do find Julia to be much more consistent and better curated than R, which has become a bit overrun with weeds at this point. However, R is still a fair bit ahead in several ways, notably dplyr and ggplot2. Yes, we have DataFramesMeta.jl and Query.jl and Gadfly.jl, but they are not at the same level yet.

Until 0.7, and with DataFrames at version 0.11, things are going to be unsettled in Julia data land. For a class, it might be wiser to wait rather than frustrate a generation of new users? Sorry to say that…

Is it possible, since you “hate” R that you might find Julia frustrating as well? Are you looking for a GUI? Other than a curated list of packages, what do you need?

I don’t use JuliaPro, but I know that it’s published by JuliaComputing, the organization which employs most of the Julia core developers, so it’s about as official as it can get. I also don’t think you are correct about your inability to use the free version. From the FAQ:

Is the free version of JuliaPro allowed for enterprise / commercial use?

Yes, you can use JuliaPro to develop enterprise or commercial applications.

Speaking for my own use case:

  • I do not care which package is officially endorsed, nor am I arguing that the endorsed package for X can never change over time. I do care that there is one currently endorsed, recommended package (with some roll-off period if it gets deprecated). I cannot see the use of Julia for data analysis without stable, certified data-frame and plotting features.

  • I do care that when newbies want to plot a histogram, they can spend their time researching how to draw a histogram, not researching arguments about different histogram packages (some working, some no longer maintained, some never; plus flame wars and opinions). In the forum and in class, do we want to support every graphing package? If a student comes and says “I spent 5 hours learning xyzplot, but how do I do X?”, do I need to know xyzplot, too? How do I know which graphing package is stable and working? Do I need to research every option? (See the small sketch after this list for what the student-facing answer should look like.)

  • Data frames and graphics are as basic to student data analysis as the other packages mentioned (StatsBase, Distributions, GLM, etc.).

  • A GUI could be nice but it is not important.

  • A package manager that is clear in endorsements would help, but it’s not enough.

  • Even if many people use R not for its core functionality but for its packages, I am pretty sure that R’s usefulness would greatly diminish if it did not have a blessed data-set structure and a blessed plotting system. (Maybe the standard could be better. I would not mind deprecating R data.frames in favor of R data.tables. Maybe it is not so bad to have a basic system plus a better downloadable alternative. My point is that the choice of the standard is not what matters; the presence of a standard is.)
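
To make the histogram point concrete: with one blessed route, the student-facing answer is a single line. (The snippet below is only an illustration using Plots.jl; I am not claiming Plots.jl is or has to be the endorsed package.)

using Plots   # stand-in for whichever plotting package ends up being the blessed one

histogram(randn(1_000), bins = 30)   # the one line a student should have to learn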

I don’t want to take more time. I think I made my point, and the choice is up to you guys.

best,

/iaw

2 Likes

I am also a heavy R user, and I have been following the development around DataFrames from the viewpoint of an R programmer for a long time.

I think the main difference is that R is more a domain-specific ecosystem of packages plus a programming language, while Julia is a general-purpose language plus an ecosystem.

In R, data.frame is a first-class citizen and everyone expects all the relevant packages to support data.frames. Even the newer tabular data structures in R (data.table, tibbles) are almost perfectly backward compatible with data.frame.

In Julia, a flexible general-purpose language, DataFrame is just one of several possible solutions to the problem of working with tabular data.

It’s a trade-off between flexibility and convenience. I am not sure we can have both at the same time.

1 Like

Pkg3 will have a list of curated packages, similar to CRAN for R. These packages are supposed to comply with best practices and whatnot. As for recommended or endorsed packages, I don’t think languages usually have these. Julia used to have a list of featured or recommended packages, but I don’t know if they will bring that back. For now, asking users on the forums, and looking at tutorials and other material, will probably lead you to those.

Cross-posting: Multivariate OLS - #15 by ChrisRackauckas

it is a matter of where to draw the line.

and you draw the line at whatever is needed for

finance, economics, and statistics classes

But that’s quite arbitrary. MATLAB and Octave come with pretty terrible statistics support but that’s fine. However, if they pulled out the ODE solvers people would rebel because that’s such core functionality. So why wouldn’t you say that’s core functionality?

Because what’s core is personal, except for the extreme basics. Julia is trying to make Base be those extreme basics because otherwise everything is core to some segment of the population and you get bloat.

How do you handle it? You could just tell your class about the 5 or so packages to use. Or if it’s just stats, point them to JuliaStats. It really shouldn’t take more than 5 minutes to introduce the stats packages. If you want it pre-installed, as @aaowens suggests, just have them install JuliaPro which was created exactly for this purpose.
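
For instance, a course setup script can be as small as this (a sketch; the package list is just the usual JuliaStats/data suspects, swap in whatever the instructor actually wants):

# run once per machine; none of this requires leaving the REPL
for pkg in ["DataFrames", "CSV", "StatsBase", "Distributions", "GLM", "Plots"]
    Pkg.add(pkg)
end

using DataFrames, StatsBase, Distributions, GLM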

But you’re not going to convince anyone that the top 100 packages should go to the Julia Base repository, or even that the top 3 packages you care about should. Julia’s already been there, and what it does is the opposite of what you’re thinking. Packages in Julia Base cannot update regularly because they are tied to Julia releases. They are harder for contributors to jump into since there’s so much other code around them. They are harder to test because they are tested with the rest of Julia. In the end what it does is cause stagnation due to the inertia of larger repositories, while on its own DataFrames.jl is nimble and can release bugfixes almost instantly. Julia had a lot of stuff in Base and is getting leaner for this reason.

6 Likes

Completely agree.

I use Julia as a free Matlab and a modern Fortran. Matlab has very bad data-wrangling tools and in most cases I use arrays instead of data frames. I found Julia much more powerful than Matlab.

However, if people use Julia as an R alternative, they will find that Julia is not as mature as R.

So I keep the data cleaning and visualization parts in R and the numerical part in Julia.

2 Likes

This is great feedback (probably not news to many of you). Julia could easily become mainstream if these problems are solved.

Do you want to provide some specifics? What do you find convenient/easy in R when compared to Julia?

I think JuliaPro is indeed what I should consider the basics, not Julia. My mistake. See, even at an old age, one can still learn.

As for me, I use Perl for the data cleaning, not R.

The CSV-reading aspects of data frames in Julia still seem to be evolving. Maybe Julia will get there.

The plots in R are beautiful. And R has just two plotting systems, one of them in the box, both producing nice output. R also renders plots a lot faster, being interpreted rather than compiled, so there is no compilation pause before the first plot.

For someone who already knows R, the R+Julia strategy makes sense.

First, I still could not figure out how to connect to a remote PostgreSQL server from Julia (not sure if it is doable now). I rely on this step to retrieve my data.

Second, R has many packages that make it easier to get financial and economic data, for example:

https://cran.r-project.org/web/packages/tidyquant/vignettes/TQ01-core-functions-in-tidyquant.html

Third, the hadleyverse ecosystem is very easy to use and teach. For example, I personally think dplyr is easier and less verbose than DataFramesMeta; see

https://www.juliabloggers.com/data-wrangling-in-julia-based-on-dplyr-flights-tutorials/

Fourth, the integration of dplyr and ggplot2 via the pipe operator is very neat and saves a lot of time.
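
To give a flavor of the verbosity comparison in the third point, here is roughly how a small dplyr chain (filter, group_by, summarise) reads in DataFramesMeta using its @linq macro (a sketch with made-up column names; the exact syntax may shift between versions):

using DataFrames, DataFramesMeta

flights = DataFrame(carrier = ["AA", "AA", "UA", "UA"],
                    dep_delay = [10, -3, 25, 0])

# dplyr: flights %>% filter(dep_delay > 0) %>% group_by(carrier) %>% summarise(avg_delay = mean(dep_delay))
@linq flights |>
    where(:dep_delay .> 0) |>
    by(:carrier, avg_delay = mean(:dep_delay))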

1 Like

I’m working on it and getting closer to something usable :)

6 Likes

Great work towards that end @davidanthoff - FWIW the things that I notice most that keep me from the epic dplyr patterns I had in R are:

  • needing multiple @lets in a single query
  • the column handling in Query
  • window functions such as row_number, lead and lag

I know you’re working hard, for free, so I’m not complaining! Thanks!

Could you describe a bit more what you mean by that?

In general, I’m really not happy with the column handling right now in Query. I can fix all of that in a quite elegant way on Julia 0.7 with the native named tuples, so I’m kind of holding off on doing anything on that front until 0.7 is a real thing…

The multiple @lets I also know how to fix, but it is a complicated fix… I essentially need to get rid of my reliance on type inference. I have a strategy and started the work, but don’t expect anything anytime soon, I’m afraid.

I’ve also started thinking about window functions a bit, but still need to find a design that fits nicely into the philosophy of Query.jl…

And thanks for the kind words :)

My experience in bioscience is very similar: R’s built-in plotting and ubiquitous data frames hold everything together.

@ChrisRackauckas Well said, Sir.
Julia IS an evolving language. That’s what makes it so much fun to be involved with.
And just wait for the 1.0 release - I bet lots of organisations will surface and start using Julia.
So please let’s not have any bloat in Base. For instance, there might be some rapid development needed in DataFrames. As you say, this can be undertaken safe in the knowledge that other fields (terrible pun intended) will be unaffected.

Also, I would like to learn more about Pkg3.
The concept of having a set of ‘blessed’ packages is a good one.
I propose @TimHoly be elected Blesser in Chief. Nominative determinism at its best.

PS: I think the Julia packaging system is great. I know there are other similar systems out there. But as an old FORTRAN head who literally managed packages by putting cards into a hopper, and a long-term Perl user with endless runs of cpan to load packages, the Julia system is a joy.

For instance, I just looked at Tim Holy’s GitHub page and I’m now playing with ProgressMeter after a Pkg.add("ProgressMeter"), and I didn’t even have to exit the REPL.
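
For the curious, basic usage looks something like this (a minimal sketch; the sleep is just a stand-in for real work):

using ProgressMeter

@showprogress 1 "Crunching..." for i in 1:100
    sleep(0.02)   # pretend work; the macro draws and updates a progress bar in the REPL
end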

1 Like

This specific feedback is actually very useful.

Concerning verbosity, I’ve translated the same hflights tutorial to JuliaDB here and I find the syntax reasonably concise. Can you pinpoint specific cases that could be improved/specific suggestions on how to do so? As a caveat, you need a very recent version of JuliaDB and IndexedTables to run the tutorial as most syntax improvements are recent.

Concerning dplyr/ggplot2 integration via pipes (I’m not sure how that’d work exactly, as I’m not an R user), the closest I can think of is the @df macro from StatPlots, which is fully integrated with the Query/IterableTables framework (though I have some ideas to simplify the syntax even further when plotting from a @map statement, but I haven’t quite decided how; I’m curious what @davidanthoff has in mind). There is also GroupedErrors to make plots from data tables if you’re working with grouped data.
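
In case it helps, the @df macro looks roughly like this (a small sketch; the columns are invented and any Plots command can go on the right-hand side):

using StatPlots, DataFrames   # StatPlots re-exports Plots

df = DataFrame(dep_delay = 30 .* randn(200), arr_delay = 30 .* randn(200))

# columns of df are referred to by their symbols inside the macro
@df df scatter(:dep_delay, :arr_delay, xlabel = "Departure delay", ylabel = "Arrival delay")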

See this announcement, though I haven’t focused on Query integration yet (as ShiftedArrays and Query have different missing-data representations); I will think about it once there is convergence.
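
For reference, the basic lag/lead usage from that announcement looks like this (a sketch; the padding value used for out-of-range entries may differ between versions):

using ShiftedArrays

v = [1, 2, 3, 4]
lag(v)       # shifted by one: a padding value, then 1, 2, 3
lead(v, 2)   # shifted the other way by two: 3, 4, then two padding values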

I’d be curious to see how to add rownumber: what does it do exactly? Can you use it inside a groupby to get the row numbers of the group as computed inside the larger dataset?

2 Likes

Basically, yes. In dplyr, row_number() is the number of the row within a group, given your chosen sort order. In R, I can’t think of a place I’ve used it except for row_number() == 1, essentially to get the first member of a group. So an isfirst() would be even more useful. The lead and lag functions are useful for moving averages and deltas.

There is the opportunity to do better than R with Julia here, ultimately.

Edit: Correcting myself, I have also used row_number() to create an ID for an item within a grouping. So in R, I’d arrange(main_id, foo, date); group_by(main_id, foo); mutate(foo_id = row_number()). That’d permanently capture the order of foo’s for each main_id even after I left that grouping.
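
For the record, the closest I have come in Julia is something like the following with DataFrames (a rough sketch, column names purely illustrative; the sort/by keywords have been moving between versions):

using DataFrames

df = DataFrame(main_id = [1, 1, 2, 2], foo = ["a", "a", "b", "b"], date = [2, 1, 1, 2])

sort!(df, cols = [:main_id, :foo, :date])              # arrange(main_id, foo, date)
numbered = by(df, [:main_id, :foo]) do d               # group_by(main_id, foo)
    DataFrame(date = d[:date], foo_id = 1:size(d, 1))  # mutate(foo_id = row_number())
end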

Actually there are two things you can use instead in Julia (which I believe are just as convenient); you can find both in the Window functions section of this tutorial:

  1. select to choose the first few elements according to the sort order (here it’s reverse sorting by DepDelay). It’s actually a very smart function that can give the first n elements of the sorted vector without sorting the whole thing, and it accepts all of sort’s keyword arguments:
groupby(fc, :UniqueCarrier, select = (:Month, :DayofMonth, :DepDelay), flatten = true) do dd
    select(dd, 1:2, by = i -> i.DepDelay, rev = true) # select the first two elements of each sub-table according to the custom sort order
end
  2. ordinalrank from StatsBase can give you the ranking according to some custom sort order (other rankings are also available): see docs.
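
For example (a tiny sketch, results worked out by hand):

using StatsBase

delays = [10, -3, 25, 10]
ordinalrank(delays)    # [2, 1, 4, 3]: distinct ranks by ascending order, i.e. a within-vector row number
ordinalrank(-delays)   # rank by descending delay instead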

More generally, I think we really need to start porting R data-wrangling tutorials/vignettes to Julia, as a lot of the functionality is there but not everybody knows about it.

Crummy Excel spreadsheets also make referring to the first row useful, even if only for deleting it.