How does DataFrames.jl compare to R's? And interoperability between R and Julia

Congratulations on the new package. Feel free to state how it (and Julia in general) compares to R, or at least these specific questions.

a) How does the solution compare to R’s? I mean, should I keep refraining from recommending Julia over R because of the NA issue, or has DataFrames now settled on a final, good solution? Is it at least as fast (in Julia 0.7/1.0) as R’s?

Typing Vector{Union{T, Missing}} seems complex (more so than in R, I assume), whereas DataVector{T}, which means the same thing, does not. Is Julia now easy to use and develop in (not just fast) compared to R (with or without DataFrames) when it comes to Missing values?
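For reference, a minimal sketch of what working with missing looks like in plain Julia, assuming Julia 1.x where Missing is part of Base. In practice the Union type rarely has to be typed out, because Julia infers it from the literal. The alias name MissVec below is hypothetical and only illustrates that a shorter name can be defined if the long form bothers you.

```julia
# A vector with a missing entry; the Union element type is inferred.
x = [1, missing, 3]
@show eltype(x)            # Union{Missing, Int64} on a 64-bit system

# missing propagates through arithmetic, much like NA in R.
@show sum(x)               # missing

# skipmissing is roughly what na.rm = TRUE does in R.
@show sum(skipmissing(x))  # 4

# Replace missing entries with a default value.
@show coalesce.(x, 0)      # [1, 0, 3]

# Hypothetical alias (not part of any package) for the longer type name.
const MissVec{T} = Vector{Union{T, Missing}}
y = MissVec{Float64}([1.0, missing])
@show ismissing.(y)
```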

b) I know R currently has more libraries, but can you reuse all of them with RCall (as you can Python’s with PyCall/PySyntax)?

Assuming Missing values are now easy/handled, does that extend to easily interoperating with R when you have missing data? Or is it complex but at least very possible?

c) I’m confident Julia’s plotting system[s] is on a good track if not already better, but at a minimum, can you reuse [all plotting] libraries of R easily through RCall?

d) RCall seems like a cool package, with its R REPL mode. Is this just an issue that’s already fixed (can’t check, I’m not on Windows) and that someone forgot to close?

As a general comment, and as I’ve said many times in this forum before, in my personal opinion the “killer” advantage of Julia DataFrames, which is primarily thanks to the elegance of Julia itself, is that Julia dataframes are lightweight, simple to understand, and don’t involve any special data structures beyond the DataFrames themselves. To me this is certainly a mammoth advantage over the opaque and bloated pandas, which requires one to read several pages of documentation and experience with numpy to use properly. I’m much less familiar with R but I have to imagine similar statements can be made.

  • a) I’ve never seriously used R, so I’ll leave it to others to answer.
  • b) This goes along with my general comment: because DataFrames is so simple, using functions from RCall with them is about as simple as it can possibly get. You will have to be careful with missing since it can’t be converted to R, but writing functions to handle it properly (converting it to missing values in R) should be trivial.
  • c) I think the answer to this is a definite yes, but I don’t actually have experience with it.
  • d) Again, for others to answer.

(I realize my answer here wasn’t hugely useful, I was just over-eager to get in my two cents about the principal advantage of Julia DataFrames :laughing:)

Another general comment is that one of R’s main advantages is the tidyverse, especially dplyr and piping with %>%.

Julia is still building an ecosystem of similar breadth and quality. Packages like Query.jl and DataFramesMeta are on their way there, and ultimately Julia’s data managing ecosystem will be superior to R’s due to speed and the lack of opaque lazy evaluation.

You should give those packages a try; the difference will be like switching from base R to the tidyverse. But since Julia is not yet at 1.0, things are likely to break occasionally. For example, Query.jl is still adapting to the addition of Missings. This is not to say you shouldn’t explore them now!

Not quite. R data.frames are very lightweight, basically the equivalent of named tuples in Julia.

I think that Julia is adopting a lot of R’s syntax and semantics, which is a good thing. R’s data.frames are really mature and well integrated into the language.

This is something I have been thinking about.

If you look at the events and tools that RStudio puts out, it seems like a lot of it is directed at people switching from excel or SPSS to R. Shiny, and particularly RMarkdown, fill a lot of niches that one needs to leave the Office / point and click ecosystem.

Pandas’ syntax is not nearly as nice and easy as the tidyverse’s, and that likely has to do with the fact that Pandas is not trying to be an office replacement. I am glad that Julia’s dataframes ecosystem is striving for R’s great syntax even though Julia is not marketing itself towards the same market.

At the risk of derailing this thread, I see it as more of an issue that pandas relies extensively on Cython code and numpy, so absolutely everything requires specialized functions and data structures. This requires a user to have extensive knowledge of both pandas and numpy and would make it relatively difficult for the user to write naive code even if Python were reasonably performant in the first place. The sheer size of the pandas documentation, for something which essentially just does groupbys and joins, is a clear sign that something is amiss.

I don’t mind if we don’t have R’s syntax. I’m not an R user and understand it’s an ugly language… But people liking it could use RCall’s R REPL.

But semantics, yes, I understand R is good with Missing values, and that’s the only thing I’m asking here, how we compare there - already - not theoretically. I’m confident we could replicate all their libraries or just call them with RCall.jl.

Seems tidyverse maps to our package system (or just Julia Pro’s), with lots missing I presume, but at least it’s dplyr already replicated.

I do not worry about RStudio and their debugger etc. We’ll get similar tools, and maybe even RStudio support.

Let me clarify this a bit. Query.jl works with sources and sinks that use Missing (i.e. it works with DataFrames.jl v0.11). That kind of integration is done and works.

There are no plans as of right now to use the Missings story in Query.jl because the Missing design and implementation (as they exist right now) don’t compose with the design of Query.jl. If you use Query.jl as your main API for manipulating data, DataValue will continue to be the API you’ll use for missing data. So on the positive side, nothing changes and everything will continue to just work on the Query.jl side of things (which I like a lot, I try VERY hard to not break things for folks). On the negative side it is obviously less than ideal that we now have two different user-facing missing data stories around. But that is where we are. Will this situation ever change? I don’t really know. Jeff has hinted that there might be some compiler work that could be done to resolve the issues between Missing and Query.jl, but I don’t think anyone has a real design/plan for anything like that (in the sense of “is this actually feasible?”) and it seems out of the question that any of this might be Julia 1.0 stuff (if I understand the current feature freeze plans correctly).

It sounds frustrating, but at the same time, it seems pretty awesome that you can create an alternate version of missingness without performance penalty.

a) I would say it is basically the same now usability-wise. The missing data mess in DataFrames is solved now and you should not refrain from recommending Julia + DataFrames because of that. The package ecosystem as a whole needs time to catch up and mature, of course, but note that the Missing type is part of the language now in Julia 0.7/1.0.

b) Yes, you should be able to use them with RCall and it is also compatible with the new Missing type.

c) Yes, you should be able to do that. For example:

using RCall
R"library(ggplot2); qplot(data=data.frame(x=rnorm(100), y=rnorm(100)), x=x, y=y) + geom_smooth(method='loess')"

(Note that there are different ways to call R with RCall - see the docs)

d) Not sure what that issue there is/was but I can use RCall just fine on my Windows 10 machine.
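On point b), a small sketch of what a round trip with missing data might look like. It assumes RCall.jl with a working R installation and relies on RCall's conversion between Julia's missing and R's NA; the variable names are only for illustration, and the exact behaviour may depend on the RCall version.

```julia
using RCall  # assumes RCall.jl and a working R installation

# A Julia vector with a missing value.
x = [1.0, missing, 3.0]

# Copy x into the R session; the missing entry should arrive as NA.
@rput x
R"print(is.na(x))"

# Compute in R while skipping NA, then bring the result back to Julia.
m = rcopy(R"mean(x, na.rm = TRUE)")

# An R vector containing NA should come back with missing in Julia.
y = rcopy(R"c(1, NA, 3)")
```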

For an end user there shouldn’t be any frustration with Query as everything should “just work” (the internal representation doesn’t make any difference for usage).
If Query.jl does not suit your needs you might want to check out DataFramesMeta.jl.

The thing about the tidyverse is that a complete data science framework was elegantly built on top of it. From data input to analysis and visualization, great tools keep being invented.

I just found an R package called tidyquant, which smartly integrates the data science framework (tidyverse), the statistics framework (mainly time series packages), and financial tools (quantmod, performance analysis, etc.). Hopefully Julia will have an ecosystem like this.

From a CS perspective, R and Python are not so great in a lot of ways. However, their success is no coincidence. In general, it is too expensive to hire a qualified C++ programmer, and most engineers, scientists, and financial analysts do not want to think about the language itself; they just want to get things done.

Thanks for confirming e.g. a)

With that out of the way, do you see any reason whatsoever to recommend R over Julia or Julia + R with RCall.jl?

You say Julia’s “package ecosystem as a whole needs time to catch up and mature, of course”, but that’s really only if you want to rely on it and not use any of R’s?

I don’t know too much about R, but even if you’re an R user, or people recommend it to you for some [statistics] task, would you say you would recommend R over Julia based on R’s language (as opposed to libraries or other issues)?

[I’ve heard one exception, highly regulated industries, that e.g. need FDA certification of their system. They’re very conservative, and not replacing R, just adding to it may not be ok… Then learning a language/books may be an issue.]

For the time being, it may be worse to use two library ecosystems, or so people perceive, but I don’t see it that way. You can lean on R’s or Python’s as much as you like. You only need to know those other languages if you’re going to maintain the packages from other languages, or to rewrite them in Julia.

I had never programmed in Python, and I learned just enough to wrap a Python library (and probably could with R). You could also do so for R. Most wouldn’t even need to know that much, just use the wrapper.

Julia’s DataFrames are currently less usable than R’s, somewhat by design. I’m not talking about number of libraries, but the ease with which Julia DataFrames can be intuitively manipulated.

Examples:

  1. Select half the data

    julia> using DataFrames
    julia> df = DataFrame(x=1:6, y=1:6)

    6×2 DataFrames.DataFrame
    │ Row │ x │ y │
    ├────┼───┼───┤
    │ 1 │ 1 │ 1 │
    │ 2 │ 2 │ 2 │
    │ 3 │ 3 │ 3 │
    │ 4 │ 4 │ 4 │
    │ 5 │ 5 │ 5 │
    │ 6 │ 6 │ 6 │

    this works

    julia> df[1:3, :]
    3×2 DataFrames.DataFrame
    │ Row │ x │ y │
    ├─────┼───┼───┤
    │ 1 │ 1 │ 1 │
    │ 2 │ 2 │ 2 │
    │ 3 │ 3 │ 3 │

    this doesn’t

    julia> rows = 1:floor( nrow(df)/2 )
    julia> df[rows, :]

    ERROR: ArgumentError: invalid index: 1.0
    Stacktrace:
    [1] macro expansion at ./multidimensional.jl:527 [inlined]
    [2] macro expansion at ./cartesian.jl:64 [inlined]
    [3] macro expansion at ./multidimensional.jl:525 [inlined]
    [4] _unsafe_getindex! at ./multidimensional.jl:519 [inlined]
    [5] macro expansion at ./multidimensional.jl:513 [inlined]
    [6] _unsafe_getindex(::IndexLinear, ::Array{Int64,1}, ::StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}) at ./multidimensional.jl:506
    [7] macro expansion at ./multidimensional.jl:495 [inlined]
    [8] _getindex at ./multidimensional.jl:491 [inlined]
    [9] getindex(::Array{Int64,1}, ::StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}) at ./abstractarray.jl:883
    [10] copy!(::Array{Any,1}, ::Base.Generator{Array{Any,1},DataFrames.##46#47{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}) at ./abstractarray.jl:573
    [11] _collect(::Type{Any}, ::Base.Generator{Array{Any,1},DataFrames.##46#47{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}, ::Base.HasShape) at ./array.jl:396
    [12] getindex(::DataFrames.DataFrame, ::StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}, ::Colon) at /home/rh/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:274

  2. Operating on the data

    # we can extract the DataFrame's contents with
    julia> A = Array(df)
    julia> A = A .* 2

    # but we can't put them back
    julia> DataFrame( A, names(df) )

    ERROR: MethodError: no method matching DataFrames.DataFrame(::Array{Int64,2}, ::Array{Symbol,1})
    Closest candidates are:
    DataFrames.DataFrame(::AbstractArray{T<:Type,1}, ::AbstractArray{Symbol,1}, ::Integer) where T<:Type at /home/rh/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:130
    DataFrames.DataFrame(::AbstractArray{T<:Type,1}, ::AbstractArray{Symbol,1}, ::Array{Bool,1}, ::Integer) where T<:Type at /home/rh/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:154
    DataFrames.DataFrame(::AbstractArray{T,1} where T, ::AbstractArray{Symbol,1}) at /home/rh/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:125

Yes, I know that there are Julian ways to do both of these. The point is that there are intuitive operations that work in R, but not Julia. I’m not saying that Julia DataFrames should support these operations. R has a single numeric type, and creates implicit bindings inside the dataframe in order to make many operations “just work”.

Julia doesn’t do this, but in exchange we get multiple dispatch, and language functionality beyond dataframes. (Try doing complicated functional programming in R, it’s a mess.)

Nonetheless, people familiar with R have high expectations about how much of DataFrames should “just work”, and incorrect intuition about how to interact with the Julian data structure. I’ve an add-on package with some workarounds for common data munging procedures; been meaning to upload it forever.

rows = 1:nrow(df) ÷ 2
df[rows, :]  # works

You should not expect you would be understood if you speak French in the US or vice versa!

Have you tried:

rows = 1:floor(Int,nrow(df)/2)
df[rows,:]

floor() is a little bit more useful in Julia than in R :slight_smile:

Also

df2 = convert(DataFrame,A)
names!(df2,names(df))

works and is simple enough. Be a little bit more patient in the Julialand!
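The calls above match the DataFrames version used in this thread (Julia 0.6). As a hedged sketch for readers on a later release, assuming DataFrames.jl 1.x, the same round trip can be written with the matrix constructor, and the `DataFrame(A, names(df))` form from the original example is accepted as-is:

```julia
using DataFrames  # assumes DataFrames.jl 1.x

df = DataFrame(x = 1:6, y = 1:6)

# Extract the contents as a plain matrix and operate on it.
A = Matrix(df) .* 2

# Rebuild a DataFrame, reusing the original column names.
df2 = DataFrame(A, names(df))

# Broadcasting directly over the DataFrame avoids the round trip entirely.
df3 = df .* 2

# Select the first half of the rows with an integer range.
rows = 1:div(nrow(df), 2)
half = df2[rows, :]
```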

Fine if you’re working in Atom. Now how do you input ‘÷’ in vi; emacs; sublime; gedit; bash/zsh/ksh terminals, notepad; visual studio; text edit; and all the other editors people may use? I think this work-around falls under the ‘not intuitive’ category.

Even if you know it works, can you quickly and conveniently intuit the input method for whichever editor you may be using?

Compare to ‘/’. The ‘/’ operator is universal and intuitive, but in the previous use-case, behaves unexpectedly in Julia (to an R user).
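For what it is worth, the `÷` operator also has a plain-ASCII spelling, `div`, so no special input method is needed. A minimal sketch in Base Julia (assuming Julia 1.x) of why `/` surprises an R user here and which alternatives return an integer:

```julia
n = 6

# `/` on integers returns a Float64, and floor keeps that type,
# so 1:floor(n / 2) is a range of floats and cannot index rows.
@show n / 2               # 3.0
@show floor(n / 2)        # 3.0

# Asking floor for an Int gives a usable index.
@show floor(Int, n / 2)   # 3

# Integer division: div(n, 2) is the ASCII form of n ÷ 2.
@show div(n, 2)           # 3
@show n ÷ 2 == div(n, 2)  # true

rows = 1:div(n, 2)
@show collect(rows)       # [1, 2, 3]
```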

@mwsohn

Yes, I know that there are Julian ways to do both of these.

My point is that dataframes in Julia and R have very different implementations and require very different mental models for fluency. Even if there is some nice R-like syntax in the DataFrames package, the fundamental differences lead to many little bugs.

I understood you correctly the first time. My point exactly. When you are a baby, you speak like a baby. When you grow up, you don’t speak the baby language any more. If you do and complain that you are not properly understood, it’s your fault.

Who’s complaining? This is a discussion about Interoperability between R and Julia.
