Please recommend a Julia ecosystem for Statistics

pmarg · June 7, 2019, 6:44pm

I have, but there are some cases of reshaping that are not implemented AFAIK. For example, going from this:

4×3 DataFrame
│ Row │ ID    │ A2018 │ A2019 │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 3     │ 3     │
│ 4   │ 4     │ 4     │ 4     │

To this:

8×3 DataFrame
│ Row │ ID    │ Year  │ A     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2018  │ 1     │
│ 2   │ 1     │ 2019  │ 1     │
│ 3   │ 2     │ 2018  │ 2     │
│ 4   │ 2     │ 2019  │ 2     │
│ 5   │ 3     │ 2018  │ 3     │
│ 6   │ 3     │ 2019  │ 3     │
│ 7   │ 4     │ 2018  │ 4     │
│ 8   │ 4     │ 2019  │ 4     │

In Stata this is implemented by reshape long A, i(ID) j(Year). This is a pretty common operation on panel data sets that are stored in wide format. I think it’s on the radar since David opened an issue to look into this functionality: https://github.com/queryverse/Query.jl/issues/256.

StefanKarpinski · June 7, 2019, 7:07pm

It is simply untrue that the tidyverse is the only option for how to deal with data in R. I learned R before any of that existed and all of the stuff I used still exists and is used by many people. So if you think there is one true way to do it in R, that is only because you happen to be listening to just one predominant but relatively recent group of R developers. If there was one way to deal with data in R dictated by the core team, then the tidyverse wouldn’t exist at all.

bkamins · June 7, 2019, 8:26pm

You can use stack or melt for this, with the only exception that you will have to add one more line to strip the A from the year and convert it to an integer.

nilshg · June 7, 2019, 10:11pm

For posterity:

julia> using DataFrames

julia> df = DataFrame(ID = 1:4, A2018 = 1:4, A2019 = 1:4)
4×3 DataFrame
│ Row │ ID    │ A2018 │ A2019 │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 3     │ 3     │
│ 4   │ 4     │ 4     │ 4     │

julia> names!(stack(df, 2:3), [:year, :A, :ID])
8×3 DataFrame
│ Row │ year   │ A     │ ID    │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ A2018  │ 1     │ 1     │
│ 2   │ A2018  │ 2     │ 2     │
│ 3   │ A2018  │ 3     │ 3     │
│ 4   │ A2018  │ 4     │ 4     │
│ 5   │ A2019  │ 1     │ 1     │
│ 6   │ A2019  │ 2     │ 2     │
│ 7   │ A2019  │ 3     │ 3     │
│ 8   │ A2019  │ 4     │ 4     │

bkamins · June 7, 2019, 10:42pm

Alternatively you can pass variable_name and value_name kwargs to stack and melt.
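
A minimal sketch of that, assuming a recent DataFrames.jl release in which stack accepts these keyword arguments and stores the old column names as strings. The column names Year and A are taken from the example above, and the last step is the extra line needed to turn "A2018" into 2018:

```julia
using DataFrames

df = DataFrame(ID = 1:4, A2018 = 1:4, A2019 = 1:4)

# Stack the two year columns and name the resulting columns directly
long = stack(df, [:A2018, :A2019]; variable_name = :Year, value_name = :A)

# Year still holds the old column names ("A2018", "A2019"):
# drop the leading "A" and parse the rest as an integer
long.Year = parse.(Int, chop.(String.(long.Year); head = 1, tail = 0))

# Order rows by ID and year, as in the requested layout
sort!(long, [:ID, :Year])
```

On older releases the column order and the element type of the variable column may differ, so treat this as an illustration rather than an exact reproduction of the output above.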

davidanthoff · June 7, 2019, 11:32pm

Queryverse (this is a better link than the one posted before) is a collection of packages, and you can pick and choose which of those you want to use. Everything should interoperate with pretty much every other data package in Julia, so it is in no way a “closed” ecosystem. The package Queryverse.jl is a meta-package that pulls in all the packages that make up the Queryverse. It loads a lot of stuff. For some folks that is convenient, but I, for example, typically load the individual packages separately.

Juan · June 8, 2019, 1:50am

And I prefer data.table over tidyverse.

pmarg · June 8, 2019, 6:47am

Thanks, I didn’t know about the kwargs. Is there a way for this to work on sets of variables?

From this:

2×5 DataFrame
│ Row │ ID    │ A2018 │ A2019 │ B2018 │ B2019 │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │ 2     │ 2     │

To this:

4×4 DataFrame
│ Row │ ID    │ Year  │ A     │ B     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2018  │ 1     │ 1     │
│ 2   │ 1     │ 2019  │ 1     │ 1     │
│ 3   │ 2     │ 2018  │ 2     │ 2     │
│ 4   │ 2     │ 2019  │ 2     │ 2     │

AFAIU I would need to melt for each set separately, change the values of Year from (A2018,A2019) to (2018,2019) and then merge each melted DataFrame over (:ID,:Year).
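
Roughly, that melt-per-set-then-merge route could look like the sketch below. It assumes a recent DataFrames.jl (stack with the variable_name and value_name keywords, names returning strings, and innerjoin for the merge); stack_set is a made-up helper name, not something the package provides:

```julia
using DataFrames

df = DataFrame(ID = 1:2, A2018 = 1:2, A2019 = 1:2, B2018 = 1:2, B2019 = 1:2)

# Hypothetical helper: stack every column whose name starts with `prefix`,
# keeping only ID as the identifier column
function stack_set(df, prefix)
    cols = filter(n -> startswith(n, prefix), names(df))
    long = stack(df, cols, [:ID]; variable_name = :Year, value_name = Symbol(prefix))
    # "A2018" -> 2018: drop the prefix and parse the remainder
    long.Year = parse.(Int, chop.(String.(long.Year); head = length(prefix), tail = 0))
    return long
end

# Melt each set separately, then merge the pieces on (ID, Year)
long = innerjoin(stack_set(df, "A"), stack_set(df, "B"); on = [:ID, :Year])
sort!(long, [:ID, :Year])
```

With more than two sets of variables the join would have to be repeated for each additional prefix.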

Working on sets of variables is a very recent feature of tidyr and hasn’t been released in the stable version yet AFAIK. My point was not that Julia is years behind, but rather that OP can use RCall as an intermediate step for things that haven’t been implemented yet or it’s not clear how to implement them in Julia.

@B.Vangod This is a live demonstration of my workflow, using RCall until I learn how to do things in pure Julia and eventually get rid of R. But don’t hijack other people’s threads like I do.

I guess you have some concerns about investing time in learning a package only to find that (i) it is abandoned or poorly maintained and accumulates bugs, or (ii) it doesn’t receive much attention from the community and falls behind in terms of features. It’s very hard to predict what is going to happen to individual packages, but since Julia 1.0 was released the ecosystem has been maturing, so at the very least you shouldn’t expect packages to break. DataFrames, DataFramesMeta, Plots and the Queryverse have many contributors and have been very reliable for years, and you can easily get answers here and on Slack for anything that you are not sure how to implement.

For out-of-memory data you can use JuliaDB, JuliaDBMeta and Query from Queryverse (it works with both DataFrames and JuliaDB). This used to be my choice even for small datasets that fit in memory because I liked the syntax better. However, the transition from Julia 0.6 to Julia 1.0 was a bit tricky and slow, and I switched to DataFrames in the meantime. I think all the issues are resolved now, so you can also check this package out to see if you like it better than the DataFrames ecosystem. These packages play nicely with each other, so if you use JuliaDB but want to use a package that accepts DataFrames as an argument, you can easily convert back and forth between a DataFrame and a JuliaDB table.

bkamins · June 8, 2019, 8:05am

Yes - this is what would have to be done right now. If this feature would be useful, please open an issue on DataFrames.jl and we can think about how to implement it. Alternatively, you can stack and then unstack to get what you want:

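using DataFrames

df = DataFrame(ID=1:2, A2018=1:2, A2019=1:2, B2018=1:2, B2019=1:2)
df2 = stack(df, 2:5)  # long format: one row per ID and original column name
df2.colkey = first.(String.(df2.variable), 1)  # leading letter: "A" or "B"
df2.Year = parse.(Int, chop.(String.(df2.variable), head=1, tail=0))  # remaining digits as Int
unstack(df2, [:ID, :Year], :colkey, :value)  # spread colkey back into columns A and B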

Among the interesting upcoming features, you will soon be able to index columns using a Regex (see https://github.com/JuliaData/DataFrames.jl/pull/1819), which will make it easier to select columns that match a pattern.
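
As an illustration only, and assuming a DataFrames.jl release in which Regex column selection is available both for indexing and as the measure_vars argument of stack, pattern-based selection could look like this (the patterns are just examples for the column names used in this thread):

```julia
using DataFrames

df = DataFrame(ID = 1:2, A2018 = 1:2, A2019 = 1:2, B2018 = 1:2, B2019 = 1:2)

# Select every column whose name starts with "A"
a_cols = df[:, r"^A"]

# Stack every column named as one capital letter followed by four digits;
# ID does not match the pattern, so it is kept as an identifier column
long = stack(df, r"^[A-Z]\d{4}$")
```

This would replace the positional 2:5 in the stack call above, so the code keeps working if columns are added or reordered.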
