Converting from Wide to Long format

Hello.

Some time ago I dealt with the problem of converting a large dataset from Wide format to Long format, sometimes called pivoting.
I did that first steps in R but it’s too slow.

Let’s say we have this toy example in R:

mydata ← data.frame(ID=1:5, ZA_1=1:5,
ZA_2=5:1,BB_1=rep(3,5),BB_2=rep(6,5),CC_7=6:2)

ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1 1 5 3 6 6
2 2 4 3 6 5
3 3 3 3 6 4
4 4 2 3 6 3
5 5 1 3 6 2

ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1    1    5    3    6    6
2    2    4    3    6    5
3    3    3    3    6    4
4    4    2    3    6    3
5    5    1    3    6    2

The result is going to be something like this:

ID measure ZA BB CC
 1 1        1  3 NA
 1 2        5  6 NA
 1 7       NA NA  6
 2 1        2  3 NA
 2 2        4  6 NA
 2 7       NA NA  5
 3 1        3  3 NA
 3 2        3  6 NA
 3 7       NA NA  4
14 1        4  3 NA
14 2        2  6 NA
14 7       NA NA  3
15 1        5  3 NA
15 2        1  6 NA
15 7       NA NA  2

There are some variables that will remain as is (here only ID) and some that will be transformed to long format (here all other variables, all ending with _1, _2 or _7).

I need an automated way to do it because I need to apply the method to datasets with thousands of variables.

Uwe Block at stackoverflow wrote an efficient way to do it with R that doesn’t need as much memory the usual two-steps suggested method (first getting a very long format and then reshaping it to the shape we want):

setDT(mydata)

add unique row number to join on later

(leave ID col as placeholder for all other id.vars)

mydata[, rn := seq_len(.N)]

define columns to be reshaped

measure_cols ← stringr::str_subset(names(mydata), “_\d$”)

melt with only one id.vars column

molten ← melt(mydata, id.vars = “rn”, measure.vars = measure_cols)

split column names of measure.vars

Note that “variable” is reused to save memory

molten[, c(“variable”, “measure”) := tstrsplit(variable, “_”)]

coerce names to factors in the same order as the columns appeared in mydata

molten[, variable := forcats::fct_inorder(variable)]

remove columns no longer needed in mydata before joining to save memory

mydata[, (measure_cols) := NULL]

final dcast and right join

result ← mydata[dcast(molten, … ~ variable), on = “rn”]
result

What would be the best way to do it in Julia?

See stack and melt in the DataFrames manual.