Hello.
Some time ago I dealt with the problem of converting a large dataset from Wide format to Long format, sometimes called pivoting.
I did that first steps in R but it’s too slow.
Let’s say we have this toy example in R:
mydata ← data.frame(ID=1:5, ZA_1=1:5,
ZA_2=5:1,BB_1=rep(3,5),BB_2=rep(6,5),CC_7=6:2)
ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1 1 5 3 6 6
2 2 4 3 6 5
3 3 3 3 6 4
4 4 2 3 6 3
5 5 1 3 6 2
ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1 1 5 3 6 6
2 2 4 3 6 5
3 3 3 3 6 4
4 4 2 3 6 3
5 5 1 3 6 2
The result is going to be something like this:
ID measure ZA BB CC
1 1 1 3 NA
1 2 5 6 NA
1 7 NA NA 6
2 1 2 3 NA
2 2 4 6 NA
2 7 NA NA 5
3 1 3 3 NA
3 2 3 6 NA
3 7 NA NA 4
14 1 4 3 NA
14 2 2 6 NA
14 7 NA NA 3
15 1 5 3 NA
15 2 1 6 NA
15 7 NA NA 2
There are some variables that will remain as is (here only ID) and some that will be transformed to long format (here all other variables, all ending with _1, _2 or _7).
I need an automated way to do it because I need to apply the method to datasets with thousands of variables.
Uwe Block at stackoverflow wrote an efficient way to do it with R that doesn’t need as much memory the usual two-steps suggested method (first getting a very long format and then reshaping it to the shape we want):
setDT(mydata)
add unique row number to join on later
(leave
ID
col as placeholder for all other id.vars)mydata[, rn := seq_len(.N)]
define columns to be reshaped
measure_cols ← stringr::str_subset(names(mydata), “_\d$”)
melt with only one id.vars column
molten ← melt(mydata, id.vars = “rn”, measure.vars = measure_cols)
split column names of measure.vars
Note that “variable” is reused to save memory
molten[, c(“variable”, “measure”) := tstrsplit(variable, “_”)]
coerce names to factors in the same order as the columns appeared in mydata
molten[, variable := forcats::fct_inorder(variable)]
remove columns no longer needed in mydata before joining to save memory
mydata[, (measure_cols) := NULL]
final dcast and right join
result ← mydata[dcast(molten, … ~ variable), on = “rn”]
result
What would be the best way to do it in Julia?