Memory-efficient melting and stacking? Reshaping Wide to Long

I need to reshape large dataframes from Wide to Long format.

The input looks like this: a unique ID per row, several columns with a year in the name, and several columns without one.

# R code
library(data.table)
my <- data.table(ID=1:5, AA_2001=rnorm(5), BB_2001=rnorm(5), BB_2002=rnorm(5), CC_2001=rnorm(5), CC_2003=rnorm(5), XX=sample(5,5))

# Julia code
using DataFrames
WideDF1 = DataFrame(ID=1:5, AA_2001=randn(5), BB_2001=randn(5), BB_2002=randn(5), CC_2001=randn(5), CC_2003=randn(5), XX=rand(1:5,5))


ID    AA_2001    BB_2001    BB_2002     CC_2001    CC_2003 XX
1  1.6045664 -1.0564860  0.4180849  2.05674938 -0.2889240  4
2  0.2872760  0.6051825 -0.9855341 -1.07268299 -0.4726669  2
3 -0.6698151  0.8730640 -0.7799342 -1.42078540  0.7218626  5
4  1.1806024  1.0190259 -1.5307503 -0.01904991 -1.0250547  1
5 -0.3484850  0.7155088 -0.8453928 -1.10998317  1.4003273  3

And I want to convert it to:

ID XX    year         AA         BB          CC
1  4     2001  1.6045664 -1.0564860  2.05674938
1  4     2002         NA  0.4180849          NA
1  4     2003         NA         NA -0.28892400
2  2     2001  0.2872760  0.6051825 -1.07268299
2  2     2002         NA -0.9855341          NA
2  2     2003         NA         NA -0.47266691
3  5     2001 -0.6698151  0.8730640 -1.42078540
3  5     2002         NA -0.7799342          NA
3  5     2003         NA         NA  0.72186256
4  1     2001  1.1806024  1.0190259 -0.01904991
4  1     2002         NA -1.5307503          NA
4  1     2003         NA         NA -1.02505470
5  3     2001 -0.3484850  0.7155088 -1.10998317
5  3     2002         NA -0.8453928          NA
5  3     2003         NA         NA  1.40032729

I was doing this in R, but the real data is about 1 GB or more: it loads fine, but reshaping it produces memory errors. So I have had to split it, reshape each chunk, and merge the chunks back together. That works, but it is too slow.

Now I’m trying to move to Julia.

In order to compare times and memory, I first show two versions in R:

# Direct version but slow and needs a lot of memory.
RlongA <- function(my) {
  idvars <- grep("_20[0-9][0-9]$",names(my) , invert = TRUE, value=T)    
  temp <- melt(my, id.vars = idvars)                                     
  temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable),  yyear = sub('.*_', '', variable), variable = NULL)]  
  temp[,var:=factor(var, levels=unique(var))]                            
  return(dcast( temp,   ... ~ var, value.var='value' ))
}
# Alternative, less memory.
# This method keeps the original column order
RLongB <- function(my) {
    mcols <- grep("_20[0-9][0-9]$",names(my), value = T )           # columns to be reshaped
    temp <- melt(my, id.vars = "ID", measure.vars = mcols)          # melt with only one column ID
    temp[, c("variable", "yyear") := tstrsplit(variable, "_")]      		# split column names
    temp[, variable:=factor(variable, levels=unique(variable))]     # coerce names to factors in the same order as my
    return(my[,.SD, .SDcols=-mcols][dcast(temp, ... ~ variable), on = "ID"])            # final dcast and right join  !!!
}

I’m new to Julia, but I managed a rough translation:

function JLongA(Wide)
    tempnames=names(Wide)
    scols=tempnames[Not(occursin.(r"_20[0-9][0-9]$",tempnames))]
    tempDF=stack(Wide, Not(scols), scols)    # melt(Wide, scols) in older DataFrames versions
    tempDF=hcat(tempDF, DataFrame(reduce(vcat, permutedims.(split.(tempDF.variable, '_'))), [:name, :year]))
    tempDF=select!(tempDF, Not(:variable))
    unstack(tempDF, vcat(scols, "year"),  :name, :value)
end
# Alternative
function JLongB(Wide)
    tempnames=names(Wide)
    mcols=tempnames[occursin.(r"_20[0-9][0-9]$",tempnames)]
    scols=tempnames[Not(occursin.(r"_20[0-9][0-9]$",tempnames))]
    tempDF=stack(Wide, mcols, :ID)
    tempDF=hcat(tempDF, DataFrame(reduce(vcat, permutedims.(split.(tempDF.variable, '_'))), [:name, :year]))
    tempDF=select!(tempDF, Not(:variable))
    tempDF2=unstack(tempDF, [:ID, :year], :name, :value)
    return innerjoin(tempDF2, Wide[!,scols], on=:ID)
end

In order to benchmark it I have also created a larger toy example:

# R
library(data.table); library(stringi)
nn <- 200   # cols (of each kind)
mm <- 200   # rows
WideDF <- as.data.table(matrix(rnorm(2*nn*mm),mm,2*nn))
nam1 <- paste0(stri_rand_strings(nn, 6, pattern = "[a-z]"),"_",sample(2000:2010,nn,T) )
nam2 <-  paste0("Z", stri_rand_strings(nn, 6, pattern = "[a-z]") )
setnames(WideDF,c(nam1,nam2))
WideDF[,ID:=1:mm]
# Julia version.
using DataFrames, Random, StatsBase
NN=200
MM=200
# Be careful with possible duplicated names, because they are generated randomly.
nam=vcat([randstring('a':'z',6) for _ ∈ 1:NN] .* '_' .* string.(sample(2000:2010, NN)), 'Z' .* [randstring('a':'z',6) for _ ∈ 1:NN])
WideDF2=DataFrame(randn(MM, 2*NN), nam)
insertcols!(WideDF2, 1, :ID => 1:MM)

With nn=mm=200 and the “direct code”, my computer needs 1.3 s and 80 MB in R, and 23 s and 14 GB in Julia.
How can I improve it?
The Julia “direct code” can’t be applied to larger datasets on my computer.

And with nn=mm=200 and the “alternative code”, it needs 100 ms and 22 MB in R, and 29 ms and 43 MB in Julia.
How can I improve it?

With the “alternative code” and nn=2000, mm=2000, R needs 6 s and 2.1 GB, and Julia 4.6 s and 3.9 GB. The direct code doesn’t work here.
But what I really need is nn=2000 and mm=200000. The problem is memory, not time.

I think the best option would be to do it with a database program, or something like JuliaDB directly on disk, but they don’t seem to provide these commands.

I have not tried it out, but my approach would be to stack the original DataFrame, then split the column holding the original column names (e.g. AA_2001) into two parts, followed by an unstack.

Another option would be not to use stack and unstack, but instead to do everything manually, with loops populating a preallocated dataframe of the proper dimensions.
But I still need to figure out how to do that, and I don’t know whether it would be faster.

It’s not a big problem to do everything manually, but that won’t save you from running out of memory.

Here is some somewhat hairy code:

using BenchmarkTools
using DataFrames
using Random
using StableRNGs
using StatsBase

function gen_example(; NN = 200, MM = 200, rng = Random.GLOBAL_RNG)
    # Be careful with possible duplicated names, because they are generated randomly.
    nam = vcat([randstring(rng, 'a':'z',6) for _ ∈ 1:NN] .* '_' .* string.(sample(rng, 2000:2010, NN)), 'Z' .* [randstring(rng, 'a':'z',6) for _ ∈ 1:(NN)])
    WideDF2 = DataFrame( randn(rng, MM, 2*NN), nam)
    insertcols!(WideDF2, 1, :ID => 1:MM)

    return WideDF2
end

# repin: repeat each element of v n times, keeping copies adjacent: a,a,…,b,b,…
function repin(v, n)
    w = Vector{eltype(v)}(undef, length(v) * n)
    idx = 1
    @inbounds @simd for x in v
        for j in 1:n
            w[idx] = x
            idx += 1
        end
    end
    return w
end

# repout: repeat the whole vector v n times: a,b,…,a,b,…
function repout(v, n)
    w = Vector{eltype(v)}(undef, length(v) * n)
    idx = 1
    @inbounds @simd for j in 1:n
        for x in v
            w[idx] = x
            idx += 1
        end
    end
    return w
end

# repmerge: expand v1 over the v2 grid, keeping each value only where v2 == val
function repmerge(v1, v2, val)
    w = Vector{Union{eltype(v1), Missing}}(undef, length(v1)*length(v2))
    idx = 1
    @inbounds @simd for v in v1
        for x in v2
            w[idx] = x == val ? v : missing
            idx += 1
        end
    end
    return w
end

function repmerge!(w, v1, v2, val)
    idx = 1
    @inbounds @simd for v in v1
        for x in v2
            if x == val
                w[idx] = v
            end
            idx += 1
        end
    end
    return w
end

function JLongC(df)
    tempnames = names(df)
    mcols=tempnames[occursin.(r"_20[0-9][0-9]$",tempnames)]
    scols=tempnames[Not(occursin.(r"_20[0-9][0-9]$",tempnames))]
    splitted = map(x -> (x[1], parse(Int, x[2])), split.(mcols, "_"))
    years = unique(map(x -> x[2], splitted))

    resdf = DataFrame(scols[1] => repin(df[!, scols[1]], length(years)))
    for i in 2:length(scols)
        resdf[!, scols[i]] = repin(df[!, scols[i]], length(years))
    end
    resdf[!, :year] = repout(years, size(df, 1))
    for i in 1:length(mcols)
        colname, year = splitted[i]
        origname = mcols[i]
        if colname in names(resdf)
            repmerge!(resdf[!, colname], df[!, origname], years, year)
        else
            resdf[!, colname] = repmerge(df[!, origname], years, year)
        end
    end
    resdf
end

df = gen_example(NN = 200, MM = 200);

julia> @btime JLongB($df);
  30.448 ms (515721 allocations: 49.56 MiB)

julia> @btime JLongC($df);
  7.081 ms (306583 allocations: 23.59 MiB)

I think you can try to wrap all the vectors in Mmap; maybe that will help you overcome the memory problem.
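Something along these lines is what I mean (a rough, untested sketch: Mmap needs a plain bits element type, so NaN stands in for missing, and the file name is just an example):

using Mmap

# Back a result column with a disk file instead of RAM.
function mmap_column(path, n)
    io = open(path, "w+")
    v = Mmap.mmap(io, Vector{Float64}, n)
    fill!(v, NaN)    # NaN plays the role of missing here
    return v
end

col = mmap_column("AA.bin", 10^6)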


As a side note, maybe it makes sense to take another look at the original task you are trying to solve, because this structure is very inefficient memory-wise. You are basically building a sparse matrix with a dense representation, so of course it will require a huge amount of memory. This is sort of a “data smell”, and a good reason to rethink your approach.


I’m doing it because I have datasets with the structure:
ID fixedvars… yearvars…
and I want to fit regression models with random effects, and other things, using:
ID fixedvars1 fixedvars2 YEAR
That is, using the year explicitly, and treating the yearvars as the same variable regardless of the year.
Sometimes, instead of “_year”, there are other suffixes indicating another grouping you want to reshape on.

I’m not 100% sure I understand your end data frame. I echo the comments above that this is not a useful form of the data, and I’m not sure how it helps you estimate fixed effects.

If you are running a regression using GLM or anything like it in the Julia ecosystem, you typically want one row of your dataframe to correspond to one observation. If your observations are at the ID-YEAR level, then I disagree with the other posters saying that your end result is incorrect, because it seems like this is what you would want. However, the other columns do have a somewhat weird sparsity structure for that (you’d have to drop most of the data?).

How much RAM do you have?


I’m fitting regression models with repeated measures, such as:
V ~ X + Y + Z + X·Y + Year + (1|ID)
but with more variables and complexity.
And also Lasso, ElasticNet, survival models…

Most of my rows contain a lot of missings, but that’s not a problem for many regression libraries, and it is compensated by the large number of rows.

Some of my variables are categorical and occur very few times, so sampling is not always good.
I’m comparing different ways to analyse these datasets: all together, partitioned, … But first I need to cleanse the data.

Here is a solution:

julia> function differentunstack(wide)
       long1 = stack(wide, Not([:ID, :XX]))
       s = split.(long1.variable, "_")
       long1.company = getindex.(s, 1)
       long1.year = getindex.(s, 2)
       unstack(long1, [:ID, :year], :company, :value)
       end;

julia> differentunstack(WideDF1)
15×5 DataFrame
│ Row │ ID    │ year     │ AA        │ BB          │ CC         │
│     │ Int64 │ SubStri… │ Float64?  │ Float64?    │ Float64?   │
├─────┼───────┼──────────┼───────────┼─────────────┼────────────┤
│ 1   │ 1     │ 2001     │ -0.780419 │ -1.29812    │ -0.0343812 │
│ 2   │ 1     │ 2002     │ missing   │ -1.55086    │ missing    │
│ 3   │ 1     │ 2003     │ missing   │ missing     │ -0.220337  │
│ 4   │ 2     │ 2001     │ -0.542673 │ 0.651154    │ 1.05305    │
│ 5   │ 2     │ 2002     │ missing   │ 1.30521     │ missing    │
│ 6   │ 2     │ 2003     │ missing   │ missing     │ 0.40769    │
│ 7   │ 3     │ 2001     │ 0.745416  │ 0.855534    │ 0.792923   │
│ 8   │ 3     │ 2002     │ missing   │ 0.196397    │ missing    │
│ 9   │ 3     │ 2003     │ missing   │ missing     │ -0.961542  │
│ 10  │ 4     │ 2001     │ -1.20617  │ 0.350795    │ -0.468888  │
│ 11  │ 4     │ 2002     │ missing   │ 1.13082     │ missing    │
│ 12  │ 4     │ 2003     │ missing   │ missing     │ 0.204436   │
│ 13  │ 5     │ 2001     │ -0.111618 │ -0.00867672 │ 1.74978    │
│ 14  │ 5     │ 2002     │ missing   │ -1.00071    │ missing    │
│ 15  │ 5     │ 2003     │ missing   │ missing     │ 0.0496623  │

It doesn’t include the :XX variable, which is a bit disappointing, but hopefully you can use this function to get all the other variables as well; see the join sketched below.
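For instance, the per-ID columns such as :XX could be brought back with a join on :ID (a sketch, not benchmarked):

julia> leftjoin(differentunstack(WideDF1), WideDF1[!, [:ID, :XX]], on = :ID)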


There are a couple of ideas that can help.

  1. I do not know much about the packages that fit regression models, but I think that internally they work with matrices. DataFrames are really nice for wrangling data, but they add considerable overhead. So, if you are really limited in resources, you can do all the heavy lifting manually. The code gets rather lengthy compared to a DataFrame solution, but the final result is relatively lightweight. Here is an example, a slight modification of the previous code, except that instead of a DataFrame it returns a Matrix{Union{Missing, Float64}}:
# repin2!/repout2!/repmerge2!: same as repin/repout/repmerge above, but writing into column i of matrix A
function repin2!(A, i, v, n)
    idx = 1
    @inbounds @simd for x in v
        for j in 1:n
            A[idx, i] = x
            idx += 1
        end
    end
    return A
end

function repout2!(A, i, v, n)
    idx = 1
    @inbounds @simd for j in 1:n
        for x in v
            A[idx, i] = x
            idx += 1
        end
    end
    return A
end

function repmerge2!(A, i, v1, v2, val)
    idx = 1
    @inbounds @simd for v in v1
        for x in v2
            if x == val
                A[idx, i] = v
            end
            idx += 1
        end
    end
    return A
end

function JLongD(df)
    tempnames = names(df)
    mcols=tempnames[occursin.(r"_20[0-9][0-9]$",tempnames)]
    scols=tempnames[Not(occursin.(r"_20[0-9][0-9]$",tempnames))]
    splitted = map(x -> (x[1], parse(Int, x[2])), split.(mcols, "_"))
    years = unique(map(x -> x[2], splitted))
    nyears = length(years)
    nmcols = length(unique(map(x -> x[1], splitted)))

    A = Matrix{Union{Float64, Missing}}(undef, size(df, 1)*nyears, length(scols) + nmcols + 1)   # +1 for the year column
    A .= missing

    for i in eachindex(scols)
        repin2!(A, i, df[!, scols[i]], length(years))
    end
    offset = length(scols) + 1
    repout2!(A, offset, years, size(df, 1))

    # map each base column name to a matrix column index (after the id columns)
    dd = Dict{String, Int}()
    res = Vector{Int}(undef, length(mcols))
    cnt = 0
    for (i, el) in enumerate(splitted)
        if !haskey(dd, el[1])
            cnt += 1
            dd[el[1]] = cnt
        end
        res[i] = dd[el[1]]
    end
    res .+= offset

    for i in eachindex(res)
        _, year = splitted[i]
        origname = mcols[i]
        repmerge2!(A, res[i], df[!, origname], years, year)
    end
    A
end

and here is a comparison

df = gen_example(NN = 2000, MM = 2000);
@btime JLongC($df);
  1.465 s (30071057 allocations: 2.27 GiB)

@btime JLongD($df);
  424.955 ms (54077 allocations: 758.58 MiB)

Three times faster than the DataFrame solution, and three times less memory. Also, maybe the regression package can work with sparse data; then you could use sparse matrices and save some memory, as sketched below.
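A minimal sketch of that idea (it assumes the missing cells can simply be left out; the row/column/value vectors here are toy numbers, built the same way JLongD computes its indices):

using SparseArrays

rows = [1, 2, 3]            # row index = (id - 1) * nyears + year index
cols = [1, 1, 2]            # column index of the base variable
vals = [1.6, 0.29, -1.06]   # the observed values
A = sparse(rows, cols, vals, 15, 3)   # 5 IDs × 3 years, 3 variables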

  2. If you are really tight on resources, you can think about other approaches to your task. If you can reformulate your question as a minimization problem, then you can use SGD (stochastic gradient descent). Then you will be able to process your data in batches and convert the original data to the needed format on the fly, as in the sketch below.
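Roughly like this (a sketch only: update! is a hypothetical per-batch SGD step, and JLongD is the function above):

# Reshape and consume the wide table one chunk of rows at a time,
# so only batchsize rows are ever in long format at once.
function process_in_batches(df, batchsize, model)
    for lo in 1:batchsize:nrow(df)
        hi = min(lo + batchsize - 1, nrow(df))
        batch = JLongD(df[lo:hi, :])   # long-format matrix for this chunk
        update!(model, batch)          # hypothetical SGD update
    end
    return model
end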

Anyway, sizeof(JLongD(df))/1024/1024 = 671.7 MB, so if you have 100x the rows, the size of your final table would be ~70 GB.


Great, thank you.
How do you know the column names?

One of the reasons for reshaping my data is that it initially has too many columns, for example 2000, and it’s very tedious to inspect them.
If I convert it to long format I may reduce the number to, say, 300, which is much more workable.
The idea is to transform it and save it to disk.
Then I can load just the subset of columns I need as a dataframe, for example 20, for further analysis.
Sometimes there are additional problems; for example, the columns can use different codifications or formats even when they are supposed to be the same.
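For the save-to-disk step I am thinking of something like Arrow.jl (a sketch I have not tried on data this size; longdf, the file name, and the column names are just placeholders):

using Arrow, DataFrames

Arrow.write("long.arrow", longdf)   # write the long table once
tbl = Arrow.Table("long.arrow")     # memory-mapped, nothing loaded yet
sub = DataFrame(ID = tbl.ID, year = tbl.year, AA = tbl.AA)   # materialize only these columns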

Your new example works on my computer with Julia v1.6, but strangely I don’t see such a big difference between JLongC and JLongD with NN=MM=2000.
For me JLongC is:
654.882 ms (6045054 allocations: 945.05 MiB)
and JLongD is:
391.984 ms (29577 allocations: 757.23 MiB)
It seems that JLongC is not that bad with v1.6.