Stack via flatmap

using CSV, DataManipulation, StructArrays, Tables

julia> @p begin
           """
           Year,CountryName,Population!!Estimate,Population!!MarginOfError,GDP!!Estimate,GDP!!MarginOfError
           2025,China,1400000000,100000000,18000000000000,1000000000000
           2025,India,1400000000,100000000,3000000000000,1000000000000
           """
           StructArray(columntable(CSV.File(IOBuffer(__))))
           flatmap() do r
               [
                   (;r.Year, r.CountryName, k => getproperty(r, k))
                   for k in propertynames(r)[3:end]
               ]
           end
       end
8-element Vector{NamedTuple{names, Tuple{Int64, String7, Int64}} where names}:
 (Year = 2025, CountryName = "China", Population!!Estimate = 1400000000)
 (Year = 2025, CountryName = "China", Population!!MarginOfError = 100000000)
 (Year = 2025, CountryName = "China", GDP!!Estimate = 18000000000000)
 (Year = 2025, CountryName = "China", GDP!!MarginOfError = 1000000000000)
 (Year = 2025, CountryName = "India", Population!!Estimate = 1400000000)
 (Year = 2025, CountryName = "India", Population!!MarginOfError = 100000000)
 (Year = 2025, CountryName = "India", GDP!!Estimate = 3000000000000)
 (Year = 2025, CountryName = "India", GDP!!MarginOfError = 1000000000000)

This is DataFrames.stack implemented with flatmap. It works fine but I don’t like the code style.

  • It uses column names for two and indexes for the rest. I’d rather (1) use the two names and refer to the rest by omission, or (2) say 1:2 and 3:end.
  • It requires getproperty - ugly.
  • It requires a list comprehension - verbose.

Is there a nicer way to do this?

cc @aplavin

Hmm, are you sure that’s the result you want? In your example, the last field name is different for every row, it will likely be inconvenient to work with that downstream.

For example, this one would select all columns with !! for flattening, put the original column names as col and values as val:

flatmap(pairs(_[sr".*!!.*"]), (;_.Year, _.CountryName, col=_2.first, val=_2.second))

Btw, for more convenient processing, you may want to immediately combine estimate + error into one Julian object.

And one minor thing: you can just pass StructArray to CSV.read directly.

You’re right of course, I meant

flatmap() do r
    [(;r.Year, r.CountryName, :col => p, :val => getproperty(r, p))
        for p in propertynames(r)[3:end]]
end
┌───────┬─────────────┬───────────────────────────┬────────────────┐
│  Year │ CountryName │                       col │            val │
│ Int64 │     String7 │                    Symbol │          Int64 │
├───────┼─────────────┼───────────────────────────┼────────────────┤
│  2025 │       China │      Population!!Estimate │     1400000000 │
│  2025 │       China │ Population!!MarginOfError │      100000000 │
│  2025 │       China │             GDP!!Estimate │ 18000000000000 │
│  2025 │       China │        GDP!!MarginOfError │  1000000000000 │
│  2025 │       India │      Population!!Estimate │     1400000000 │
│  2025 │       India │ Population!!MarginOfError │      100000000 │
│  2025 │       India │             GDP!!Estimate │  3000000000000 │
│  2025 │       India │        GDP!!MarginOfError │  1000000000000 │
└───────┴─────────────┴───────────────────────────┴────────────────┘

or better

flatmap() do r
    [(;r.Year, r.CountryName, :col => k, :val => v)
        for (k,v) in collect(pairs(r))[3:end]]
end

which solves #2 of my three gripes.

_[sr".*!!.*"]) solves #1. And the 3-arg flatmap method solves #3.

Superb. Thank you!

Do you have in mind Measurements.jl measurement(_, _) or something else?