DataFrame construction from array of tuples

Hi all. What’s the best way to convert an array of tuples into a DataFrame?

For example, I’d like to construct a data frame with column names :a and :b for the data below:

julia> data = [(1,2),(4,5)]
2-element Array{Tuple{Int64,Int64},1}:
 (1, 2)
 (4, 5)
1 Like

df = DataFrame(a = [x[1] for x in data], b = [x[2] for x in data]) would work

function tupe2mat(data)
       n = length(data[1])
       m = ones(length(data), n)
       for i in 1:length(data)
            m[i,:] = [t for t in data[i]]
       end
       return m
end
df = DataFrame(map(idx -> getindex.(data, idx), eachindex(data)), [:a, :b])

Thanks. That’s what I thought as well but I felt it’s somewhat inefficient as it has to go through a loop for each comprehension. The real case is that I have 100k rows and many columns. Further, having to type x[1] for x in data, x[2] for x in data, etc. is a bit tiresome…

Hi,

I couldn’t get this to work with a more elaborate example:

julia> data = [("a",1,2),("b",4,5),("c",6,7),("d",8,9)]
4-element Array{Tuple{String,Int64,Int64},1}:
 ("a", 1, 2)
 ("b", 4, 5)
 ("c", 6, 7)
 ("d", 8, 9)

julia> DataFrame(map(idx -> getindex.(data, idx), eachindex(data)), [:a, :b, :c])
ERROR: BoundsError: attempt to access ("a", 1, 2)
  at index [4]
Stacktrace:
 [1] getindex(::Tuple{String,Int64,Int64}, ::Int64) at ./tuple.jl:21
 [2] broadcast_t(::Function, ::Type{Any}, ::Tuple{Base.OneTo{Int64}}, ::CartesianRange{CartesianIndex{1}}, ::Array{Tuple{String,Int64,Int64},1}, ::Int64) at ./broadcast.jl:258
 [3] broadcast_c at ./broadcast.jl:321 [inlined]
 [4] broadcast(::Function, ::Array{Tuple{String,Int64,Int64},1}, ::Int64) at ./broadcast.jl:455
 [5] collect_to!(::Array{Array{T,1} where T,1}, ::Base.Generator{Base.OneTo{Int64},##19#20}, ::Int64, ::Int64) at ./array.jl:508
 [6] collect_to!(::Array{Array{String,1},1}, ::Base.Generator{Base.OneTo{Int64},##19#20}, ::Int64, ::Int64) at ./array.jl:518
 [7] _collect(::Base.OneTo{Int64}, ::Base.Generator{Base.OneTo{Int64},##19#20}, ::Base.EltypeUnknown, ::Base.HasShape) at ./array.jl:489
 [8] map(::Function, ::Base.OneTo{Int64}) at ./abstractarray.jl:1868

My bad, it should be,
df = DataFrame(map(idx -> getindex.(data, idx), eachindex(first(data))), Names).
The length is given by the size of the elements in data not the length of data itself.

3 Likes

For what it’s worth, here is a solution using array comprehensions only.

df = DataFrame([[data[k][kk] for k in 1:length(data)] for kk in 1:length(data[1])], [:a, :b, :c])

Just a thought: it seems to be the inverse operation of the zip function. I was thinking there should be already a function for that, and I looked it up before coming with a custom solution to it. I may not be the only person. So what I am wondering is: should one call that function unzip and make a PR to, for instance, DataFrames,jl? Thoughts?

1 Like

I needed this and this thread came on top in Google, so here are some nice alternatives for reference:

data = [(1,2), (4,5)]

# To use each tuple as a row
DataFrame(NamedTuple{(:a, :b)}.(data))

# or
DataFrame((; a, b) for (a,b) in data)

# To use each tuple as a column
DataFrame(collect.(data), [:a, :b])

# Edit: or to avoid a useless copy
DataFrame(collect.(data), [:a, :b], copycols=false)
9 Likes

@sijo, thanks for sharing this excellent stuff.

After playing with tuple data similar to @tk3369’s second example above, there seems to be a massive time difference between using tuples as rows versus as columns in building the DataFrames. Wondering if there are better solutions for the columns case?

using DataFrames

function tuple2df_rows(data,Names)
    df = DataFrame(data)
    rename!(df, Names)
end

function tuple2df_columns(data, Names)
   df = DataFrame(collect.(data))
   insertcols!(df, 1, :name =>Names)
   return df
end

vstr = ("a", "b","c","d","e")
data = [(rand(vstr),rand(Int8),rand(Int8)) for _ in 1:100_000]
Names = ["Type","Outcome-1","Outcome-2"]

using BenchmarkTools
@btime tuple2df_rows($data,$Names)  # 336 μs (81 allocations: 981.4 KiB)
@btime tuple2df_columns($data,$Names)   # 56.88 ms (799033 allocations: 51.5 MiB)

julia> data
100000-element Vector{Tuple{String, Int8, Int8}}:
 ("a", -59, -106)
 ("c", 30, -24)
 ("c", 31, -108)
 ("d", -96, 90)
 ("c", -57, 70)
 ("b", -15, 47)
 ("c", -108, -125)
 ("e", 16, -23)
 ("b", -2, -77)
 ("a", 127, 70)
 ⋮
 ("e", -81, -99)
 ("a", -119, -60)
 ("a", 108, 21)
 ("d", -38, -11)
 ("c", 100, -91)
 ("c", -36, -115)
 ("d", 76, 13)
 ("b", 105, 94)
 ("b", -42, 42)

julia> tuple2df_rows(data,Names)
100000×3 DataFrame
    Row │ Type    Outcome-1  Outcome-2 
        │ String  Int8       Int8      
────────┼──────────────────────────────
      1 │ a             -59       -106
      2 │ c              30        -24
      3 │ c              31       -108
      4 │ d             -96         90
      5 │ c             -57         70
      6 │ b             -15         47
      7 │ c            -108       -125
      8 │ e              16        -23
   ⋮    │   ⋮         ⋮          ⋮
  99994 │ a             108         21
  99995 │ d             -38        -11
  99996 │ c             100        -91
  99997 │ c             -36       -115
  99998 │ d              76         13
  99999 │ b             105         94
 100000 │ b             -42         42
                     99985 rows omitted

julia> tuple2df_columns(data)
3×100001 DataFrame
 Row │ name       x1    x2   x3    x4   x5   x6   x7    x8   x9   x10  x11   x12  x13   x14  x15  x16   x17  x18  x19   ⋯
     │ String     Any   Any  Any   Any  Any  Any  Any   Any  Any  Any  Any   Any  Any   Any  Any  Any   Any  Any  Any   ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────  
   1 │ Type       a     c    c     d    c    b    c     e    b    a    a     d    c     e    e    c     a    d    e     ⋯ 
   2 │ Outcome-1  -59   30   31    -96  -57  -15  -108  16   -2   127  85    119  12    113  117  -106  -53  -86  126    
   3 │ Outcome-2  -106  -24  -108  90   70   47   -125  -23  -77  70   -122  44   -114  65   -51  1     73   -97  -26
                     99981 columns omitted
        

The column case in this example is a bit unusual: 100,000 columns of only three values which are of different type. I think it doesn’t fit well the design of DataFrames.jl: normally you want all values in a column to have the same type. DataFrames.jl stores data as a list of columns (a vector per column). Here each column has to be of type Any, so every value is boxed which is very inefficient.

But DataFrame(collect.(data)) works fine if each column (each tuple in data) has all values of the same type:

# This doesn't print well in the REPL: tuples are not abbreviated like arrays
data_columns = [tuple((row[i] for row in data)...) for i in 1:3];

# Just to see what is inside `data_columns`
julia> collect.(data_columns)
3-element Vector{Vector{T} where T}:
 ["a", "e", "c", "b", "b", "d", "b", "d", "a", "e"  …  "e", "b", "c", "c", "c", "c", "b", "b", "b", "c"]
 Int8[-108, -127, 27, 123, -73, 77, 126, 12, -70, -101  …  -78, 105, -90, -81, -114, 41, 57, -63, -19, -77]
 Int8[-66, -77, 51, 5, 27, -62, 0, 11, -107, -75  …  115, 5, -85, 93, -75, -97, -92, 52, -121, -116]

It is the same data but organized as three tuples of 100,000 values. This gives good performance:

tuple2df_columns2(data, names) = DataFrame(collect.(data), names)

julia> @btime tuple2df_rows(data, Names)
  457.372 μs (79 allocations: 981.38 KiB)

julia> @btime tuple2df_columns2(data_columns, Names)
  411.203 μs (53 allocations: 1.91 MiB)

And it can be further improved by letting the data frame store directly the vectors created by collect:

tuple2df_columns2_nocopy(data, names) = DataFrame(collect.(data), names, copycols=false)

julia> @btime tuple2df_columns2_nocopy(data_columns, Names)
  314.103 μs (47 allocations: 979.73 KiB)
4 Likes

@sijo, it is a bit late here but if I am not mistaken the last examples provided they aim at further improving the function tuple2df_rows(), but not the other one sought.
I.e., the functions are labelled as “columns2” but they result in “row” DataFrames:

julia> @btime tuple2df_columns2_nocopy($data_columns, $Names)
  220.100 μs (47 allocations: 979.73 KiB)
100000×3 DataFrame
    Row │ Type    Outcome-1  Outcome-2 
        │ String  Int8       Int8      
────────┼──────────────────────────────
      1 │ a             -59       -106
      2 │ c              30        -24
      3 │ c              31       -108
      4 │ d             -96         90
      5 │ c             -57         70
      6 │ b             -15         47
      7 │ c            -108       -125
      8 │ e              16        -23
      9 │ b              -2        -77
     10 │ a             127         70
     11 │ a              85       -122
   ⋮    │   ⋮         ⋮          ⋮
  99991 │ a             -76         46
  99992 │ e             -81        -99
  99993 │ a            -119        -60
  99994 │ a             108         21
  99995 │ d             -38        -11
  99996 │ c             100        -91
  99997 │ c             -36       -115
  99998 │ d              76         13
  99999 │ b             105         94
 100000 │ b             -42         42
                     99979 rows omitted

Also, the cost of converting the original 100000-element vector of tuples to data_columns (3-element vector of tuples) seems to be quite high, and not factored-in in the benchmark above.

@rafael.guerra ah so you really want your table to look like this:

"a"	"b"	"c"
10	20	30

This is not a case I had in mind… I think with DataFrames.jl it’s better to work with the transposed data:

"a"	10
"b"	20
"c"	30

Then each column is a vector with uniform concrete type and performance will be much better.

I’d be curious to see a real-life problem where working with the transpose doesn’t work well…

2 Likes

There’s also the rather simple

julia> rename(DataFrame([("a", 1), ("b", 2)]), [:x, :y])
2×2 DataFrame
 Row │ x       y     
     │ String  Int64 
─────┼───────────────
   1 │ a           1
   2 │ b           2

Not sure if a copy of the df is made when renaming; if it is you can fix that by assigning the original df to a variable and then calling rename! on it.