Construct Julia DataFrame from row data

When I have data in columns, I can construct a DataFrame like this:

using DataFrames
nameData = ["Alice", "Bob"]
surnameData = ["Smith", "Jones"]
df = DataFrame(name = nameData, surname = surnameData)

However I have data in rows:

row1 = ["Alice", "Smith"]
row2 = ["Bob", "Jones"]
df = DataFrame(???)

How can I make a DataFrame from it?

To generalize the question: I have a large amount of row-aligned data in memory, outside of Julia, and I need to construct a Julia DataFrame from it. What’s the fastest way to do that? Rows are always the same size but can contain different data types; a given column always has the same data type across all rows.

Two options:

  1. Initialize a data frame with no rows, like DataFrame(name = String[], surname = String[]), then do
for row in rows # your collection of rows
    push!(df, row)
end
  2. Write a small helper function that turns each row into a named tuple
function make_better_row(vec)
    (name = vec[1], surname = vec[2])
end

With this, you don’t have to worry about the types of the vectors when you initialize the DataFrame. You can do

df = DataFrame()
for row in rows
    push!(df, make_better_row(row))
end

This all assumes your rows come to you as vectors. But you get the idea: push a named tuple to an empty data frame (i.e. df = DataFrame()), but if you are pushing a vector or an ordinary tuple, initialize the data frame’s columns first.
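For completeness, here is a self-contained sketch of both options side by side (the `rows` collection is invented for illustration):

```julia
using DataFrames

rows = [["Alice", "Smith"], ["Bob", "Jones"]]  # hypothetical row-aligned input

# Option 1: columns typed up front, push each row vector directly
df1 = DataFrame(name = String[], surname = String[])
for row in rows
    push!(df1, row)
end

# Option 2: empty DataFrame, push named tuples built by a helper
make_better_row(vec) = (name = vec[1], surname = vec[2])
df2 = DataFrame()
for row in rows
    push!(df2, make_better_row(row))
end
```

Both loops produce the same 2×2 DataFrame; the only difference is whether the column names and types are declared up front or inferred from the first named tuple pushed.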


If you can put rows into a vector of named tuples:

julia> row1 = (name="Alice", age=20)
julia> row2 = (name="Bob",   age=30)

julia> DataFrame([row1, row2])
2×2 DataFrame
│ Row │ name   │ age   │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ Alice  │ 20    │
│ 2   │ Bob    │ 30    │

Is this of any use here?

And what is the problem with converting rows to columns?
You can push rows to a DataFrame as @pdeffebach describes, but this may not be very efficient, because it will lead to several resizings of the column vectors.
I would do something like this (I assume that rows is an array of arrays):

julia> columns = Vector{Vector}(undef, 0)
julia> for (i,r) in enumerate(first(rows))
           column = Vector{typeof(r)}(undef, length(rows))
           column .= getindex.(rows, i)
           push!(columns, column)
       end
julia> df = DataFrame(columns, column_names, copycols=false)

column_names must be a Vector{Symbol}, not a Vector{String}.
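If the names start out as strings, a one-line broadcast handles the conversion (sketch; the names here are invented):

```julia
# Broadcast Symbol over a vector of string names to get a Vector{Symbol}
column_names = Symbol.(["name", "surname"])
```

After this, `column_names` is `[:name, :surname]` and can be passed to the DataFrame constructor above.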

One thing to remember is that Julia is very good at starting from nothing and push!-ing: Julia roughly doubles the memory reserved for a vector each time it runs out of space, so its performance shouldn’t be an issue.

But this also works.


Since this push! is pushing to a DataFrame, there may be other implementation issues. At least there is a recommendation in the DataFrames manual to use the columns approach: https://juliadata.github.io/DataFrames.jl/stable/man/getting_started/#Constructing-Row-by-Row-1

It depends on the number of rows.

It was this process that I meant by “several resizings”. This is the usual behavior for arrays in most languages, and it is not as fast as it seems: each resize has to allocate a new memory block and copy the data into it, and neither is an instant operation. Suppose you already have an array of 2^30 bytes: at the next push! you may need to allocate 2 GB and copy 1 GB. So if you know the size of the array, it’s a good idea to preallocate it; sizehint! is designed just for this.
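A minimal sketch of the sizehint! idea (the element count is invented): reserving the capacity up front means the subsequent push!-es never trigger a reallocate-and-copy cycle.

```julia
v = Int[]
sizehint!(v, 1_000_000)   # one capacity reservation instead of repeated doubling
for i in 1:1_000_000
    push!(v, i)           # no reallocations happen inside this loop
end
```

The same trick applies to the column vectors above when the row count is known in advance.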


Yet another way:

rows = [["Alice", "Smith"], ["Bob", "Jones"]]
DataFrame((name = r[1], surname = r[2]) for r in rows)

Thank you for your suggestions. I’m checking them out.

Meanwhile I have continued my search and found that the CSV package also produces DataFrames and is (presumably) engineered for row-aligned data. It also accepts an IOBuffer as input. However, I encountered a problem.
This code works and produces a 2×2 DataFrame:

using CSV
rowData = "Alice,20\nBob,30"
CSV.read(IOBuffer(rowData), header=["Name","Age"])

But this code does not:

using CSV
rowData = "Alice,20\nBob,30"
io = IOBuffer()
write(io, rowData)
CSV.read(io, header=["Name","Age"])

Or rather, it produces a 0×2 DataFrame without rows. Why?

As I can potentially have a bigger dataset, I’d prefer to fill IOBuffer gradually, rather than once at initialization.

I’m guessing it’s because the io pointer is at the end of the buffer, so when the reader looks, it sees the end immediately and closes. Try seek(io, 0) (or seekstart(io)) before calling CSV.read to move the pointer back to the beginning.
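Concretely, applying that fix to the non-working snippet above (same older-style CSV.read call as in the rest of the thread):

```julia
using CSV

rowData = "Alice,20\nBob,30"
io = IOBuffer()
write(io, rowData)
seekstart(io)   # rewind: write(...) left the pointer at the end of the buffer
df = CSV.read(io, header=["Name","Age"])
```

With the rewind in place the reader sees the data from the start and produces the expected 2×2 DataFrame.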

julia> using CSV
julia> rowData = "Alice,20\nBob,30"
"Alice,20\nBob,30"
julia> io = PipeBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=false, append=true, size=0, maxsize=Inf, ptr=1, mark=-1)
julia> write(io, rowData)
15
julia> CSV.read(io, header=["Name","Age"])
2×2 DataFrames.DataFrame
│ Row │ Name   │ Age   │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ Alice  │ 20    │
│ 2   │ Bob    │ 30    │