Construct Julia DataFrame from row data

nnn · March 19, 2020, 7:06pm

When I have data in columns, I can construct a DataFrame like this:

using DataFrames
nameData = ["Alice", "Bob"]
surnameData = ["Smith", "Jones"]
df = DataFrame(name = nameData, surname = surnameData)

However I have data in rows:

row1 = ["Alice", "Smith"]
row2 = ["Bob", "Jones"]
df = DataFrame(???)

How can I make a DataFrame from it?

To generalize the question: I have a large amount of row-aligned data in memory, but outside of Julia, and I need to construct a Julia DataFrame from it. What’s the fastest way to do that? Rows are always the same size but can contain different data types. A given column always has the same data type across all rows.

pdeffebach · March 19, 2020, 7:31pm

Two options:

Initialize a data frame with no rows, like DataFrame(name = String[], surname = String[]), then do

for row in rows # your collection of rows
    push!(df, row)
end

Write a small helper function that turns each row into a named tuple

function make_better_row(vec)
    (name = vec[1], surname = vec[2])
end

With this, you don’t have to worry about the types of the vectors when you initialize the data frame. You can do

df = DataFrame()
for row in rows
    push!(df, make_better_row(row))
end

This all assumes that each of your rows arrives as a vector. But you get the idea: push a named tuple to an empty data frame (i.e. df = DataFrame()); if you are pushing a vector or an ordinary tuple, initialize the data frame’s columns first.
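Both options above can be sketched side by side. This is a minimal illustration, assuming the rows arrive as a vector of two-element string vectors; the variable `rows` and the helper name `to_named_row` are made up for the example.

```julia
using DataFrames

# Assumed input: each row is a vector of two strings.
rows = [["Alice", "Smith"], ["Bob", "Jones"]]

# Option 1: declare typed, empty columns first, then push plain vectors.
df1 = DataFrame(name = String[], surname = String[])
for row in rows
    push!(df1, row)
end

# Option 2: start from a completely empty data frame and push named tuples;
# the column names and element types are taken from the first pushed row.
to_named_row(vec) = (name = vec[1], surname = vec[2])  # hypothetical helper
df2 = DataFrame()
for row in rows
    push!(df2, to_named_row(row))
end
```

Both loops should end with the same two-row, two-column data frame; the difference is only where the column names and types are declared.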

lbilli · March 19, 2020, 7:40pm

If you can put rows into a vector of named tuples:

julia> row1 = (name="Alice", age=20)
julia> row2 = (name="Bob",   age=30)

julia> DataFrame([row1, row2])
2×2 DataFrame
│ Row │ name   │ age   │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ Alice  │ 20    │
│ 2   │ Bob    │ 30    │
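If the rows arrive as plain vectors rather than named tuples, one way to get into this shape is sketched below using only Base Julia. The names `column_names` and `raw_rows` are assumptions for the example, not part of the original post.

```julia
# Assumed inputs: column names known up front, rows as untyped vectors.
column_names = (:name, :age)
raw_rows = [Any["Alice", 20], Any["Bob", 30]]

# Tuple(r) recovers the concrete element types of each row, and
# NamedTuple{column_names} attaches the names in order.
named_rows = [NamedTuple{column_names}(Tuple(r)) for r in raw_rows]
```

The resulting `named_rows` vector can then be passed to `DataFrame(named_rows)` as in the example above.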

johnh · March 19, 2020, 8:35pm

Is this of any use here

waralex · March 19, 2020, 9:28pm

And what is the problem with converting rows to columns?
You can push rows to a DataFrame as @pdeffebach describes, but this may not be very efficient, because it will cause the column vectors to be resized several times.
I would do something like this (I assume that rows is an array of arrays):

julia> columns = Vector{Vector}(undef, 0)
julia> for (i,r) in enumerate(first(rows))
           column = Vector{typeof(r)}(undef, length(rows))
           column .= getindex.(rows, i)
           push!(columns, column)
       end
julia> df = DataFrame(columns, column_names, copycols=false)

column_names must be Vector{Symbol}, not Vector{String}.
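The same row-to-column transposition can also be written with comprehensions. This is only a sketch in Base Julia, assuming `rows` is a non-empty vector of equally long vectors; each column is built in one pass at its final length, so nothing is resized along the way.

```julia
# Assumed input: rows with mixed element types, stored as untyped vectors.
rows = [Any["Alice", 20], Any["Bob", 30]]

ncols = length(first(rows))

# One comprehension per column; Julia narrows each column's element type
# to what the values actually are (String and Int here).
columns = [[row[i] for row in rows] for i in 1:ncols]
```

The resulting `columns` can be handed to the `DataFrame(columns, column_names, copycols=false)` call shown above.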

pdeffebach · March 19, 2020, 9:40pm

One thing to remember is that Julia is very good at starting from nothing and push!ing elements: Julia doubles the memory used by a vector each time it runs out of space, so its performance shouldn’t be an issue.

But this also works.

Skoffer · March 19, 2020, 9:46pm

Since this push! is pushing to a DataFrame, there may be other implementation issues. At least, the DataFrames manual recommends the column-based approach: Getting Started · DataFrames.jl

waralex · March 19, 2020, 10:09pm

It depends on the number of rows

This is the process I meant by “several resizes”. It is the usual behavior for arrays in most languages, and it is not as fast as it seems. Each time, a new memory block has to be allocated and the data copied into it, and neither operation is instant. Suppose you already have an array of 2^30 bytes: at the next push! that triggers a resize, you need to allocate 2 GB and copy 1 GB. So if you know the size of the array, it’s a good idea to preallocate it. sizehint! is designed for exactly this.
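A small sketch of the two preallocation styles mentioned here, in Base Julia only. The row count `n` and the variable names are assumptions for illustration.

```julia
n = 1_000  # assumed number of rows, known before filling

# Style 1: keep using push!, but reserve capacity up front with sizehint!.
ages = Int[]
sizehint!(ages, n)
for i in 1:n
    push!(ages, i)
end

# Style 2: allocate the vector at its final length and assign by index.
labels = Vector{String}(undef, n)
for i in 1:n
    labels[i] = "person$i"
end
```

Whether the difference matters in practice depends on the number of rows, so it is worth timing both on realistic data.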

tkf · March 19, 2020, 10:18pm

Yet another way:

rows = [["Alice", "Smith"], ["Bob", "Jones"]]
DataFrame((name = r[1], surname = r[2]) for r in rows)

nnn · March 20, 2020, 4:21pm

Thank you for your suggestions. I’m checking them out.

Meanwhile, I have continued my search and found that the CSV package also produces DataFrames and is (presumably) engineered for row-aligned data. It also accepts an IOBuffer as input. However, I ran into a problem.
This code works and produces a 2×2 DataFrame:

using CSV
rowData = "Alice,20\nBob,30"
CSV.read(IOBuffer(rowData), header=["Name","Age"])

But this code does not:

using CSV
rowData = "Alice,20\nBob,30"
io = IOBuffer()
write(io, rowData)
CSV.read(io, header=["Name","Age"])

Or rather, it produces a 0×2 DataFrame without rows. Why?

As I can potentially have a bigger dataset, I’d prefer to fill the IOBuffer gradually, rather than all at once at initialization.

kevbonham · March 21, 2020, 12:36am

I’m guessing it’s because the io pointer is at the end of the buffer, so when the reader looks, it sees the end immediately and closes. Try seek(io, 0) before calling CSV.read() to move the pointer back to the beginning.
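The pointer behavior can be checked without CSV at all. Here is a minimal Base-only sketch, with variable names chosen just for the example; `seekstart(io)` does the same thing as `seek(io, 0)`.

```julia
io = IOBuffer()

# Fill the buffer gradually, one row at a time.
write(io, "Alice,20\n")
write(io, "Bob,30\n")

# After writing, the position sits at the end, so a reader sees no data.
at_end_before = eof(io)

# Rewind to the beginning before handing the buffer to a reader.
seekstart(io)
at_end_after = eof(io)

content = read(io, String)
```

After the rewind, the buffer can be passed to `CSV.read` as in the question. Note that recent CSV.jl releases expect the sink type as a second argument, i.e. `CSV.read(io, DataFrame; header=["Name", "Age"])`.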

waralex · March 21, 2020, 8:34pm

julia> using CSV
julia> rowData = "Alice,20\nBob,30"
"Alice,20\nBob,30"
julia> io = PipeBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=false, append=true, size=0, maxsize=Inf, ptr=1, mark=-1)
julia> write(io, rowData)
15
julia> CSV.read(io, header=["Name","Age"])
2×2 DataFrames.DataFrame
│ Row │ Name   │ Age   │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ Alice  │ 20    │
│ 2   │ Bob    │ 30    │
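A Base-only sketch of why PipeBuffer fits the "fill gradually" use case: writes are appended at the end while reads start from the beginning, so no seek is needed. The loop and variable names are assumptions for illustration.

```julia
io = PipeBuffer()

# Append rows one at a time, each terminated by a newline.
for line in ("Alice,20", "Bob,30")
    write(io, line, '\n')
end

# Reading starts at the first unread byte, with no rewind required.
content = read(io, String)
```

Keep in mind that reading from a PipeBuffer consumes the data, so the same buffer cannot be read twice.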
