To generalize the question: I have a large amount of row-aligned data in memory, but outside of Julia, and I need to construct a Julia DataFrame from it. What's the fastest way to do that? Rows are always the same size but can contain different data types. A given column always has the same data type in every row.
Initialize a data frame with no rows, like DataFrame(name = String[], surname = String[]), then do
for row in rows # your collection of rows
    push!(df, row)
end
Write a small helper function that turns each row into a named tuple
function make_better_row(vec)
    (name = vec[1], surname = vec[2])
end
With this, you don't have to worry about the types of the vectors when you initialize the data frame. You can do
df = DataFrame()
for row in rows
    push!(df, make_better_row(row))
end
This all assumes that each of your rows arrives as a vector. But you get the idea: push a named tuple to an empty data frame (i.e. df = DataFrame()), but if you are pushing a vector or an ordinary tuple, initialize the data frame's columns first.
And what is the problem with converting rows to columns?
You can push rows to a DataFrame as @pdeffebach describes, but this may not be very efficient, because it will lead to several resizes of the column vectors.
I would do something like this (I assume that rows is an array of arrays):
julia> columns = Vector{Vector}(undef, 0)
julia> for (i,r) in enumerate(first(rows))
           column = Vector{typeof(r)}(undef, length(rows))
           column .= getindex.(rows, i)
           push!(columns, column)
       end
julia> df = DataFrame(columns, column_names, copycols=false)
column_names must be Vector{Symbol}, not Vector{String}.
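For reference, here is a self-contained variant of the column-wise construction above, as a minimal sketch. The helper name rows_to_dataframe and the sample rows are hypothetical, and the sketch assumes that the first row is representative of every column's type (for example, no missing values further down).

```julia
using DataFrames

# Hypothetical helper: builds typed columns from a vector of equally sized rows.
function rows_to_dataframe(rows, column_names)
    nrows = length(rows)
    columns = Vector{Vector}(undef, 0)
    for (i, value) in enumerate(first(rows))
        # Take the column type from the first row; each column is allocated once.
        column = Vector{typeof(value)}(undef, nrows)
        for (j, row) in enumerate(rows)
            column[j] = row[i]
        end
        push!(columns, column)
    end
    # copycols=false reuses the freshly built vectors instead of copying them.
    return DataFrame(columns, column_names, copycols=false)
end

# Hypothetical sample data: each row is a Vector{Any} with a fixed layout.
rows = [Any["Alice", 20], Any["Bob", 30], Any["Carol", 40]]
df = rows_to_dataframe(rows, [:Name, :Age])
```

Filling each column by index avoids the temporary array that getindex.(rows, i) creates, at the cost of a slightly longer loop.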
One thing to remember is that Julia is very good at starting from nothing and push!ing to it: Julia doubles the memory used by a vector each time it runs out of space, so performance shouldn't be an issue.
Since this push! is pushing to a DataFrame, there may be other implementation issues. At least, the DataFrames manual recommends the column approach: Getting Started · DataFrames.jl
It was this process that I meant by "several resizes". This is the usual behavior for arrays in most languages, and it is not as fast as it seems. Each time, you need to allocate a new memory block and copy the data to it, and neither operation is instant. Suppose you already have a full array of 2^30 bytes: the next push! has to allocate 2 GB and copy 1 GB. So if you know the size of the array, it's a good idea to preallocate it. sizehint! is designed for just this.
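As a rough illustration of the two ways to avoid repeated resizing when the final length is known, here is a minimal sketch using only Base Julia. The length n and the variable names are made up for the example.

```julia
# Hypothetical row count, known before the loop starts.
n = 1_000_000

# Option 1: reserve capacity up front, then push! as usual.
reserved = Int[]
sizehint!(reserved, n)
for i in 1:n
    push!(reserved, i)
end

# Option 2: allocate the final length at once and fill by index.
preallocated = Vector{Int}(undef, n)
for i in 1:n
    preallocated[i] = i
end
```

Both end up with the same contents. Which one is faster for a given workload is something to measure rather than assume.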
Thank you for your suggestions. I'm checking them out.
Meanwhile, I continued my search and found that the CSV package also produces DataFrames and is (presumably) engineered for row-aligned data. It also accepts an IOBuffer as input. However, I encountered a problem.
This code works and produces a 2x2 DataFrame:
using CSV
rowData = "Alice,20\nBob,30"
CSV.read(IOBuffer(rowData), header=["Name","Age"])
I'm guessing it's because the io pointer is at the end of the buffer, so when the reader looks, it sees the end immediately and closes. Try seek(io, 0) before calling CSV.read() to move the pointer back to the beginning.
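The effect of the read position can be seen without CSV at all. The following is a minimal sketch using only Base Julia, and it assumes the buffer was filled with write before being read, which is only a guess about the failing case.

```julia
io = IOBuffer()
write(io, "Alice,20\nBob,30")   # the position is now at the end of the buffer

at_end = read(io, String)       # nothing is left to read from here

seek(io, 0)                     # rewind; seekstart(io) is equivalent
from_start = read(io, String)   # the full contents are available again
```

Here at_end is an empty string, while from_start holds both lines, so any reader handed the buffer before the seek would see no data.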