How to transform DataFrame back to its data source

How would you approach going back from DataFrame to a vector of Records given this code?

using DataFrames

struct Record
    A::UInt32
    B::Float32
    C::Float32
end

# get bytes from stream; Vector{UInt8}
buffer = rand(UInt8, sizeof(Record) * 5) 

# reinterpret bytes as vector of Records 
records = reinterpret(Record, buffer) 

# build dataframe
df = DataFrame(records)

# how to get back from dataframe to a vector of Records?



The following will work:

[Record(x...) for x in eachrow(df)]

It is a bit dangerous, though, because the data frame columns have to be in the same order as the struct fields. For example, the following is not what you want, and whether it throws an error or silently builds wrong records depends on the column types:

[Record(x...) for x in eachrow(select(df, :B, :A, :C))]
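To make the risk concrete, here is a minimal sketch that uses plain named tuples as stand-ins for data frame rows (an assumption, so that it runs without DataFrames). The values are made up for illustration.

```julia
struct Record
    A::UInt32
    B::Float32
    C::Float32
end

# Named tuples stand in for data frame rows here (illustrative values).
row_ok       = (A = UInt32(1), B = 2.0f0, C = 3.0f0)
row_permuted = (B = 2.0f0, A = UInt32(1), C = 3.0f0)

# Positional splatting ignores the names and only uses the order of the values.
r_ok  = Record(row_ok...)        # fields end up where they belong
r_bad = Record(row_permuted...)  # no error: B's value is converted and stored in A

# If the value landing in A cannot be converted exactly to UInt32,
# the default constructor throws an InexactError instead.
row_frac = (B = 2.5f0, A = UInt32(1), C = 3.0f0)
threw = try
    Record(row_frac...)
    false
catch err
    err isa InexactError
end
```

So a permuted row either fails loudly or, worse, produces a record with swapped values, depending on whether the conversions happen to succeed.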

The best way would probably be to define a Record constructor for reconstructing from data frame rows:

Record(row::DataFrameRow) = Record(row.A, row.B, row.C)

Record.(eachrow(select(df, :B, :A, :C)))  # does the right thing even with permuted columns

This means you need to add the whole heavy DataFrames dependency to where you define Record.
A cleaner, more general, and no-deps approach is to define a kwargs constructor:

@kwdef struct Record 
   A::UInt32
   B::Float32
   C::Float32
end

Then,

[Record(; x...) for x in eachrow(df)]

works no matter the column order.
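As a small sketch of why the keyword constructor does not care about order, the same idea can be tried with a plain named tuple in place of a data frame row (an assumption, to keep the example free of dependencies). It uses the qualified `Base.@kwdef`, since the unqualified `@kwdef` is only exported from Julia 1.9 onward; the values are made up.

```julia
Base.@kwdef struct Record
    A::UInt32
    B::Float32
    C::Float32
end

# A row whose fields are deliberately not in struct order (illustrative values).
row_permuted = (B = 2.0f0, A = UInt32(1), C = 3.0f0)

# Keyword splatting matches values to fields by name, not by position.
r = Record(; row_permuted...)
```

One thing to keep in mind: the generated keyword constructor only accepts the struct's field names, so a row with extra columns would need to be narrowed to `A`, `B`, and `C` first.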

using Tables
reinterpret(Record, rowtable(df))
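A rough sketch of what happens here, without Tables or DataFrames: `rowtable` returns a vector of named tuples, and `reinterpret` then reads the same memory as `Record`s. The vector below is a hand-written stand-in for that result (an assumption, with made-up values).

```julia
struct Record
    A::UInt32
    B::Float32
    C::Float32
end

# Stand-in for the output of `rowtable(df)`: one NamedTuple per row.
rows = [(A = UInt32(1), B = 2.0f0, C = 3.0f0),
        (A = UInt32(4), B = 5.0f0, C = 6.0f0)]

# Same field types in the same order, so both element types have the same size.
same_size = sizeof(eltype(rows)) == sizeof(Record)

# Reinterpret the existing memory as Records; no constructor is called per row.
records = reinterpret(Record, rows)
first_record = records[1]
```

Because the field names are ignored and only the memory layout counts, this relies on the columns having exactly the struct's field types and order. Wrapping the result in `collect` gives a plain `Vector{Record}` if a lazy reinterpreted array is not wanted.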

Not necessarily, as you can define the method later and elsewhere, i.e., in the code that is working with data frames already and requires that functionality.
I would probably also prefer to define constructors close to the definition of a struct, but Julia allows other options here as well.

Agreed, also more general as named tuples are more widespread and not tied to data frames.


Thank you all. I learned something from all of you.

Since column order is not a concern in my use case, I ended up using @rocco_sprmnt21’s solution because it is much more performant.

Couldn’t we write your second line simply like this:

Record.(eachrow(df))

This would then be faster than the other solutions posted so far.

Sure, just wanted to show that the reconstruction is correct even when you permute the columns.
