Mapping Vector{MyType} to a DataFrame

I have a vector whose elements are a custom type (a struct).
Suppose I want the data in a DataFrame. Can I do this more efficiently than below, i.e. without copying data?

Of course, I could generate a DataFrame in the beginning (instead of a Vector of some type). But a struct provides more flexibility and can hold, e.g., a Matrix as well.

using BenchmarkTools 
using DataFrames 

struct Mine
    a::Int 
    b::Float64
    c::Matrix{Float64}
end 

myVector=Vector{Mine}()
for i=1:1000000
    push!(myVector,Mine(rand(1:10),rand(),rand(2,2)))
end

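function flatten_to_DF(x)
    df = DataFrame()
    for h in ["a", "b"]
        hSymbol = Symbol(h)
        # df[!, col] = v stores v as a column; the old df[col] = v form is no longer supported
        df[!, hSymbol] = map(z -> getfield(z, hSymbol), x)
    end
    return df
end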

@btime d=flatten_to_DF(myVector)

Hmmm, I am not sure whether the DataFrame is copying the result of your map (an unnecessary copy). The setindex! documentation seems to be under construction. Try extracting the two desired columns with maps like the one you used, but into intermediate Vectors (instead of directly into the DataFrame), and then use the DataFrame constructor, passing a Vector of the columns, a Vector of the column names, and the keyword argument copycols = false, to guarantee that at least this copy is not made.

I would also suggest that you check the time using a loop instead of map. The loop would push! the desired fields of each struct to separate Vectors (declare them with the field type for more efficiency), and you would not need to iterate over the Vector of structs multiple times (once for each field), as the map does.

If I were not in a hurry I would have coded it for you, as I am curious about the size of the difference, but this is the best I have to offer now.

The StructsOfArrays module may also be of interest to you.

Thanks. I am not quite sure I understand what you mean, but I have two alternative versions: one performs worse and the other performs similarly.
I am not sure where I would use the copycols argument you mention.

using BenchmarkTools 
using DataFrames 
using Random 

struct Mine
    a::Int 
    b::Float64
    c::Matrix{Float64}
end 

Random.seed!(2)
myVector=Vector{Mine}()
for i=1:1000000
    push!(myVector,Mine(rand(1:10),rand(),rand(2,2)))
end

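function flatten_to_DF(x)
    df = DataFrame()
    for h in ["a", "b"]
        hSymbol = Symbol(h)
        # df[!, col] = v stores v as a column; the old df[col] = v form is no longer supported
        df[!, hSymbol] = map(z -> getfield(z, hSymbol), x)
    end
    return df
end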

function flatten_to_DF2(x)
    sz = length(x)
    # Preallocate both columns, then fill them element by element through the DataFrame
    df = DataFrame(a=zeros(typeof(x[1].a), sz), b=zeros(typeof(x[1].b), sz))
    @inbounds for i = 1:sz
        df[i, :a] = x[i].a
        df[i, :b] = x[i].b
    end
    return df
end

function flatten_to_DF3(x)
    df = DataFrame()
    for h in ["a", "b"]
        hSymbol = Symbol(h)
        # Same as flatten_to_DF, but with the mapped column in an intermediate Vector
        v0 = map(z -> getfield(z, hSymbol), x)
        df[!, hSymbol] = v0
    end
    return df
end

@btime d=flatten_to_DF(myVector) #36ms 30MB
@btime d=flatten_to_DF2(myVector) #82ms 60MB
@btime d=flatten_to_DF3(myVector) #36ms 30MB

I will get back to you in about an hour, an hour and a half; I need to get lunch at the university’s cafeteria.

See about the copycols here.
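
To make the effect of copycols concrete, here is a minimal sketch, assuming DataFrames.jl 1.x. The vectors as and bs are made-up sample data, not the data from the question. It uses the constructor form described earlier (a Vector of columns plus a Vector of column names) and checks whether the DataFrame shares memory with the input vectors.

```julia
using DataFrames

# Made-up sample columns (illustration only).
as = [1, 2, 3]
bs = [0.1, 0.2, 0.3]

# Typed literal on purpose: a plain [as, bs] would promote both
# vectors to Vector{Float64}, which already copies `as`.
cols = AbstractVector[as, bs]

df_copied = DataFrame(cols, [:a, :b])                    # copycols = true is the default
df_shared = DataFrame(cols, [:a, :b]; copycols = false)  # reuses the input vectors

# Assigning with df[!, col] = v also stores the vector itself.
df_assigned = DataFrame()
df_assigned[!, :a] = as

as[1] = 99  # mutate the source vector after construction

println("copied:   ", df_copied.a[1])       # unaffected by the mutation
println("shared:   ", df_shared.a[1])       # sees the mutation
println("assigned: ", df_assigned.a === as) # same object as the source
```

With copycols = false the DataFrame and the source vector are the same object, so later mutation of one is visible through the other. That is the price of skipping the copy.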

How’s this?

using DataFrames, Random

struct Mine
    a::Int 
    b::Float64
    c::Matrix{Float64}
end   

 
Random.seed!(999)  # this is just my seed number that I've always used.  it's tradition

v = [Mine(rand(1:10), rand(), rand(2,2)) for i ∈ 1:10^6]

function flatten(::Type{DataFrame}, v::AbstractVector{<:Mine})
    nt = (a=map(x -> x.a, v), b=map(x -> x.b, v))
    DataFrame(nt, copycols=false)      
end

df = flatten(DataFrame, v)

This method copies the data. Your data already belongs to your Mine structs, which are not arranged as a set of Arrays, so the only way to avoid copying is some kind of view. For example:

using LazyArrays

function flatten_nocopy(::Type{DataFrame}, v::AbstractVector{<:Mine})
    nt = (a=BroadcastArray((x -> x.a), v), b=BroadcastArray((x -> x.b), v))
    DataFrame(nt, copycols=false)
end

The construction of this is indeed way faster:

julia> @btime flatten_nocopy(DataFrame, v);
  1.262 μs (35 allocations: 2.03 KiB)

julia> @btime flatten(DataFrame, v);
  9.599 ms (39 allocations: 30.52 MiB)

I was going to say that access times on the uncopied version might be slightly slower, but I just tested it and nope, they are equivalent:

julia> df = flatten(DataFrame, v);

julia> dfc = flatten_nocopy(DataFrame, v);

julia> @btime df[1, :a];
  51.861 ns (1 allocation: 16 bytes)

julia> @btime dfc[2, :a];
  52.284 ns (1 allocation: 16 bytes)

Nice job by the Julia compiler people, as usual.
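
To show what "some kind of view" means without depending on LazyArrays, here is a minimal sketch using only Base. FieldView is a hypothetical wrapper type written for this illustration; it is not part of any package. It presents one field of each struct as an AbstractVector and reads through to the original vector on every access.

```julia
struct Mine
    a::Int
    b::Float64
    c::Matrix{Float64}
end

# Hypothetical wrapper (illustration only): a lazy, read-only vector of
# field F (a Symbol) taken from each element of `parent`. No data is copied.
struct FieldView{T, F, V<:AbstractVector} <: AbstractVector{T}
    parent::V
end

# Convenience constructor: look up the field type from the element type.
FieldView(v::AbstractVector{S}, field::Symbol) where {S} =
    FieldView{fieldtype(S, field), field, typeof(v)}(v)

# Minimal AbstractArray interface for a read-only vector.
Base.size(fv::FieldView) = size(fv.parent)
Base.IndexStyle(::Type{<:FieldView}) = IndexLinear()
Base.getindex(fv::FieldView{T, F}, i::Int) where {T, F} =
    getfield(fv.parent[i], F)::T

v = [Mine(1, 2.0, zeros(2, 2)), Mine(3, 4.0, zeros(2, 2))]
as = FieldView(v, :a)
println(collect(as))

# The view reads through to `v`, so replacing an element is visible.
v[1] = Mine(10, 2.0, zeros(2, 2))
println(as[1])
```

Because FieldView is an AbstractVector, it could in principle be used as a DataFrame column with copycols=false, in the same way as the BroadcastArray above. The trade-off is the same: construction is nearly free, but every access goes through the struct vector.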


Seems like you already took my job of implementing it, XD.

I just do not understand why you did not use copycols in the first excerpt (each map returned a Vector that is not referenced outside the method and will be garbage collected, no? It could instead be passed to the DataFrame with copycols = false), or why your method receives a type T but does not use it anywhere.

The LazyArrays package is interesting, thanks for mentioning it.

Ugh, these were both silly mistakes I made because I was fast and careless. Changes above. Sorry.

I actually messed something else up as well: because of copycols (which is the thing I kept forgetting about because it used to be default), you can’t quite do what I was suggesting generically. You’d need a special method for DataFrame to ensure that copycols=false. So, it’s a little less generic than I was hoping for.

My last addition to the bunch:

struct Mine
    a::Int 
    b::Float64
    c::Matrix{Float64}
end   
 
using Random

Random.seed!(999)  # this is just my seed number that I've always used.  it's tradition

v = [Mine(rand(1:10), rand(), rand(2,2)) for i ∈ 1:10^6]

using DataFrames

function flatten_maps(v::AbstractVector{<:Mine})
    nt = (a=map(x -> x.a, v), b=map(x -> x.b, v))
    DataFrame(nt, copycols=false)      
end

function flatten_loop(v)
    vl = length(v)
    as = Vector{Int}(undef, vl)
    bs = Vector{Float64}(undef, vl)
    for i = 1:vl
      as[i] = v[i].a
      bs[i] = v[i].b
    end
    DataFrame((a = as, b = bs); copycols = false)      
end

using LazyArrays

function flatten_lazy(v::AbstractVector{<:Mine})
    nt = (a=BroadcastArray((x -> x.a), v), b=BroadcastArray((x -> x.b), v))
    DataFrame(nt, copycols=false)
end

using BenchmarkTools

println("flatten_maps")
@btime flatten_maps(v)
println("flatten_loop")
@btime flatten_loop(v)
println("flatten_lazy")
@btime flatten_lazy(v)

If this is run with julia -O3 --check-bounds=no ./test.jl on my machine (an Intel(R) Core™ i7-7700HQ CPU @ 2.80GHz) it gives:

flatten_maps
  8.078 ms (35 allocations: 15.26 MiB)
flatten_loop
  4.565 ms (33 allocations: 15.26 MiB)
flatten_lazy
  1.227 μs (35 allocations: 2.03 KiB)

I’m pretty confused as to why it’s so much slower with map, to be honest.

TBH I am not entirely sure. My go-to assumption is that the vector is already big enough to be affected by the cache, so it is better to iterate over the struct vector once and collect both fields in a single pass than to iterate over it twice.

Is there a native cache-miss utility in Julia? Or would I need to wrap each @btime in its own script and run perf on it?
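
Short of hardware counters, one rough way to probe the single-pass hypothesis is to strip DataFrames out of the picture and time only the extraction step. The sketch below uses Base only; two_pass and one_pass are illustrative names, and the vector size is an arbitrary choice. It does not measure cache misses, it only isolates the two access patterns.

```julia
struct Mine
    a::Int
    b::Float64
    c::Matrix{Float64}
end

# Two passes over the struct vector, one per field (what the map version does).
two_pass(v) = (map(x -> x.a, v), map(x -> x.b, v))

# One pass that fills both output vectors at once (what the loop version does).
function one_pass(v)
    as = Vector{Int}(undef, length(v))
    bs = Vector{Float64}(undef, length(v))
    for (i, x) in enumerate(v)
        as[i] = x.a
        bs[i] = x.b
    end
    return as, bs
end

v = [Mine(rand(1:10), rand(), rand(2, 2)) for _ in 1:10^5]

# Warm up both functions so compilation time is not measured.
two_pass(v); one_pass(v)

t_two = @elapsed two_pass(v)
t_one = @elapsed one_pass(v)
println("two passes: ", t_two, " s; one pass: ", t_one, " s")
```

A single @elapsed sample is noisy, so for real measurements @btime from BenchmarkTools (as used above) is the better tool. Varying the vector length would show whether the gap only appears once the data no longer fits in cache.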

@ExpandingMan, @Henrique_Becker thank you both!

Similar to @ExpandingMan’s answer, here’s a version that hooks more directly into the Tables.jl interface, which has pretty well optimized code to do just what you’re asking. In this code, we’re telling Tables.jl/DataFrames.jl that a Vector{Mine} is in fact a “table” and defining the couple of interface methods so that calling DataFrame(v) “just works”.

using DataFrames, Random, Tables, BenchmarkTools

struct Mine
    a::Int
    b::Float64
    c::Matrix{Float64}
end

Tables.istable(::Type{Vector{Mine}}) = true
Tables.rowaccess(::Type{Vector{Mine}}) = true
Tables.rows(x::Vector{Mine}) = x
Tables.schema(x::Vector{Mine}) = Tables.Schema((:a, :b), Tuple{Int, Float64})

Random.seed!(999)

v = [Mine(rand(1:10), rand(), rand(2,2)) for i ∈ 1:10^6];
df = DataFrame(v)

This results in pretty fast code (though @ExpandingMan’s idea of using LazyArrays is also worth considering!).

julia> @btime DataFrame(v);
  2.992 ms (39 allocations: 15.26 MiB)
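
To see what the interface definitions above provide on their own, here is a small sketch that goes through Tables.jl without DataFrames. It repeats the same four method definitions on a two-element made-up vector and asks Tables.jl for a column table (a NamedTuple of vectors). Only the fields named in the schema are expected to be extracted, so the Matrix field c is left out.

```julia
using Tables

struct Mine
    a::Int
    b::Float64
    c::Matrix{Float64}
end

# Same interface methods as in the post above.
Tables.istable(::Type{Vector{Mine}}) = true
Tables.rowaccess(::Type{Vector{Mine}}) = true
Tables.rows(x::Vector{Mine}) = x
Tables.schema(x::Vector{Mine}) = Tables.Schema((:a, :b), Tuple{Int, Float64})

# Made-up sample data (illustration only).
v = [Mine(1, 2.0, zeros(2, 2)), Mine(3, 4.0, zeros(2, 2))]

# Materialize the rows as a NamedTuple of column vectors.
ct = Tables.columntable(v)
println(keys(ct))
println(ct.a)
println(ct.b)
```

Any Tables.jl sink should be able to consume v in the same way; DataFrame(v) is one such sink.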

StructArrays was defined for this. It’s like implementing @jquinn’s example, but without doing the definitions yourself.

using DataFrames, Random, StructArrays

struct Mine
    a::Int 
    b::Float64
    c::Matrix{Float64}
end   

 
Random.seed!(999)  # this is just my seed number that I've always used.  it's tradition

v = [Mine(rand(1:10), rand(), rand(2,2)) for i ∈ 1:10^6]

t = StructArray(v)

df = DataFrame(t)
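
One detail worth knowing about the snippet above: a StructArray holds one array per field, so DataFrame(t) should produce a column for the Matrix field c as well. Here is a minimal sketch, assuming DataFrames.jl 1.x and a recent StructArrays.jl, with made-up sample data, that shows this and builds a DataFrame from only a and b without a further copy.

```julia
using DataFrames, StructArrays

struct Mine
    a::Int
    b::Float64
    c::Matrix{Float64}
end

# Made-up sample data (illustration only).
v = [Mine(i, i / 2, zeros(2, 2)) for i in 1:5]

# Converting copies each field into its own array once.
t = StructArray(v)

# All three fields become columns, including the Matrix field c.
df_all = DataFrame(t)
println(names(df_all))

# Pick only the scalar fields and reuse the StructArray's field arrays.
df_ab = DataFrame((a = t.a, b = t.b); copycols = false)
println(names(df_ab))
```

If the data can live in a StructArray from the start, instead of being converted from a Vector{Mine}, the per-field arrays already exist and building the DataFrame with copycols = false requires no copying at all.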