DataFrame Matrix Constructor without Copying

I’m curious why the following method for creating a DataFrame using an existing matrix doesn’t exist. In some ways it seems like the simplest way to create a DataFrame.

using DataFrames
mat = rand(10000,3)
df = DataFrame(mat, [:a, :b, :c], copycols=false)

I have a large matrix and I’d like to use a DataFrame to label and operate on the columns in place, either to avoid an expensive copy or just to make modifying the original matrix more convenient through the DataFrame API.

Btw, the following workaround seems to save the copy, but I’m wondering if there’s something I’m missing:

matcols = collect(eachcol(mat))
df = DataFrame(matcols, [:a, :b, :c], copycols=false);
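A quick way to check that the workaround really shares memory with the matrix is to write through the data frame and read the matrix back. This is only a small sketch on a 5×3 matrix (the size is picked for illustration), using the same vector-of-columns constructor as above:

```julia
using DataFrames

mat = rand(5, 3)

# Each element of matcols is a view (SubArray) into one column of mat
matcols = collect(eachcol(mat))
df = DataFrame(matcols, [:a, :b, :c], copycols=false)

# Writing through the data frame column should modify mat itself
df.a[1] = -1.0
shares_memory = mat[1, 1] == -1.0
```

If `shares_memory` is `true`, the columns of `df` are aliases of the columns of `mat` rather than copies.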

In Julia, a Matrix is not just a vector of vectors. The two have different memory layouts, which means you have to copy.

This actually works in the development version; it will be released in DataFrames 1.3…
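For reference, here is a sketch of what the original call would look like once that release is available. It assumes DataFrames 1.3 or later is installed, and the write-through check at the end is my own addition, not something taken from the package docs:

```julia
using DataFrames  # assumes DataFrames >= 1.3

mat = rand(10_000, 3)

# With copycols=false the columns are expected to alias the matrix
df = DataFrame(mat, [:a, :b, :c], copycols=false)

# Modify the first column through the data frame, then inspect the matrix
df.a[1] = -1.0
aliased = mat[1, 1] == -1.0
```

On older versions the call should fail or copy, so `aliased` is only meaningful on a release that supports `copycols` for matrices.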

@pdeffebach I think the memory layout is not a problem? Since Matrix is column-major, it is straightforward to have vectors that are views of the same data as the matrix.
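To illustrate the column-major point with plain Base Julia (no DataFrames needed), here is a small sketch on a 4×3 matrix: each column occupies a contiguous run of the underlying storage, and a view of a column writes straight back into the matrix.

```julia
# 4 rows, 3 columns, filled column by column with 1.0 through 12.0
mat = reshape(collect(1.0:12.0), 4, 3)

# vec on an Array shares memory with the matrix and exposes the storage order
flat = vec(mat)

# A view of column 2, no copy is made
col2 = view(mat, :, 2)

# Column 2 sits at positions 5 through 8 of the flat storage
contiguous = flat[5:8] == col2

# Writing through the view changes the matrix (and the flat vector)
col2[1] = 100.0
```

After the last line, both `mat[1, 2]` and `flat[5]` hold `100.0`.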

This will work in DataFrames 1.3 as @sudete explained.

However, one has to bear in mind that such an approach has a limitation: many standard functions like push! or append! will not work correctly with such a data frame.
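The underlying reason can be seen in Base alone. A view into a matrix column has a fixed length, so trying to grow it throws. The sketch below only demonstrates this on a single view; I am assuming that a data frame whose columns are such views hits the same restriction.

```julia
mat = rand(4, 3)
col = view(mat, :, 1)   # fixed-length view into the first column

# Attempt to grow the view; record whether an exception was thrown
threw = try
    push!(col, 1.0)
    false
catch
    true
end
```

Here `threw` ends up `true` and `mat` keeps its original 4×3 shape, since a view cannot change the size of its parent matrix.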

This is great! Thanks @bkamins and @sudete. The memory issue didn’t seem like it would be a problem, since a dense Matrix is essentially a vector of vectors with a more constrained memory layout.

push! and append! are the sort of gotchas I was trying to think of with my workaround above. Not that I would use them in my use case, but I see how that could create major headaches in general. Could those two particular functions dispatch to vcat for the Matrix case?

No, as it would break the contract for push! and append! in Julia Base.

Why wouldn’t push! and append! work properly?

Because we would create a new vector, whereas push! and append! are functions that update existing vectors in place.
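A minimal sketch of that difference on ordinary vectors: push! returns the very same object it was given, now one element longer, while vcat allocates a new vector and leaves its input untouched.

```julia
v = [1.0, 2.0]

# push! mutates v and returns it
w = push!(v, 3.0)
same_object = w === v

# vcat builds a fresh vector; v is not modified
u = vcat(v, 4.0)
new_object = u !== v
```

At the end `v` has three elements and `u` has four, which is why silently swapping push! for vcat would change what callers can rely on.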