Proposal: Using a "mask" for DataFrames and ModelMatrices


#1

In data science we constantly need to split data into a training set and a test set. Sometimes we do this multiple times, as in k-fold cross-validation or leave-one-out cross-validation.

In R, the typical paradigm is:

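# draw 80% of the row indices without replacement
train.rows = sample(1:nrow(df), floor(0.8 * nrow(df)), replace = FALSE)
train_df = df[train.rows, ]   # copies the selected rows
test_df = df[-train.rows, ]   # copies the remaining rows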

Subsetting the rows allocates a new dataframe each time. Ideally there would be a way to flag the testing rows as “hidden” with a single bit per row. Then the dataframe could be passed to a modeling function like glm without any extra allocations.

Beyond cross-validation, this could be really useful for general dataframe operations.

Perhaps one way to implement this is for every dataframe type to carry around an integer weight vector. If a row is hidden, its weight would be 0, and otherwise 1. This might work well with other implementations in stats like weighted means and variances. If the weighted mean function sees a weight of 0, it would disregard that row anyway, which would be equivalent to the mask.
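As a rough sketch of that equivalence in Julia, using only the standard library: the names `x`, `w`, and `weighted_mean` here are made up for illustration and do not come from any stats package.

```julia
using Statistics

# Hypothetical helper for illustration (not from any package).
weighted_mean(x, w) = sum(w .* x) / sum(w)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1, 1, 0, 1, 0]              # 0 marks a hidden (test) row, 1 a visible (training) row

masked = weighted_mean(x, w)     # rows with weight 0 contribute nothing
subset = mean(x[w .== 1])        # same value, but x[w .== 1] allocates a copy

@show masked subset
```

Both values are the mean of the three visible entries, so a 0/1 weight behaves like the mask as long as the downstream function honors weights.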


#2

Something along the lines of array views would solve the issue without the need for carrying a bit for every row of the frame:

A = zeros(1000,1000) # big matrix
rows = view(A, 1:500, :) # view first 500 rows without making copies

@show sizeof(A) # 8000000
@show sizeof(rows) # 48
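The same idea extends to non-contiguous rows, so something like k-fold cross-validation could be sketched with views as well. This is only an illustration on a plain matrix: `k`, `perm`, and `folds` are names I made up, and the fold size assumes `k` divides the row count evenly.

```julia
using Random

A = rand(MersenneTwister(1), 10, 3)               # small stand-in for a data matrix
k = 5                                             # number of folds (assumed to divide the row count)
perm = randperm(MersenneTwister(2), size(A, 1))   # shuffled row indices
folds = collect(Iterators.partition(perm, size(A, 1) ÷ k))

for (i, test_idx) in enumerate(folds)
    train_idx = setdiff(perm, test_idx)           # every row not in the current fold
    train = view(A, train_idx, :)                 # no copy of the matrix data
    test  = view(A, test_idx, :)
    @assert parent(train) === A                   # the view wraps A itself
    println("fold $i: ", size(train, 1), " train rows, ", size(test, 1), " test rows")
end
```

The index vectors are still allocated for each fold, but they are small compared with copying the rows themselves.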

I don’t know if there is an equivalent for DataFrames already implemented though.


#3

Yes, views are implemented for DataFrames. In fact, all internal split-apply-combine methods for DataFrames use views. You can simply do

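trainrows = rand(nrow(df)) .< 0.8   # Boolean mask, true for roughly 80% of rows
# recent DataFrames.jl versions require both a row and a column selector
traindf = @view df[trainrows, :]
testdf = view(df, .!trainrows, :)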

The macro and function syntaxes for creating views are equivalent.
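A quick way to convince yourself of this, and to check that the view really shares memory with the original, is a small sketch like the following. It assumes a recent DataFrames.jl where `view` takes both a row and a column selector; the data and the hand-written `mask` are made up for reproducibility.

```julia
using DataFrames

df = DataFrame(x = 1:6, y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
mask = [true, true, false, true, false, true]   # hand-written instead of random

a = @view df[mask, :]          # macro syntax
b = view(df, mask, :)          # function syntax

@assert a == b                 # same rows, same values
@assert parent(a) === df       # the view wraps df rather than copying it

a.y[1] = -1.0                  # writing through the view...
@assert df.y[1] == -1.0        # ...changes the parent DataFrame
```

Because writes go through to the parent, call `copy` on the view if you need an independent table.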


#4

Wow, Julia is so amazing! I’m looking for a view for ModelMatrix, but I can’t seem to find one.