In data science we always have a need to split data into a training set and a test set. Sometimes we will do this multiple times like in k fold cross validation, or leave one out cross validations.
In R: the typical paradigm is
train.rows = sample(1 : nrow(df), 0.8 * nrow(df), replace = F)
train_df = df[train.rows,]
test_df = df[-train.rows,]
The process of subsetting the rows actually allocates a new dataframe. Ideally there would have been a way to flag the testing rows as “hidden” with a single bit. Then the dataframe could be passed to a modeling function like glm, without any extra allocations.
Other than cross validation, this could be really useful for general dataframe operations.
Perhaps one way we could implement this is for every dataframe type to carry around a weight
integer vector. If the row is hidden, the weight would be 0, and otherwise 1. This might work well with other implementations in stats like weighted means and variances. If the weighted mean function sees a weight of 0, it would disregard that row anyway, which would be equivalent to the mask