Proposal: Using a "mask" for DataFrames and ModelMatrices

jeffwong · September 8, 2017, 2:37am

In data science we constantly need to split data into a training set and a test set. Sometimes we do this multiple times, as in k-fold cross-validation or leave-one-out cross-validation.

In R, the typical paradigm is

train.rows = sample(seq_len(nrow(df)), floor(0.8 * nrow(df)), replace = FALSE)
train_df = df[train.rows,]
test_df = df[-train.rows,]

Subsetting the rows actually allocates a new dataframe. Ideally there would be a way to flag the test rows as “hidden” with a single bit each. Then the dataframe could be passed to a modeling function like glm without any extra allocations.

Other than cross validation, this could be really useful for general dataframe operations.

Perhaps one way to implement this is for every dataframe type to carry around an integer weight vector. If a row is hidden, its weight would be 0, and otherwise 1. This might work well with other implementations in stats like weighted means and variances. If the weighted mean function sees a weight of 0, it disregards that row anyway, which is equivalent to the mask.
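
To make the equivalence concrete, here is a minimal sketch in plain Julia. The helper `weighted_mean` and the sample values are made up for illustration and are not part of any package.

```julia
x = [2.0, 4.0, 6.0, 8.0, 10.0]
w = [1, 1, 0, 1, 0]              # 0 marks a "hidden" row, 1 a visible one

# Hypothetical helper: a plain weighted mean
weighted_mean(x, w) = sum(w .* x) / sum(w)

# Mean over only the visible rows (this indexing copies; it is here just for comparison)
visible = x[w .== 1]

m_weighted = weighted_mean(x, w)
m_masked = sum(visible) / length(visible)
```

Both values come out the same, because rows with weight 0 contribute nothing to either the numerator or the denominator.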

juliohm · September 8, 2017, 3:42am

Something along the lines of array views would solve the issue without the need for carrying a bit for every row of the frame:

A = zeros(1000,1000) # big matrix
rows = view(A, 1:500, :) # view first 500 rows without making copies

@show sizeof(A) # 8000000
@show sizeof(rows) # 48
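
A small sketch of what "without making copies" means in practice, using only Base Julia (the variable names are arbitrary): writing through the view changes the parent array, while an ordinary slice is an independent copy.

```julia
A = zeros(4, 3)

v = view(A, 1:2, :)    # view: shares memory with A
v[1, 1] = 5.0          # writes through to A

B = A[1:2, :]          # ordinary indexing: allocates a copy
B[2, 2] = 7.0          # does not touch A

shares_memory = parent(v) === A
```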

I don’t know if there is an equivalent for DataFrames already implemented though.

mkborregaard · September 8, 2017, 8:30am

Yes, views are implemented for DataFrames. In fact, all internal split-apply-combine methods for DataFrames use views. You can simply do

trainrows = rand(nrow(df)) .< 0.8    # Boolean mask: true for roughly 80% of the rows
traindf = @view df[trainrows, :]     # recent DataFrames.jl versions need a column index too
testdf = view(df, .!trainrows, :)

The macro and function forms for specifying views are equivalent.
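
The same idea extends to the k-fold case from the original post. The following is only a sketch, assuming a recent DataFrames.jl; the toy data, the `folds` vector, and the choice of `k` are made up for illustration.

```julia
using DataFrames, Random

df = DataFrame(x = 1:10, y = collect(1.0:10.0))

k = 5
# Assign every row a fold id in 1:k, then shuffle the assignment
folds = shuffle(MersenneTwister(1), [mod1(i, k) for i in 1:nrow(df)])

for fold in 1:k
    test  = view(df, folds .== fold, :)   # rows held out in this fold
    train = view(df, folds .!= fold, :)   # all remaining rows
    # Both are SubDataFrames backed by the same parent, so no rows are copied
    @assert parent(train) === df
    @assert nrow(train) + nrow(test) == nrow(df)
end
```

Each iteration only builds two Boolean masks and two lightweight views; the underlying columns of `df` are never duplicated.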

jeffwong · September 8, 2017, 11:46pm

Wow, Julia is so amazing! I’m looking for a view for ModelMatrix, but I can’t seem to find one.
