Replicating a useful Stata Workflow with DataFramesMeta


#1

One thing that is often necessary when working with survey data is creating new variables by standardizing or capping existing variables. One workflow might look like this:

sysuse auto 
local variables_to_cap price mpg headroom trunk weight 

foreach variable in `variables_to_cap' {
    gen `variable'_std = `variable' // generate a new variable but with a suffix added
    sum `variable'_std, detail
    replace `variable'_std = r(p99) if `variable'_std > r(p99) & !missing(`variable'_std)
}

Above, the suffix _std indicates to the user that we are working with the standardized variable. The benefit of the above code is that you can add the suffix extremely easily and use the exact same loop to generate many standardized variables at once. (Ideally, this would be done inside a function.)

Using DataFramesMeta we can’t do this, because the way to refer to an existing column is with a symbol, while a new column has to be named with a literal keyword:

@transform(df, newCol = :oldCol)           # literal name = symbol: works
@transform(df, Symbol("newCol") = :oldCol) # error
@transform(df, :newCol = :oldCol)          # error

Is there a way to do this using DataFrames? Are there any proposals in the works to add this kind of functionality?

There may have been a thread about this recently but I couldn’t find it.


#2

Something like this works, but you lose the convenience of @transform:

transform(df; [Symbol("newcol") => cos.(df[:oldcol])]...)

You can also use df[Symbol("newcol")] = df[:oldcol] instead of transform (which is in DataFramesMeta, not in DataFrames BTW).
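To make the assignment route concrete, here is a minimal runnable sketch (the toy data and the name "newcol" are invented for illustration; it uses the current `df[!, col]` indexing):

```julia
using DataFrames

# A toy frame standing in for the real data.
df = DataFrame(oldcol = [0.0, 1.0, 2.0])

# The new column's name can be built at runtime, because plain
# indexing accepts any Symbol, computed or literal.
name = Symbol("newcol")
df[!, name] = cos.(df[!, :oldcol])
```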

I’m not sure whether this has been discussed in DataFramesMeta yet, but you could file an issue about it, as it should be possible to support this in @transform in some way (maybe it already exists).


#3

Thanks for the reply.

I think I will open an issue there. Symbol("newcol") is a bit ugly, but that’s fine if it gives us a leaner way of doing things in DataFramesMeta. Hopefully there is a clever workaround that makes this easy.

X-posted at DataFramesMeta here


#4

Dynamically generated names are really hard to deal with. They will break type stability in something like Query and kinda make a mess of macro programming. Take a look at https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html for how dplyr attempts to solve this problem. dplyr also has mutate_all or mutate_each.

If you’re not pressed for performance, I think a better way to do this would be to gather the variables, create a new std column, gather the regular and the std column, then spread.
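A sketch of that long-format route using DataFrames’ stack/unstack (the Julia analogues of gather/spread); the toy data, the 99th-percentile cap, and the value_std name are all illustrative assumptions:

```julia
using DataFrames, Statistics

# Toy wide data standing in for the survey variables.
df = DataFrame(id = 1:4,
               price = [1.0, 2.0, 3.0, 100.0],
               mpg   = [10.0, 20.0, 30.0, 40.0])

# "gather": reshape to long form with (variable, value) columns.
long = stack(df, [:price, :mpg])

# Cap each variable at its own 99th percentile.
capped = combine(groupby(long, :variable)) do g
    q = quantile(g.value, 0.99)
    DataFrame(id = g.id, value_std = min.(g.value, q))
end

# "spread": reshape back to one capped column per variable.
wide = unstack(capped, :id, :variable, :value_std)
```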


#5

The absolute confusion of quo and !! in dplyr is something I would like to avoid. Stata gets around this by basically not having any other objects besides variables for the one dataset you are allowed to work with at a time, which is obviously limiting in other ways.

I think the main issue is the inconsistency between

@transform(df, newCol = f(:oldCol)) 

and

df[:newCol] = f(df[:oldCol]). 

With DataFrames, you could have a vector of symbols [:newVar1, :newVar2, ...] and use that to create a bunch of new variables. There is no equivalent way to generate variable names with DataFramesMeta, so that might be a task that takes precedence over any suffix commands.


#6

How about this? It parallels the Stata code closely.


using DataFrames, Statistics  # quantile lives in Statistics

variables_to_cap = [:price, :mpg, :headroom, :trunk, :weight]

for variable in variables_to_cap
    newvar = Symbol(variable, "_std")                 # build the suffixed name
    q = quantile(skipmissing(df[!, variable]), 0.99)  # 99th percentile, ignoring missings
    df[!, newvar] = [!ismissing(x) && x > q ? q : x for x in df[!, variable]]
end


#7

I like this and it works, but it would be cool to do it all inside the DataFramesMeta ecosystem. I should have clarified that in my title and original post.
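For what it’s worth, later DataFramesMeta releases added exactly this: $ escaping lets @transform take a column name computed at runtime. A sketch assuming a recent DataFramesMeta version (the toy data and the capping threshold are invented):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(price = [1.0, 2.0, 100.0])

# Build the suffixed name at runtime and splice it in with $.
newvar = Symbol("price", "_std")
df2 = @transform(df, $newvar = min.(:price, 50.0))
```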