One thing that is necessary when working with survey data is being able create new variables by standardizing capping existing variables variables. One workflow might look like this:
sysuse auto
local variables_to_cap price mpg headroom trunk weight
foreach variable in `variables_to_cap' {
gen `variable'_std = `variable' // generate a new variable but with a suffix added
sum `variable'_std, detail
replace `variable'_std = r(p99) if `variable'_std > r(p99) & !missing(`variable'_std)
}
Above, the suffixe _std indicates to the user that we are working with the standardized variable. The benefit of the above code is that you can add the suffix extremely easily, and use the exact same code to generate many standardized variables at once (Ideally, this would be done using a function.)
Using DataFrames we can’t do this because the way to refer to an existing variable is with a symbol, while the you create a variable
You can also use df[Symbol("newcol")] = df[:oldcol] instead of transform (which is in DataFramesMeta, not in DataFrames BTW).
I’m not sure whether this has been discussed in DataFramesMeta yet, but you could file an issue about it, as it should be possible to support this in @transform in some way (maybe it already exists).
I think I will open an issue there, Symbol("newcol") is a bit ugly, but that’s fine if we have a leaner way of doing things in DataFramesMeta. Hopefully there is a clever workaround that makes this easy.
Dynamically generated names are really hard to deal with. They will break type stability in something like Query and kinda make a mess of macro programming. Take a look at Programming with dplyr for how dplyr attempts to solve this problem. dplyr also has mutate_all or mutate_each.
If you’re not pressed for performance, I think a better way to do this would be to gather the variables, create a new std column, gather the regular and the std column, then spread.
The absolute confusion of quo and !! in dplyr is something I would like to avoid. Stata gets around this by basically not having any other objects besides variables for the one dataset you are allowed to work with at a time, which is obviously limiting in other ways.
I think the main issue is the inconsistency between
@transform(df, newCol = f(:oldCol))
and
df[:newCol] = f(df[:oldCol]).
With DataFrames, you could have a vector of symbols [:newVar1, :newVar2, ...] and use that to create a bunch of new variables. There is no equivalent way to rename variables with DataFramesMeta. So that might be a task that takes precedence over any suffix commands.
How about this? This parallels the stata code closely.
variables_to_cap = [:price, :mpg, :headroom, :trunk, :weight]
for variable in variables_to_cap
newvar = Symbol(string(variable,"_std"))
q = quantile(df[variable],.99)
df[newvar] = [!isna(x) && x > q ? q : x for x in df[variable]]
end
I like this and it works, but it would be cool to do it all inside the DataFramesMeta ecosystem. I should have clarified in my title and original post.