Replicating a useful Stata Workflow with DataFramesMeta


#1

One thing that is often necessary when working with survey data is creating new variables by standardizing or capping existing variables. One workflow might look like this:

sysuse auto 
local variables_to_cap price mpg headroom trunk weight 

foreach variable in `variables_to_cap' {
    gen `variable'_std = `variable' // generate a new variable but with a suffix added
    sum `variable'_std, detail
    replace `variable'_std = r(p99) if `variable'_std > r(p99) & !missing(`variable'_std)
}

Above, the suffix _std indicates to the user that we are working with the standardized variable. The benefit of the above code is that you can add the suffix extremely easily and use the exact same loop to generate many standardized variables at once. (Ideally, this would be done inside a function.)

Using DataFramesMeta we can’t do this, because the way to refer to an existing column is with a symbol, while a new column has to be named with a literal keyword:

@transform(df, newCol = :oldCol)           # literal name = symbol: works
@transform(df, Symbol("newCol") = :oldCol) # error
@transform(df, :newCol = :oldCol)          # error

Is there a way to do this using DataFrames? Are there any proposals in the works to add this kind of functionality?

There may have been a thread about this recently but I couldn’t find it.


#2

Something like this works, but you lose the convenience of @transform:

transform(df; [Symbol("newcol") => cos.(df[:oldcol])]...)

You can also use df[Symbol("newcol")] = df[:oldcol] instead of transform (which is in DataFramesMeta, not in DataFrames BTW).
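To make the assignment route concrete, here is a minimal runnable sketch (the toy data and the name "newcol" are invented for illustration; it uses the current `df[!, col]` indexing):

```julia
using DataFrames

# A toy frame standing in for the real data.
df = DataFrame(oldcol = [0.0, 1.0, 2.0])

# The new column's name can be built at runtime, because plain
# indexing accepts any Symbol, computed or literal.
name = Symbol("newcol")
df[!, name] = cos.(df[!, :oldcol])
```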

I’m not sure whether this has been discussed in DataFramesMeta yet, but you could file an issue about it, as it should be possible to support this in @transform in some way (maybe it already exists).


#3

Thanks for the reply.

I think I will open an issue there. Symbol("newcol") is a bit ugly, but that’s fine if it gives us a leaner way of doing things in DataFramesMeta. Hopefully there is a clever workaround that makes this easy.

X-posted at DataFramesMeta here


#4

Dynamically generated names are really hard to deal with. They will break type stability in something like Query and kinda make a mess of macro programming. Take a look at https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html for how dplyr attempts to solve this problem. dplyr also has mutate_all or mutate_each.

If you’re not pressed for performance, I think a better way to do this would be to gather the variables, create a new std column, gather the regular and the std column, then spread.
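A sketch of that long-format route using DataFrames’ stack/unstack (the Julia analogues of gather/spread); the toy data, the 99th-percentile cap, and the value_std name are all illustrative assumptions:

```julia
using DataFrames, Statistics

# Toy wide data standing in for the survey variables.
df = DataFrame(id = 1:4,
               price = [1.0, 2.0, 3.0, 100.0],
               mpg   = [10.0, 20.0, 30.0, 40.0])

# "gather": reshape to long form with (variable, value) columns.
long = stack(df, [:price, :mpg])

# Cap each variable at its own 99th percentile.
capped = combine(groupby(long, :variable)) do g
    q = quantile(g.value, 0.99)
    DataFrame(id = g.id, value_std = min.(g.value, q))
end

# "spread": reshape back to one capped column per variable.
wide = unstack(capped, :id, :variable, :value_std)
```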


#5

The absolute confusion of quo and !! in dplyr is something I would like to avoid. Stata gets around this by basically not having any other objects besides variables for the one dataset you are allowed to work with at a time, which is obviously limiting in other ways.

I think the main issue is the inconsistency between

@transform(df, newCol = f(:oldCol)) 

and

df[:newCol] = f(df[:oldCol]). 

With DataFrames, you could have a vector of symbols [:newVar1, :newVar2, ...] and use that to create a bunch of new variables. There is no equivalent way to generate variable names with DataFramesMeta, so that might be a task that takes precedence over any suffix commands.


#6

How about this? It parallels the Stata code closely.


using DataFrames, Statistics  # quantile lives in Statistics

variables_to_cap = [:price, :mpg, :headroom, :trunk, :weight]

for variable in variables_to_cap
    newvar = Symbol(variable, "_std")                 # build the suffixed name
    q = quantile(skipmissing(df[!, variable]), 0.99)  # 99th percentile, ignoring missings
    df[!, newvar] = [!ismissing(x) && x > q ? q : x for x in df[!, variable]]
end


#7

I like this and it works, but it would be cool to do it all inside the DataFramesMeta ecosystem. I should have clarified that in my title and original post.
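For what it’s worth, later DataFramesMeta releases added exactly this: $ escaping lets @transform take a column name computed at runtime. A sketch assuming a recent DataFramesMeta version (the toy data and the capping threshold are invented):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(price = [1.0, 2.0, 100.0])

# Build the suffixed name at runtime and splice it in with $.
newvar = Symbol("price", "_std")
df2 = @transform(df, $newvar = min.(:price, 50.0))
```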