Convert symbol to expression

I’m trying to fill in the years missing from a DataFrame, using the average of the adjacent years. For example:

using DataFrames

df = DataFrame(year = [1995, 1996, 1997, 1999, 2001], x = float(1:5))

From that, I am trying to obtain:

DataFrame(year = 1995:2001, x = [1, 2, 3, 3.5, 4, 4.5, 5])

This is what I’ve done:

using StatsBase

function tsfill!(df, t=:year, xfill=:x)
    maxy = maximum(df.t)
    miny = minimum(df.t)
    completeyears = miny:maxy
    yearstofill = filter(x -> x ∉ df.t, completeyears)
    for y in yearstofill
        sub = subset(df, t => yr -> yr .== y - 1 .|| yr .== y + 1)
        append!(df, DataFrame(t = y, xfill = mean(sub[:, xfill])))
    end
    sort!(df, t)
    return df
end

That function errors in the expression DataFrame(t = y, xfill = mean(sub[:, xfill])), since t and xfill are Symbols and DataFrame expects a normal expression (i.e. year = y and x = mean(...)). How can I convert from a Symbol to a “normal expression”?

You don’t have to. Just write DataFrame(t => y): when called with pairs, the constructor accepts a string or Symbol on the left-hand side of each pair.
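
To make that concrete, here is a minimal sketch of how the function from the question might look with pairs instead of keyword arguments. It is not the original poster's final code. It assumes that df.t was also meant to be a lookup by the Symbol stored in t (written df[!, t] here), and it uses Statistics for mean. The local names years, present and mask are illustrative only.

```julia
using DataFrames
using Statistics  # provides mean

function tsfill!(df, t=:year, xfill=:x)
    years = df[!, t]              # look up the column named by the Symbol in t
    present = Set(years)
    # Collect the gaps up front, because df grows inside the loop
    yearstofill = [y for y in minimum(years):maximum(years) if y ∉ present]
    for y in yearstofill
        # Rows for the two adjacent years (rows appended earlier in the loop count too)
        mask = (df[!, t] .== y - 1) .| (df[!, t] .== y + 1)
        # Pairs let the column names come from the Symbols held in t and xfill
        append!(df, DataFrame(t => [y], xfill => [mean(df[mask, xfill])]))
    end
    sort!(df, t)
    return df
end

df = DataFrame(year = [1995, 1996, 1997, 1999, 2001], x = [1.0, 2.0, 3.0, 4.0, 5.0])
tsfill!(df)
```

If a missing year has no adjacent year in the data, the mean is taken over an empty selection and comes out as NaN, so this sketch only suits gaps of the kind shown in the example.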

Is there a package for doing these kinds of “administrative tasks” for DataFrames?

Not that I’m aware of, although I’m not entirely sure what you mean by “administrative task”. What you’re doing here is a specific imputation scheme, and there are packages for that, such as Impute.jl, which has substitute:

https://invenia.github.io/Impute.jl/stable/api/imputation/#Impute.substitute

It also has a k-nearest-neighbour imputation scheme. Either of those might be amenable to what you’re doing here, although I haven’t tried.

I mean something like data management for DataFrames. I find the tools provided in the default package to be too low level, so I usually do any data management in Stata before bringing it to Julia.

Again, it’s very hard to say without a better idea of what you mean by “data management”. I’d say there’s very little that you can do in Stata that you can’t do in DataFrames (mainly manipulations of panel data that leverage Stata’s ability to set an id and time dimension, although I haven’t used Stata seriously in about five years). At the same time, there’s lots of stuff you can do on DataFrames that you would struggle to do in Stata (without resorting to Mata), simply because you have all the power and expressivity of base Julia at your disposal.

If it’s just about DataFrames being more verbose than Stata (because you have to write df[df.col1 .> 1, :] etc.), you might want to look into DataFramesMeta.
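
As a rough illustration of the difference in verbosity, the sketch below filters the same rows three ways: with plain indexing, with the @subset macro from DataFramesMeta, and with its row-wise variant @rsubset. The table and its column names col1 and col2 are made up for the example.

```julia
using DataFrames
using DataFramesMeta

# Hypothetical data, just to have something to filter
df = DataFrame(col1 = [0, 1, 2, 3], col2 = ["a", "b", "c", "d"])

# Plain DataFrames indexing: the table name appears twice
plain = df[df.col1 .> 1, :]

# DataFramesMeta: columns are referred to as Symbols inside the macro
meta = @subset(df, :col1 .> 1)

# Row-wise variant: the condition is written without broadcasting dots
rowwise = @rsubset(df, :col1 > 1)
```

All three return a new DataFrame with the rows where col1 exceeds 1, so the choice between them is mostly about how the code reads.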

Other than that, I don’t think there are packages for what I would consider “data management” (i.e. filtering, transforming, aggregating data) of DataFrames, as DataFrames is the package designed for this already.

So the most useful thing from your perspective is probably to ask here for solutions to specific “data management” tasks which you think can’t be done in DataFrames.jl
