Selectively transform columns from a list of string names? Not sure why this doesn't work

Essentially, I have a very large DataFrame df. In that DataFrame there’s a column called “Total Population”. I want to put the data from df into a dataframe df_per_capita, which divides every the cells in every column by their corresponding row value in “Total Population”.

However, I have a list of column names stored as a vector of strings non_pop_cols. These are the names of columns that would be completely meaningless if divided by population (e.g. values that are already given per capita, or GINI coefficient). I initially tried to do this:

df_per_capita = transform(df, [Not(non_pop_cols), :"Total Population"] => ByRow((x1,x2) -> x1 / x2))

But I got the error

``ERROR: ArgumentError: idxs[1] has type InvertedIndex{Vector{String}}; only Integer, Symbol, or string values allowed when indexing by vector.

It sounds like it wants an iterator here, and I’ve tried foreach(Not(non_pop_cols)...), which doesn’t seem to work. I tried flattening the vector with vcat(Not(non_pop_cols)...), which generates an error “no method matching iterate(::InvertedIndex{Vector{String}})”.

I think (tell me if I’m wrong) the problem is that I’m trying to pass a string vector as a single x1, when in fact it’s an entire vector of strings. …but I’m not 100% sure how to fix that. Relatively new to Julia (started learning ~3 weeks ago).

I’ve now spent an embarrassingly long time reading discourse posts and trying to tweak this to figure out why I can’t get this line to work.

Could anyone help me figure out how to get this to work?

Bear in mind I’m using transform rather than select because, although I don’t want to divide the non_pop_cols in df_per_capita by "Total Population", I do still want them in the DataFrame.

There’s no requirement to use the minilanguage for everything, just:

for c ∈ names(df[!, Not(non_pop_cols)])
    df[!, c] ./= df."Total Population"
end

should do the trick.

You are missing a broadcasting with the .=>. You want something like

[[c, "Total Population"] for c in cols(df, Not(non_pop_cols))] .=> ByRow(/)

Although even when the broadcasting is fixed this [Not(non_pop_cols), :"Total Population"] just wouldn’t work I believe - to me this is a good example of a case where of course it’s possible to do it in the minilanguage (and Bogumil might come by in a bit to show how), but it’s just not worth the hassle and will probably produce pretty unreadable code, whereas a simple loop takes 3 seconds to think up and will be easy to understand if others (or your future self) looks at the code later.

1 Like

Agreed. To plug DataFramesMeta as well

for c in names(df, Not(non_pop_cols))
    @rtransform! df $c = $c / $"Total Population"
end

Hm, I copy pasted this and got an error:
“syntax: invalid assignment location “$c” around /home/user/.julia/packages/DataFramesMeta/hLirN/src/macros.jl:1776”

What’s the syntax error here?

EDIT: Line 1776 is this bit:


macro transform!(x, args...)
    esc(transform!_helper(x, args...))
end

function rtransform!_helper(x, args...)
    x, exprs, outer_flags, kw = get_df_args_kwargs(x, args...; wrap_byrow = true)

    t = (fun_to_vec(ex; gensym_names=false, outer_flags=outer_flags) for ex in exprs)
    quote
        $transform!($x, $(t...); $(kw...))
    end
end

"""
    @rtransform!(x, args...; kwargs...)

Row-wise version of `@transform!`, i.e. all operations use `@byrow` by
default. See [`@transform!`](@ref) for details."""
macro rtransform!(x, args...)  ###   <-  This is the line
    esc(rtransform!_helper(x, args...))  
end

Sorry, I forgot the df. I’ve updated the post above.

To keep things simple you could write it as:

transform(df, Not(non_pop_cols) .=> (x -> x ./ df."Total Population") .=> n -> n * "_normed")

(I added auto-generation of target column name but you could keep old names alternatively)

To do the operations you need for each row, you would have to change the order in which you pass the list of columns to the function. First the :tot column and after all the others this way using slurping you can do the division.
An alternative way is to use AsTable which passes a NamedTuple.

using DataFrames
df=DataFrame(rand(1000:2000,100,3),:auto)

df.tot=rand(3000:3500, 100)
df.y1=rand('a':'z', 100)
df.y2=rand('a':'z', 100)


transform(df, Cols(:tot,Not(Cols(r"y",:tot)))=>ByRow((x, y...)->y./x)=>x->names(df,r"x"))

To do the operations you need for each row, you would have to change the order in which you pass the list of columns to the function. First the :tot column and after all the others this way using slurping you can do the division.
An alternative way is to use AsTable which passes a NamedTuple.

I don’t know how your columns are arranged, but I’ve tried to make the selection as generic as possible. It is an exercise to test these syntaxes :smile:

PS
I don’t know if in the latest version of Julia it is possible to slurp even in positions other than the last one.

If the Cols function guaranteed the order of the resulting union, it could be abbreviated like this

transform(df, Cols(:tot,Not(r"y"))=>ByRow((x, y...)->y./x)=>names(df,r"x"))

To put it more clearly, I suspected that the expression Not(non_pop_cols) also contains the column :tot.
Hence my doubt that Cols(:tot, Not(non_pop_cols)) can keep :tot in the position it has inside Not(non_pop_cols) and not in the first position, as needed by the rest of the expression.
But at this point the question arises whether even selecting the columns with Cols(:tot,Not(Cols(non_pop_cols,:tot))), in the resulting list :tot ends up in a different position from the first, which it is precisely the one that guarantees the sense of the rest of the expression (y,x...)-> y./x

The order is guaranteed, unless you use operator kwarg that does not guarantee it.