Selectively transform columns from a list of string names? Not sure why this doesn't work

Olaf · March 3, 2023, 3:23pm

Essentially, I have a very large DataFrame df. In that DataFrame there’s a column called “Total Population”. I want to put the data from df into a dataframe df_per_capita, which divides every the cells in every column by their corresponding row value in “Total Population”.

However, I have a list of column names stored as a vector of strings non_pop_cols. These are the names of columns that would be completely meaningless if divided by population (e.g. values that are already given per capita, or GINI coefficient). I initially tried to do this:

df_per_capita = transform(df, [Not(non_pop_cols), :"Total Population"] => ByRow((x1,x2) -> x1 / x2))

But I got the error

``ERROR: ArgumentError: idxs[1] has type InvertedIndex{Vector{String}}; only Integer, Symbol, or string values allowed when indexing by vector.

It sounds like it wants an iterator here, and I’ve tried foreach(Not(non_pop_cols)...), which doesn’t seem to work. I tried flattening the vector with vcat(Not(non_pop_cols)...), which generates an error “no method matching iterate(::InvertedIndex{Vector{String}})”.

I think (tell me if I’m wrong) the problem is that I’m trying to pass a string vector as a single x1, when in fact it’s an entire vector of strings. …but I’m not 100% sure how to fix that. Relatively new to Julia (started learning ~3 weeks ago).

I’ve now spent an embarrassingly long time reading discourse posts and trying to tweak this to figure out why I can’t get this line to work.

Could anyone help me figure out how to get this to work?

Bear in mind I’m using transform rather than select because, although I don’t want to divide the non_pop_cols in df_per_capita by "Total Population", I do still want them in the DataFrame.

nilshg · March 3, 2023, 3:44pm

There’s no requirement to use the minilanguage for everything, just:

for c ∈ names(df[!, Not(non_pop_cols)])
    df[!, c] ./= df."Total Population"
end

should do the trick.

pdeffebach · March 3, 2023, 3:51pm

You are missing a broadcasting with the .=>. You want something like

[[c, "Total Population"] for c in cols(df, Not(non_pop_cols))] .=> ByRow(/)

nilshg · March 3, 2023, 3:55pm

Although even when the broadcasting is fixed this [Not(non_pop_cols), :"Total Population"] just wouldn’t work I believe - to me this is a good example of a case where of course it’s possible to do it in the minilanguage (and Bogumil might come by in a bit to show how), but it’s just not worth the hassle and will probably produce pretty unreadable code, whereas a simple loop takes 3 seconds to think up and will be easy to understand if others (or your future self) looks at the code later.

pdeffebach · March 3, 2023, 4:02pm

Agreed. To plug DataFramesMeta as well

for c in names(df, Not(non_pop_cols))
    @rtransform! df $c = $c / $"Total Population"
end

Olaf · March 3, 2023, 4:22pm

Hm, I copy pasted this and got an error:
“syntax: invalid assignment location “$c” around /home/user/.julia/packages/DataFramesMeta/hLirN/src/macros.jl:1776”

What’s the syntax error here?

EDIT: Line 1776 is this bit:


macro transform!(x, args...)
    esc(transform!_helper(x, args...))
end

function rtransform!_helper(x, args...)
    x, exprs, outer_flags, kw = get_df_args_kwargs(x, args...; wrap_byrow = true)

    t = (fun_to_vec(ex; gensym_names=false, outer_flags=outer_flags) for ex in exprs)
    quote
        $transform!($x, $(t...); $(kw...))
    end
end

"""
    @rtransform!(x, args...; kwargs...)

Row-wise version of `@transform!`, i.e. all operations use `@byrow` by
default. See [`@transform!`](@ref) for details."""
macro rtransform!(x, args...)  ###   <-  This is the line
    esc(rtransform!_helper(x, args...))  
end

pdeffebach · March 3, 2023, 4:25pm

Sorry, I forgot the df. I’ve updated the post above.

bkamins · March 3, 2023, 5:23pm

To keep things simple you could write it as:

transform(df, Not(non_pop_cols) .=> (x -> x ./ df."Total Population") .=> n -> n * "_normed")

(I added auto-generation of target column name but you could keep old names alternatively)

rocco_sprmnt21 · March 3, 2023, 6:31pm

To do the operations you need for each row, you would have to change the order in which you pass the list of columns to the function. First the :tot column and after all the others this way using slurping you can do the division.
An alternative way is to use AsTable which passes a NamedTuple.

using DataFrames
df=DataFrame(rand(1000:2000,100,3),:auto)

df.tot=rand(3000:3500, 100)
df.y1=rand('a':'z', 100)
df.y2=rand('a':'z', 100)


transform(df, Cols(:tot,Not(Cols(r"y",:tot)))=>ByRow((x, y...)->y./x)=>x->names(df,r"x"))

To do the operations you need for each row, you would have to change the order in which you pass the list of columns to the function. First the :tot column and after all the others this way using slurping you can do the division.
An alternative way is to use AsTable which passes a NamedTuple.

I don’t know how your columns are arranged, but I’ve tried to make the selection as generic as possible. It is an exercise to test these syntaxes

PS
I don’t know if in the latest version of Julia it is possible to slurp even in positions other than the last one.

rocco_sprmnt21 · March 4, 2023, 9:09am

If the Cols function guaranteed the order of the resulting union, it could be abbreviated like this

transform(df, Cols(:tot,Not(r"y"))=>ByRow((x, y...)->y./x)=>names(df,r"x"))

To put it more clearly, I suspected that the expression Not(non_pop_cols) also contains the column :tot.
Hence my doubt that Cols(:tot, Not(non_pop_cols)) can keep :tot in the position it has inside Not(non_pop_cols) and not in the first position, as needed by the rest of the expression.
But at this point the question arises whether even selecting the columns with Cols(:tot,Not(Cols(non_pop_cols,:tot))), in the resulting list :tot ends up in a different position from the first, which it is precisely the one that guarantees the sense of the rest of the expression (y,x...)-> y./x

bkamins · March 4, 2023, 10:07am

The order is guaranteed, unless you use operator kwarg that does not guarantee it.

Topic		Replies	Views
Conditional selection of columns in for loop New to Julia	4	373	March 16, 2020
Transforming DataFrame from column of vector General Usage dataframes	3	245	December 1, 2022
Recommended way to select multiple DataFrame columns parametrically, i.e. not by the names New to Julia dataframes	9	2368	January 8, 2022
DataFrame vector to columns New to Julia dataframes	2	893	December 29, 2020
Having problems to modifying a DataFrame using a for loop and eachcol New to Julia question	2	335	June 21, 2022

Selectively transform columns from a list of string names? Not sure why this doesn't work

Related topics