Recommended way to select multiple DataFrame columns parametrically, i.e. not by the names

MK-1138 · January 7, 2022, 11:35pm

I’m learning from the wonderful Julia Data Science book. In chapter 4.7 Variable Transformations we have the following DataFrame example:

function grades_2020()
    name = ["Sally", "Bob", "Alice", "Hank"]
    grade_2020 = [1, 5, 8.5, 4]
    DataFrame(; name, grade_2020)
end

A transformation function is provided as:

plus_one(grades) = grades .+ 1

and several examples of the transformations have been provided as well:

1.

transform(grades_2020(), :grade_2020 => plus_one => :grade_2020)

2.

select(grades_2020(), :, :grade_2020 => plus_one => :grade_2020)

3.

df = grades_2020()
df.grade_2020 = plus_one.(df.grade_2020)

As you can see, in all three examples columns/vectors are selected by explicitly naming the column of interest. This is trivial as there is only one Float64 column. But what if I had 50, 100, 1000 columns?

The third example allows me to do something like:

3a.

df = grades_2020_to_2050()
df[!, 2:50] = plus_one.(df[!, 2:50])

However, the book advises against this:

But, although the last example is easier since it builds on more native Julia operations, we strongly advise to use the functions provided by DataFrames.jl in most cases because they are more capable and easier to work with.

Now, I’ve been reading the docs on how to take a DataFrame subset, and I’m trying to apply it to examples 1 and 2 above but I cannot make it work. Typical error being method error, no method matching getindex.

Which method should I use to achieve the selection as per my example 3a., if I’m using DataFrame functions/methods? Also, is there a way to specify random, i.e. non-sequential columns, when doing this? Say, instead of columns 2 to 50, columns 3 to 5 then 7 to 12 then 24 to 37 instead?

Thanks in advance.

pdeffebach · January 8, 2022, 1:07am

The getindex errors in transform look like a bug, paging @bkamins .

julia> df = DataFrame(a = [1, 2]);

julia> transform(df, :a => (identity => 1))
ERROR: ArgumentError: Unrecognized column selector: :a => (identity => 1)

Issue submitted here.

But in general you can do

julia> df = grades_2020();

julia> df[:, 2] = plus_one(df[:, 2])
4-element Vector{Float64}:
 2.0
 6.0
 9.5
 5.0

julia> df
4×2 DataFrame
 Row │ name    grade_2020
     │ String  Float64
─────┼────────────────────
   1 │ Sally          2.0
   2 │ Bob            6.0
   3 │ Alice          9.5
   4 │ Hank           5.0

To get around the current but in transform you can do

julia> df = grades_2020();

julia> transform(df, 2 => plus_one => names(df, 2))
4×2 DataFrame
 Row │ name    grade_2020
     │ String  Float64
─────┼────────────────────
   1 │ Sally          2.0
   2 │ Bob            6.0
   3 │ Alice          9.5
   4 │ Hank           5.0

For multiple columns you can broadcast the transformation

julia> df = DataFrame(rand(10, 10), :auto);

julia> transform(df, 4:10 .=> plus_one .=> names(df, 4:10))
10×10 DataFrame
 Row │ x1         x2        x3         x4       x5       x6       x7       x8  ⋯
     │ Float64    Float64   Float64    Float64  Float64  Float64  Float64  Flo ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 0.701907   0.977639  0.292729   1.88869  1.40715  1.64455  1.13021  1.5 ⋯
   2 │ 0.380664   0.775745  0.589414   1.02956  1.78343  1.40694  1.06833  1.9
   3 │ 0.0284308  0.481767  0.546112   1.20498  1.64492  1.54639  1.97884  1.4
   4 │ 0.318562   0.193388  0.292798   1.39104  1.3055   1.9848   1.5919   1.6
   5 │ 0.782643   0.259634  0.777806   1.37928  1.36063  1.23323  1.03593  1.7 ⋯
   6 │ 0.18432    0.232746  0.0598277  1.74053  1.00608  1.35316  1.86141  1.7
   7 │ 0.0807145  0.744705  0.228719   1.79577  1.44822  1.34114  1.38837  1.7
   8 │ 0.939477   0.129019  0.014056   1.65755  1.90936  1.58259  1.00072  1.2
   9 │ 0.847095   0.517555  0.851813   1.19816  1.79846  1.33998  1.20017  1.8 ⋯
  10 │ 0.640063   0.415859  0.663479   1.00018  1.21963  1.66447  1.291    1.4
                                                               3 columns omitted

MK-1138 · January 8, 2022, 2:12am

Thank you for your reply @pdeffebach.

I keep getting the error:

MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.

Even when trying:

names(df, 4:10)

None of the examples from the Working with column names docs work for me. I need to check what’s going on with my DataFrame package.

pdeffebach · January 8, 2022, 4:25am

You probably have a variable called names = names(df) somewhere.

bkamins · January 8, 2022, 8:45am

using names in such transforms is almost never needed. You can do just:

julia> df = DataFrame(reshape(1:30, 3, 10), :auto)
3×10 DataFrame
 Row │ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────────────
   1 │     1      4      7     10     13     16     19     22     25     28
   2 │     2      5      8     11     14     17     20     23     26     29
   3 │     3      6      9     12     15     18     21     24     27     30

julia> transform(df, 4:10 .=> plus_one, renamecols=false) # just say you do want to keep old column names
3×10 DataFrame
 Row │ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────────────
   1 │     1      4      7     11     14     17     20     23     26     29
   2 │     2      5      8     12     15     18     21     24     27     30
   3 │     3      6      9     13     16     19     22     25     28     31

julia> transform(df, 4:10 .=> plus_one .=> identity) # use identity function as transformation of source column names
3×10 DataFrame
 Row │ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────────────
   1 │     1      4      7     11     14     17     20     23     26     29
   2 │     2      5      8     12     15     18     21     24     27     30
   3 │     3      6      9     13     16     19     22     25     28     31

MK-1138 · January 8, 2022, 3:07pm

This worked like a charm. Thank you so much. Also, thank you for the tip on the identity function, I was about to ask that!

May I ask only one more thing - is there a way to break the grouping of columns 4:10 when passing them as an argument, and select random column ranges such as 4:5, 6:8 and then 2:3? I mean, within the same command?

bkamins · January 8, 2022, 4:12pm

I assume you do not want random values (you could do this also) but these specific values. If yes then write:

[4:5; 6:8; 2:3]

MK-1138 · January 8, 2022, 6:32pm

So it’s the semicolons.

Thank you one more time, I do appreciate it.

pdeffebach · January 8, 2022, 6:34pm

The semicolons here simply mean vcat, whereas if you had written [4:5, 6:8, 2:3] you would have gotten a vector of ranges.

MK-1138 · January 8, 2022, 7:39pm

Thanks for that. I’m reading on Base.vcat now.

This is what I tried, as I expected julia to go through a vector of ranges (or a range of vectors, if that makes sense). From the vcat docs it seems that the approach @bkamins proposed is actually to concatenate arrays along a dimension. That’s interesting. Obviously my understanding of the matter is still very basic and I need to keep working on this. Cheers.

Topic		Replies	Views
Run multiple instances of transform on specific column combinations of a GroupedDataFrame in DataFrames mini language New to Julia question , dataframes	22	702	December 23, 2022
Frustrated using DataFrames New to Julia dataframes , data_structures	97	10570	April 22, 2022
How to correctly create ''slices'' of a column by using indexes of another one, in a DataFrame New to Julia dataframes	7	616	November 17, 2022
Specific columns in dataframe transform General Usage dataframes	6	376	June 2, 2022
Selectively transform columns from a list of string names? Not sure why this doesn't work New to Julia data , dataframes	10	320	March 4, 2023

Recommended way to select multiple DataFrame columns parametrically, i.e. not by the names

Related topics