Recommended way to select multiple DataFrame columns parametrically, i.e. not by the names

I’m learning from the wonderful Julia Data Science book. In chapter 4.7 Variable Transformations we have the following DataFrame example:

function grades_2020()
    name = ["Sally", "Bob", "Alice", "Hank"]
    grade_2020 = [1, 5, 8.5, 4]
    DataFrame(; name, grade_2020)
end

A transformation function is provided as:

plus_one(grades) = grades .+ 1 

and several examples of the transformations have been provided as well:

1.

transform(grades_2020(), :grade_2020 => plus_one => :grade_2020)

2.

select(grades_2020(), :, :grade_2020 => plus_one => :grade_2020)

3.

df = grades_2020()
df.grade_2020 = plus_one.(df.grade_2020)

As you can see, in all three examples columns/vectors are selected by explicitly naming the column of interest. This is trivial as there is only one Float64 column. But what if I had 50, 100, 1000 columns?

The third example allows me to do something like:

3a.

df = grades_2020_to_2050()
df[!, 2:50] = plus_one.(df[!, 2:50])

However, the book advises against this:

But, although the last example is easier since it builds on more native Julia operations, we strongly advise to use the functions provided by DataFrames.jl in most cases because they are more capable and easier to work with.

Now, I’ve been reading the docs on how to take a DataFrame subset, and I’m trying to apply it to examples 1 and 2 above but I cannot make it work. Typical error being method error, no method matching getindex.

Which method should I use to achieve the selection as per my example 3a., if I’m using DataFrame functions/methods? Also, is there a way to specify random, i.e. non-sequential columns, when doing this? Say, instead of columns 2 to 50, columns 3 to 5 then 7 to 12 then 24 to 37 instead?

Thanks in advance.

The getindex errors in transform look like a bug, paging @bkamins .

julia> df = DataFrame(a = [1, 2]);

julia> transform(df, :a => (identity => 1))
ERROR: ArgumentError: Unrecognized column selector: :a => (identity => 1)

Issue submitted here.

But in general you can do

julia> df = grades_2020();

julia> df[:, 2] = plus_one(df[:, 2])
4-element Vector{Float64}:
 2.0
 6.0
 9.5
 5.0

julia> df
4Γ—2 DataFrame
 Row β”‚ name    grade_2020
     β”‚ String  Float64
─────┼────────────────────
   1 β”‚ Sally          2.0
   2 β”‚ Bob            6.0
   3 β”‚ Alice          9.5
   4 β”‚ Hank           5.0

To get around the current but in transform you can do

julia> df = grades_2020();

julia> transform(df, 2 => plus_one => names(df, 2))
4Γ—2 DataFrame
 Row β”‚ name    grade_2020
     β”‚ String  Float64
─────┼────────────────────
   1 β”‚ Sally          2.0
   2 β”‚ Bob            6.0
   3 β”‚ Alice          9.5
   4 β”‚ Hank           5.0

For multiple columns you can broadcast the transformation

julia> df = DataFrame(rand(10, 10), :auto);

julia> transform(df, 4:10 .=> plus_one .=> names(df, 4:10))
10Γ—10 DataFrame
 Row β”‚ x1         x2        x3         x4       x5       x6       x7       x8  β‹―
     β”‚ Float64    Float64   Float64    Float64  Float64  Float64  Float64  Flo β‹―
─────┼──────────────────────────────────────────────────────────────────────────
   1 β”‚ 0.701907   0.977639  0.292729   1.88869  1.40715  1.64455  1.13021  1.5 β‹―
   2 β”‚ 0.380664   0.775745  0.589414   1.02956  1.78343  1.40694  1.06833  1.9
   3 β”‚ 0.0284308  0.481767  0.546112   1.20498  1.64492  1.54639  1.97884  1.4
   4 β”‚ 0.318562   0.193388  0.292798   1.39104  1.3055   1.9848   1.5919   1.6
   5 β”‚ 0.782643   0.259634  0.777806   1.37928  1.36063  1.23323  1.03593  1.7 β‹―
   6 β”‚ 0.18432    0.232746  0.0598277  1.74053  1.00608  1.35316  1.86141  1.7
   7 β”‚ 0.0807145  0.744705  0.228719   1.79577  1.44822  1.34114  1.38837  1.7
   8 β”‚ 0.939477   0.129019  0.014056   1.65755  1.90936  1.58259  1.00072  1.2
   9 β”‚ 0.847095   0.517555  0.851813   1.19816  1.79846  1.33998  1.20017  1.8 β‹―
  10 β”‚ 0.640063   0.415859  0.663479   1.00018  1.21963  1.66447  1.291    1.4
                                                               3 columns omitted
2 Likes

Thank you for your reply @pdeffebach.

I keep getting the error:

MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.

Even when trying:

names(df, 4:10) 

None of the examples from the Working with column names docs work for me. I need to check what’s going on with my DataFrame package.

You probably have a variable called names = names(df) somewhere.

using names in such transforms is almost never needed. You can do just:

julia> df = DataFrame(reshape(1:30, 3, 10), :auto)
3Γ—10 DataFrame
 Row β”‚ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
     β”‚ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────────────
   1 β”‚     1      4      7     10     13     16     19     22     25     28
   2 β”‚     2      5      8     11     14     17     20     23     26     29
   3 β”‚     3      6      9     12     15     18     21     24     27     30

julia> transform(df, 4:10 .=> plus_one, renamecols=false) # just say you do want to keep old column names
3Γ—10 DataFrame
 Row β”‚ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
     β”‚ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────────────
   1 β”‚     1      4      7     11     14     17     20     23     26     29
   2 β”‚     2      5      8     12     15     18     21     24     27     30
   3 β”‚     3      6      9     13     16     19     22     25     28     31

julia> transform(df, 4:10 .=> plus_one .=> identity) # use identity function as transformation of source column names
3Γ—10 DataFrame
 Row β”‚ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
     β”‚ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────────────
   1 β”‚     1      4      7     11     14     17     20     23     26     29
   2 β”‚     2      5      8     12     15     18     21     24     27     30
   3 β”‚     3      6      9     13     16     19     22     25     28     31
4 Likes

This worked like a charm. Thank you so much. Also, thank you for the tip on the identity function, I was about to ask that!

May I ask only one more thing - is there a way to break the grouping of columns 4:10 when passing them as an argument, and select random column ranges such as 4:5, 6:8 and then 2:3? I mean, within the same command?

I assume you do not want random values (you could do this also) but these specific values. If yes then write:

[4:5; 6:8; 2:3]
1 Like

So it’s the semicolons.

Thank you one more time, I do appreciate it.

The semicolons here simply mean vcat, whereas if you had written [4:5, 6:8, 2:3] you would have gotten a vector of ranges.

1 Like

Thanks for that. I’m reading on Base.vcat now.

This is what I tried, as I expected julia to go through a vector of ranges (or a range of vectors, if that makes sense). From the vcat docs it seems that the approach @bkamins proposed is actually to concatenate arrays along a dimension. That’s interesting. Obviously my understanding of the matter is still very basic and I need to keep working on this. Cheers.