Iterate over all columns in a DataFrame

Hi. Given my poor knowledge of DataFrames, I am stuck with a stupid little problem. I have a sizeable time-series data frame involving variables expressed in millions/billions of dollars. I want to transform each column such that instead of dollars, I will have an index: the first observation across all columns will be equal to 100. Doing it to one particular column is extremely easy; the problem is to generalize it to the entire set of columns. I did my homework the best I could, and I had a look at the most recent Julia-DataFrames-Tutorial by Bogumił Kamiński but had no success. Help will be very much appreciated.
An MWE is like this:

df = DataFrame(W = [20, 22, 19, 21, 23, 25, 28, 30], 
               X = [210, 220, 230, 240, 250, 245, 260, 270],
               Z = [400, 320, 419, 426, 430, 440, 435, 444])

A for loop through the X column will be:

t = 8;
X1 = df[1,2]*100/df[1,2] # Fixing the initial value to 100
Xn = length(X1)
Xt = [X1 zeros(Xn, t-1)]; 

for i = 1:t-1
    Xt[i+1] = df[:,2][i+1] *100/ df[1,2];
end
Xt

And my Index of X will come out as:

1×8 Matrix{Float64}:
 100.0  104.762  109.524  114.286  119.048  116.667  123.81  128.571

But what is the trick I should use to apply the loop to all columns (which are not just 3 like this MWE)? This problem looks like a simple one (that is so in MATLAB). I tried many variations but always got stuck with some error.
Thanks.

1 Like

assume you only need Float64 calculation

function change_data(x)
     x = convert(Vector{Float64}, x)
     f_v = x[1]
     x .* 100 ./ f_v
 end

mapcols!(change_data, df)

Easy to read but not in place:

julia> DataFrame([c ./ first(c) for c ∈ eachcol(df)], names(df))

an in place loop could be written as

julia> for c ∈ eachcol(df)
           c ./= first(c)
       end

but in this case will fail as you have Int columns, but the division will produce Float64s - not a problem if your original data is Float64 anyway, otherwise you need to convert like in the answer above.

2 Likes

@DataFrames and @nilshg thanks a lot.

Both approaches work to perfection: simple, clear, and fast. I could choose any one of them as a solution. The first was chosen just because it came out first.