How to store data from a nested for loop?

Hello!

I have a nested for loop, which should result in a vector of vectors. Unfortunately, I’m only getting the end result, rather than each iteration. I’m probably missing something obvious about nested for loops; minimum (not actually) working example below.

using DataFrames
# Replicate Minimum Working Data
#generate DataFrame names
dfnames = ["a", "b", "c", "d", "e", "f", "g", "h", "max", "years"]
#generate column Years data
years = collect(1:10)
#generate Vector DataFrames
dfvector = []
for i in years
    dfvector = push!(dfvector, DataFrame(hcat(rand(10,9), years), dfnames))
end

#Identify the maximum value for the first dataframe
maxvaluesyr1 = zeros(length(years))
for i in years
    maxvaluesyr1[i] = maximum(dfvector[1][dfvector[1].years .== i,:].max)
end

#Identify the maximum value for the second dataframe
maxvaluesyr2 = zeros(length(years))
for i in years
    maxvaluesyr2[i] = maximum(dfvector[2][dfvector[2].years .== i,:].max)
end

The above is what I would like, but I need to loop the “for i in years” for loop over each dataframe. I tried the below example, but I kept receiving the last j loop and not the first 9. Any help is incredibly appreciated and just let me know if I can clarify further!

# Attempt 1: push! once per j, after the inner loop
maxvalues = zeros(length(years))
maxvaluesvector = []
for j=1:length(dfvector)
    for i in years
        maxvalues[i] = maximum(dfvector[j][dfvector[j].years .== i,:].max)
    end
    maxvaluesvector = push!(maxvaluesvector, maxvalues)
end

# Attempt 2: push! inside the inner loop
maxvalues = zeros(length(years))
maxvaluesvector = []
for j=1:length(dfvector)
    for i in years
        maxvalues[i] = maximum(dfvector[j][dfvector[j].years .== i,:].max)
        maxvaluesvector = push!(maxvaluesvector, maxvalues)
    end
end

# Attempt 3: preallocate with repeat, assign inside the inner loop
maxvalues = zeros(length(years))
maxvaluesvector = repeat([maxvalues], length(dfvector))
for j=1:length(dfvector)
    for i in years
        maxvalues[i] = maximum(dfvector[j][dfvector[j].years .== i,:].max)
        maxvaluesvector[j] = maxvalues
    end
end

# Attempt 4: preallocate with repeat, assign after the inner loop
maxvalues = zeros(length(years))
maxvaluesvector = repeat([maxvalues], length(dfvector))
for j=1:length(dfvector)
    for i in years
        maxvalues[i] = maximum(dfvector[j][dfvector[j].years .== i,:].max)
    end
    maxvaluesvector[j] = maxvalues
end

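Let’s take the first attempt, the one that calls push! once per j after the inner loop has finished.

It allocates a single maxvalues vector, once, before the loops start. For each value of j, the inner loop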

for i in years
    maxvalues[i] = ...
end

overwrites the values in this vector. Then for each j the line

maxvaluesvector = push!(maxvaluesvector, maxvalues)

appends this vector maxvalues to maxvaluesvector. (By the way, the assignment is not necessary: push! modifies its first argument in place, so you can write push!(v, ...) instead of v = push!(v, ...).)

The problem is that this always appends the same vector (the same container). You need to allocate a new vector for each j.
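
A minimal sketch of that fix, under some assumptions: `collect_per_table` is a hypothetical helper, and `tables` is a made-up stand-in for `dfvector` that holds plain vectors (one value per year) instead of DataFrames, so the snippet runs without any packages. The only change that matters is that `zeros(...)` is called inside the outer loop.

```julia
# Hypothetical helper: `tables` stands in for `dfvector`; each entry is a
# plain vector with one value per year rather than a DataFrame.
function collect_per_table(tables, years)
    results = Vector{Vector{Float64}}()
    for j in eachindex(tables)
        maxvalues = zeros(length(years))  # fresh vector for every j
        for i in years
            maxvalues[i] = tables[j][i]   # placeholder for the per-year maximum
        end
        push!(results, maxvalues)         # no `results = ...` needed
    end
    return results
end

tables = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
results = collect_per_table(tables, 1:3)
```

Because each `maxvalues` is a separate object, later iterations can no longer overwrite what was stored earlier.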

This is probably the issue. push! is pushing the same vector to maxvaluesvector every time, so every time you edit it you’re editing all the entries simultaneously. Here’s a simple example:

julia> results = []
Any[]

julia> x = [0]
1-element Vector{Int64}:
 0

julia> for i in 1:3
         x[1] = i
         push!(results, x)  # Pushing the *same* vector every time
       end

julia> results
3-element Vector{Any}:
 [3]
 [3]
 [3]

If you want your results to be different vectors, then you need to make that explicit. One easy way in this case is to copy when you push!:

julia> results = []
Any[]

julia> for i in 1:3
         x[1] = i
         push!(results, copy(x))
       end

julia> results
3-element Vector{Any}:
 [1]
 [2]
 [3]

Edit: Yup, what @sijo said 🙂

Thanks @sijo, and @rdeits. I understand overwriting the initial vector is the expected behaviour and I expected that too. I didn’t expect it wouldn’t append the vector for each iteration of the outer loop, which could be thought of as each total inner loop.

I thought nested for loops followed logic like:

  1. Iterate over each inner loop value [i] and store the results e.g. in a vector for the first outer loop value [j]
  2. push! would store that first total inner loop given the first outer loop value [j]
  3. The outer loop value [j] changes to [j] + 1 for example and the inner loop repeats and overwrites the initial storage (i.e. the first vector).
  4. push! then appends the overwritten vector to the outer loop vector, which becomes a vector of vectors.

It’s clear the logic is false, but can you help me understand where? If my thoughts aren’t clear above, just let me know how I can clarify!

The problem is: the overwritten vector that you are saving is always the same object, not a new object. You have two solutions:

  1. Make the line maxvalues = zeros(length(years)) the first line of the outer loop, so a new object is created each time.
  2. Change the last line of the outer loop to push!(maxvaluesvector, copy(maxvalues)) so you save a copy of the vector.

If you do neither, all positions of maxvaluesvector refer to the same object, which keeps being changed until the last iteration. You can check this by making a new change (setting the first element to zero, for example) to one of the vectors inside maxvaluesvector: the change shows up in all inner vectors instead of just that position.
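
That check can be sketched roughly as follows, using a small made-up buffer rather than the DataFrame data (all names here are illustrative). `===` tests whether two values are the very same object, and the second loop shows the `copy` variant for comparison.

```julia
# Without copy: every push! stores a reference to the same buffer.
buffer = zeros(3)
aliased = []
for j in 1:2
    buffer .= j                  # overwrite the buffer in place
    push!(aliased, buffer)
end
same_object = aliased[1] === aliased[2]   # true: one object, stored twice
aliased[1][1] = 0.0                       # change "one" entry...
leaked = aliased[2][1] == 0.0             # ...and the other entry changed too

# With copy: every push! stores an independent snapshot.
buffer = zeros(3)
copied = []
for j in 1:2
    buffer .= j
    push!(copied, copy(buffer))
end
copied[1][1] = 0.0                        # only the first snapshot changes
```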

Maybe it helps to look at a simpler case:

julia> v = [1,2];

julia> v_vector = [v, v, v]
3-element Vector{Vector{Int64}}:
 [1, 2]
 [1, 2]
 [1, 2]

julia> v[1] = 4;

julia> v_vector
3-element Vector{Vector{Int64}}:
 [4, 2]
 [4, 2]
 [4, 2]

Here the line v_vector = [v, v, v] is equivalent to

v_vector = typeof(v)[]   # Empty vector of values with type like `v`
push!(v_vector, v)
push!(v_vector, v)
push!(v_vector, v)

In both cases v_vector contains three references to the same object.
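
The same sharing applies to the attempts in the question that preallocate with `repeat([maxvalues], n)`: every slot refers to the one original vector, and `fill` behaves the same way. A rough sketch of the difference, with illustrative names; a comprehension evaluates `zeros(2)` once per element and so yields independent vectors.

```julia
v = zeros(2)
shared_slots = repeat([v], 3)            # three references to `v`
filled = fill(v, 3)                      # also three references to `v`
independent = [zeros(2) for _ in 1:3]    # three separate vectors

shared_slots[1][1] = 7.0                 # visible through every slot and through `v`
independent[1][1] = 7.0                  # affects only the first vector
```

If preallocation is wanted, the comprehension form is one way to get a distinct inner vector per slot.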
