Accumulating mixed data and then accessing it

GlenHenshaw · July 31, 2018, 1:36pm

I have code that iterates over a long, indeterminate period of time, and at the end of each iteration produces a mixed set of data. I would like to efficiently store all of this data and then do something useful with it (like plot it) at the end.

Right now I’m attempting to do something like the following:

foo = []
for i=1:infinity
    a = f1(i) # This happens to be a Float64
    b = f2(i) # This happens to be a vector [x, y, z] of Float64's
    c = f3(i) # This happens to be an Int32
    push!(foo, [a, b, c])
end

When done, foo holds an array of vectors, and I can’t figure out how to do anything useful with it. I can, of course, access the first vector, the second vector, etc., simply by indexing as foo[1], foo[2], etc. But what I actually need to do is pull out all of the “a” values as a one-dimensional vector, all of the “b” values as a 3xN matrix, etc.

The obvious first thing to try is just indexing the foo array, i.e.:

a = foo[:,1]
b = foo[:,2]
c = foo[:,3]

but because foo is not, in fact, a multidimensional array, the “:” index doesn’t look into the individual vector elements of the array, so foo[:,1] just gives me the original foo array back (and foo[:,2] throws a BoundsError).

A list comprehension seemed like it might do the trick:

(a, b, c) = [(bar[1], bar[2], bar[3]) for bar in foo]

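but this doesn’t split anything into columns: the comprehension builds one tuple per row, and the destructuring just assigns the first three of those tuples to a, b, and c.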

I can make it work by doing one list comprehension per column, i.e.

a = [bar[1] for bar in foo]

does in fact pull out the first column that I want. But then I have to write a separate list comprehension for every column in the array, which seems inefficient, not to mention really kludgy.

Is there an idiomatic way of doing this that I’m just missing? This seems like it should be straightforward, but it isn’t.

baggepinnen · July 31, 2018, 1:43pm

Check out https://github.com/baggepinnen/DeepGetfield.jl/
It was created to solve that problem for myself. The README should have some example use cases.

platawiec · July 31, 2018, 1:58pm

If I’m just prototyping a quick script, I push! to foo_a, foo_b, and foo_c independently. That way I have an explicit call to foo_a = Float64[] and my arrays aren’t filled with Any.
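A minimal sketch of that first approach, for anyone who wants to see it spelled out. The f1, f2, and f3 definitions below are made-up stand-ins for the functions in the question, and the fixed 5 iterations stand in for the open-ended loop. The reduce(hcat, ...) call at the end is one way to get the 3xN matrix the question asks for.

```julia
# Hypothetical stand-ins for the f1, f2, f3 of the original question.
f1(i) = sqrt(i)                          # returns a Float64
f2(i) = [1.0 * i, 2.0 * i, 3.0 * i]      # returns a 3-element Vector{Float64}
f3(i) = Int32(i)                         # returns an Int32

# One concretely typed vector per quantity, so nothing ends up as Any.
foo_a = Float64[]
foo_b = Vector{Float64}[]
foo_c = Int32[]

for i in 1:5                             # 5 iterations stand in for the real loop
    push!(foo_a, f1(i))
    push!(foo_b, f2(i))
    push!(foo_c, f3(i))
end

# Stack the [x, y, z] vectors side by side into a 3xN matrix.
B = reduce(hcat, foo_b)
```

Here foo_a and foo_c are already the one-dimensional vectors you want, and B[:, k] is the [x, y, z] vector from iteration k.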

The other way I solve this is with a DataFrame. Presumably each iteration i is an “experiment” whose results you record in the columns. Your code becomes:

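using DataFrames

# Typed empty columns, so the DataFrame isn't filled with Any either
foo = DataFrame(a = Float64[], b = Vector{Float64}[], c = Int32[])
for i=1:infinity
    a = f1(i) # This happens to be a Float64
    b = f2(i) # This happens to be a vector [x, y, z] of Float64's
    c = f3(i) # This happens to be an Int32
    push!(foo, [a, b, c]) # appends one row
end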

And you can access them via foo[:a] (in current DataFrames releases, foo.a or foo[!, :a]).
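As a rough sketch of pulling the columns back out afterwards, assuming a recent DataFrames release and typed columns: f1, f2, and f3 here are invented placeholders for the question's functions, and the loop is cut to 4 iterations so it terminates.

```julia
using DataFrames

# Hypothetical stand-ins for f1, f2, f3 from the question.
f1(i) = i / 2
f2(i) = [1.0 * i, 2.0 * i, 3.0 * i]
f3(i) = Int32(i)

foo = DataFrame(a = Float64[], b = Vector{Float64}[], c = Int32[])
for i in 1:4
    push!(foo, (f1(i), f2(i), f3(i)))   # one row per iteration
end

a = foo.a                    # Vector{Float64} of length N
b = reduce(hcat, foo.b)      # 3xN Matrix{Float64}, one column per iteration
c = foo.c                    # Vector{Int32}
```

The b column stores one vector per row, so it still needs the hcat step to become a matrix; a and c come out as plain typed vectors.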