Quick Longitudinal Access to arrays


#1

I have an array Xvals, of type Array{Array{Float64,1},1} which corresponds to a time series for a system in dimension d. I’d like to quickly look at the different components, from a longitudinal point of view; in other wrods, I want to plot (n, x_i(n)) for n=1,\ldots, and i\in\{1,2,\ldots,d\}. A “cheap” way to do this is to just create a new array:

X1=[X[1] for X in Xvals]

which can then be examined, but I was wondering if there were some more clever way to handle this, without creating new data structures.


#2

It is always better to use a Matrix than a Vector{Vector{T}} for performance reasons. It looks like it would be useful for what you want to do as well, for example, you could do

X[1,:]

#3

Matrix assumes the number of elements in each dimension is shared across groups which needs not be the case.


#4

Matrix performs better, generically? How so?


#5

(1) Tidy the data in long-form (each row is one observation and each column is a variable)

using DataFrames
srand(0)
df = DataFrame(x = repeat(1:10, inner = 10), t = repeat(1:10, outer = 10), y = rand(100))
head(df)

(2) Generate the indicators for each category

X = map(elem -> find(elem .== df[:x]), unique(df[:x]))
T = map(elem -> find(elem .== df[:t]), unique(df[:t]))

(3) Apply whatever routine you want using the grouping

map(unit -> mean(df[:y][unit]), X)
map(period -> mean(df[:y][period]), T)

#6

True, but my problems are physics based, so I really do have the same number of coordinates at every time slice.


#7

Basically a Vector{Vector{T}} is a pointer to a bunch of pointers (one for each inner Vector). The actual data contained by the inner Vectors are therefore not guaranteed to be stored contiguously or even in any sort of regular pattern. A Matrix is much simpler. While Julia stores data column-wise so that elements in adjacent columns are separated, this separation is guaranteed to be regular and the entire Array is contiguous. This makes for much more efficient access of the underlying data.

Since you are doign physics I suggest that you may be far happier with a simple Matrix than a dataframe or anything else like that.


#8

If you have a balanced panel and choose the Matrix representation depending on the function, you could do something like.

srand(0)
obj = rand(20,10)
mean(obj, 1)

or

using StatsBase
compute_mean_over_std(obj::Vector{T}) where T <: Real =
   reduce(/, mean_and_std(obj))
mapslices(compute_mean_over_std, obj, 1)[:]

If you need something more flexible than mapslices,

for col ∈ 1:size(obj, 2)
    println(mean(slicedim(obj, 2, col)))
end

#9

That’s true in theory. In practice, you have to work with views, which are not 100% optimized yet, as far as I know.

In any case if you care about performance, an array of StaticArrays is probably your best bet for fast manipulations (same memory layout as a matrix, but X[i] [j] indexing), so I wouldn’t rush to change the structure of your code just yet.


#10

I might be wrong, but I don’t think this is true in general. If you operate over rows, for example, of a matrix, you will see a big penalty using a Matrix instead of Vector{Vector{T}}. In the latter case you will be accessing each inner vector in the natural memory layout. See the following example:

function test_mat_vec()
	vv = [randn(5000) for i=1:5000]
	mm = Array{Float64}(uninitialzed, 5000,5000)

	for j = 1:5000, i = 1:5000
		mm[i,j] = vv[i][j]
	end 

	s0, s1 = 0.0, 0.0
	t0 = @elapsed for i = 1:5000
		@views s0 += sum(mm[i,:])
	end
	t1 = @elapsed for i = 1:5000
		s1 += sum(vv[i])
	end

	println("Matrix: Res = $s0, time = $t0")
	println("Vector: Res = $s1, time = $t1")
end

test_mat_vec()

Matrix: Res = -701.7163654267092, time = 0.247252478
Vector: Res = -701.7163654267131, time = 0.012297937

#11

This is misleading: the point is to compare A[i,j] with A[j][i], you’re comparing A[i,j] with A[i][j] (so that the “fast” index is the same in both cases). Moreover, your example makes a whole copy of mm[i,:]: a better comparison is using a view (which are bad in 0.6 but are supposed to be good in 0.7)


#12

100%: if you’re working with regular 2d data you want a matrix with the orientation (data in rows or columns) chosen carefully so that it matches your operations well.


#13

Seconds for a Matrix in terms of sanity and clarity. It also depends quite a bit on the expected input format for functions you want to use.

Have you looked at the Julia implementations of time series, TimeSeries.jl and its TimeArray , or Temporal.jl and its ts ?


#14

Yes, but if I understand this correctly, in the Vector{Vector{T}} case a priori you have no guarantees about the memory layout and pretty much have to “get lucky”. In your example, your Vector{Vector{T}} was faster because you cleanly initialized it in such a way that you wound up with adjacent data and you separated your sum in the “wrong” way for the matrix. If you deal with matrices at least you know what you are going to get. If you had done sum(mm[:,i]) instead, it would be guaranteed to be at least as fast as the Vector{Vector{T}} case.


#15

I see in the docs and code that DenseArrays are stored column-major. Is there a different constructor or Type that stores row-major?

Addition: Looking at related issues 5932 and 4774 give me a deep appreciation for the stuff the founders had to slog through…


#16

You are right, they are essentially the same, but still Vector{Vector{T}} has an advantage when speaking about allocations, plus a cleaner syntax (mm[:,i] vs. vv[i]). A more precise example:

using BenchmarkTools

function test_mat(mm)
	s0 = 0.0
	@inbounds for i = 1:5000
		@views s0 += sum(mm[:,i])
	end
	s0
end
function test_vec(vv)
	s1 = 0.0
	@inbounds for i = 1:5000
		s1 += sum(vv[i])
	end
	s1 
end
const vv = [randn(5000) for i=1:5000]
const mm = Array{Float64}(5000,5000)
for j = 1:5000, i = 1:5000
	mm[i,j] = vv[i][j]
end 
@btime s0 = test_mat(mm)
@btime s1 = test_vec(vv)

  12.438 ms (5000 allocations: 234.38 KiB)
  12.347 ms (0 allocations: 0 bytes)

#17

Yes, that is the allocation of the view object. On 0.7 it should go away.


#18

I should add that part of the reason I coded this in terms of Vector{Vector{T}} was so that I could use the . operation on functions which would take in Vector{T} valued input. For instance, in my original example, if I want the time series of the norms at each time slice, I can use norm.(Xvals). When I got started in Julia, there didn’t seem to be as obvious a way to perform such a vectorized operation were the data to be Matrix formatted.


#19

Many physics based problems are unsuitable for Matrix organization of data. Consider:

  • logging sensor data at different time intervals (e.g, temperature every 10 minutes, pressure every minute)
  • regular vs irregular time intervals (e.g, temperature at regular interval, lab analysis results at irregular interval)
  • different data types (e.g, scalar temperature, fixed size raw image or variable size jpg image)
  • data with varying size over time (e.g, regular interval of IoT temperature data, where new sensors are added or removed each time a system is sampled)

What is an efficient way to store and retrieve such data? (Array? Dictionary? DataFrame? Database?.. depending on amount of data, etc?)


#20

Yes, if this isn’t some dedicated algorithm and the rows (or columns) of matrix represent objects in their own right, what I’d do is a Vector of Julia structs, which may in turn contain StaticArrays. You may sacrifice a little bit of performance this way but one of the best things about Julia is that your custom types will have performance comparable to built in types. (This has been my approach when doing physics in Julia. Even though I use them every day in data science, I don’t find DataFrames useful for physics, in fact I might go as far as to say that they are an artifact of how badly data is usually organized in private industry.)