Quick Longitudinal Access to arrays

gideonsimpson · February 23, 2018, 2:50pm

I have an array Xvals, of type Array{Array{Float64,1},1} which corresponds to a time series for a system in dimension d. I’d like to quickly look at the different components, from a longitudinal point of view; in other wrods, I want to plot (n, x_i(n)) for n=1,\ldots, and i\in\{1,2,\ldots,d\}. A “cheap” way to do this is to just create a new array:

X1=[X[1] for X in Xvals]

which can then be examined, but I was wondering if there were some more clever way to handle this, without creating new data structures.

ExpandingMan · February 23, 2018, 2:53pm

It is always better to use a Matrix than a Vector{Vector{T}} for performance reasons. It looks like it would be useful for what you want to do as well, for example, you could do

X[1,:]

Nosferican · February 23, 2018, 3:26pm

Matrix assumes the number of elements in each dimension is shared across groups which needs not be the case.

gideonsimpson · February 23, 2018, 3:36pm

Matrix performs better, generically? How so?

Nosferican · February 23, 2018, 3:37pm

(1) Tidy the data in long-form (each row is one observation and each column is a variable)

using DataFrames
srand(0)
df = DataFrame(x = repeat(1:10, inner = 10), t = repeat(1:10, outer = 10), y = rand(100))
head(df)

(2) Generate the indicators for each category

X = map(elem -> find(elem .== df[:x]), unique(df[:x]))
T = map(elem -> find(elem .== df[:t]), unique(df[:t]))

(3) Apply whatever routine you want using the grouping

map(unit -> mean(df[:y][unit]), X)
map(period -> mean(df[:y][period]), T)

gideonsimpson · February 23, 2018, 3:39pm

True, but my problems are physics based, so I really do have the same number of coordinates at every time slice.

ExpandingMan · February 23, 2018, 3:51pm

Basically a Vector{Vector{T}} is a pointer to a bunch of pointers (one for each inner Vector). The actual data contained by the inner Vectors are therefore not guaranteed to be stored contiguously or even in any sort of regular pattern. A Matrix is much simpler. While Julia stores data column-wise so that elements in adjacent columns are separated, this separation is guaranteed to be regular and the entire Array is contiguous. This makes for much more efficient access of the underlying data.

Since you are doign physics I suggest that you may be far happier with a simple Matrix than a dataframe or anything else like that.

Nosferican · February 23, 2018, 3:56pm

If you have a balanced panel and choose the Matrix representation depending on the function, you could do something like.

srand(0)
obj = rand(20,10)
mean(obj, 1)

or

using StatsBase
compute_mean_over_std(obj::Vector{T}) where T <: Real =
   reduce(/, mean_and_std(obj))
mapslices(compute_mean_over_std, obj, 1)[:]

If you need something more flexible than mapslices,

for col ∈ 1:size(obj, 2)
    println(mean(slicedim(obj, 2, col)))
end

antoine-levitt · February 23, 2018, 4:12pm

That’s true in theory. In practice, you have to work with views, which are not 100% optimized yet, as far as I know.

In any case if you care about performance, an array of StaticArrays is probably your best bet for fast manipulations (same memory layout as a matrix, but X[i] [j] indexing), so I wouldn’t rush to change the structure of your code just yet.

Seif_Shebl · February 23, 2018, 7:57pm

I might be wrong, but I don’t think this is true in general. If you operate over rows, for example, of a matrix, you will see a big penalty using a Matrix instead of Vector{Vector{T}}. In the latter case you will be accessing each inner vector in the natural memory layout. See the following example:

function test_mat_vec()
	vv = [randn(5000) for i=1:5000]
	mm = Array{Float64}(uninitialzed, 5000,5000)

	for j = 1:5000, i = 1:5000
		mm[i,j] = vv[i][j]
	end 

	s0, s1 = 0.0, 0.0
	t0 = @elapsed for i = 1:5000
		@views s0 += sum(mm[i,:])
	end
	t1 = @elapsed for i = 1:5000
		s1 += sum(vv[i])
	end

	println("Matrix: Res = $s0, time = $t0")
	println("Vector: Res = $s1, time = $t1")
end

test_mat_vec()

Matrix: Res = -701.7163654267092, time = 0.247252478
Vector: Res = -701.7163654267131, time = 0.012297937

antoine-levitt · February 23, 2018, 8:05pm

This is misleading: the point is to compare A[i,j] with A[j][i], you’re comparing A[i,j] with A[i][j] (so that the “fast” index is the same in both cases). Moreover, your example makes a whole copy of mm[i,:]: a better comparison is using a view (which are bad in 0.6 but are supposed to be good in 0.7)

StefanKarpinski · February 23, 2018, 8:21pm

100%: if you’re working with regular 2d data you want a matrix with the orientation (data in rows or columns) chosen carefully so that it matches your operations well.

pasha · February 23, 2018, 8:21pm

Seconds for a Matrix in terms of sanity and clarity. It also depends quite a bit on the expected input format for functions you want to use.

Have you looked at the Julia implementations of time series, TimeSeries.jl and its TimeArray , or Temporal.jl and its ts ?

ExpandingMan · February 23, 2018, 8:43pm

Yes, but if I understand this correctly, in the Vector{Vector{T}} case a priori you have no guarantees about the memory layout and pretty much have to “get lucky”. In your example, your Vector{Vector{T}} was faster because you cleanly initialized it in such a way that you wound up with adjacent data and you separated your sum in the “wrong” way for the matrix. If you deal with matrices at least you know what you are going to get. If you had done sum(mm[:,i]) instead, it would be guaranteed to be at least as fast as the Vector{Vector{T}} case.

pasha · February 23, 2018, 9:01pm

I see in the docs and code that DenseArrays are stored column-major. Is there a different constructor or Type that stores row-major?

Addition: Looking at related issues 5932 and 4774 give me a deep appreciation for the stuff the founders had to slog through…

Seif_Shebl · February 23, 2018, 9:13pm

You are right, they are essentially the same, but still Vector{Vector{T}} has an advantage when speaking about allocations, plus a cleaner syntax (mm[:,i] vs. vv[i]). A more precise example:

using BenchmarkTools

function test_mat(mm)
	s0 = 0.0
	@inbounds for i = 1:5000
		@views s0 += sum(mm[:,i])
	end
	s0
end
function test_vec(vv)
	s1 = 0.0
	@inbounds for i = 1:5000
		s1 += sum(vv[i])
	end
	s1 
end
const vv = [randn(5000) for i=1:5000]
const mm = Array{Float64}(5000,5000)
for j = 1:5000, i = 1:5000
	mm[i,j] = vv[i][j]
end 
@btime s0 = test_mat(mm)
@btime s1 = test_vec(vv)

  12.438 ms (5000 allocations: 234.38 KiB)
  12.347 ms (0 allocations: 0 bytes)

antoine-levitt · February 23, 2018, 9:21pm

Yes, that is the allocation of the view object. On 0.7 it should go away.

gideonsimpson · February 23, 2018, 9:57pm

I should add that part of the reason I coded this in terms of Vector{Vector{T}} was so that I could use the . operation on functions which would take in Vector{T} valued input. For instance, in my original example, if I want the time series of the norms at each time slice, I can use norm.(Xvals). When I got started in Julia, there didn’t seem to be as obvious a way to perform such a vectorized operation were the data to be Matrix formatted.

BLI · February 23, 2018, 10:02pm

Many physics based problems are unsuitable for Matrix organization of data. Consider:

logging sensor data at different time intervals (e.g, temperature every 10 minutes, pressure every minute)
regular vs irregular time intervals (e.g, temperature at regular interval, lab analysis results at irregular interval)
different data types (e.g, scalar temperature, fixed size raw image or variable size jpg image)
data with varying size over time (e.g, regular interval of IoT temperature data, where new sensors are added or removed each time a system is sampled)

What is an efficient way to store and retrieve such data? (Array? Dictionary? DataFrame? Database?.. depending on amount of data, etc?)

ExpandingMan · February 23, 2018, 11:24pm

Yes, if this isn’t some dedicated algorithm and the rows (or columns) of matrix represent objects in their own right, what I’d do is a Vector of Julia structs, which may in turn contain StaticArrays. You may sacrifice a little bit of performance this way but one of the best things about Julia is that your custom types will have performance comparable to built in types. (This has been my approach when doing physics in Julia. Even though I use them every day in data science, I don’t find DataFrames useful for physics, in fact I might go as far as to say that they are an artifact of how badly data is usually organized in private industry.)

Topic		Replies	Views
Vector of Matrices of Varying Dimensions New to Julia	1	402	January 19, 2019
Square Bracket Notation for Broadcasting getindex Across Array of Vectors? General Usage plotting , array , linearalgebra	5	1782	September 13, 2019
How to take the mean of entries across an array of DataFrames conditional upon the value of a separate column? General Usage statistics , dataframes	5	1295	May 11, 2022
One-dimensional Array access New to Julia	9	540	May 29, 2020
Should I use either Dataframes.jl or Named Array for a long and wide array for sci computing General Usage	7	2570	July 25, 2019

Quick Longitudinal Access to arrays

Related topics