I have an array Xvals of type Array{Array{Float64,1},1}, which corresponds to a time series for a system in dimension d. I’d like to quickly look at the different components from a longitudinal point of view; in other words, I want to plot (n, x_i(n)) for n = 1, \ldots and i \in \{1, 2, \ldots, d\}. A “cheap” way to do this is to just create a new array:
X1=[X[1] for X in Xvals]
which can then be examined, but I was wondering if there were some more clever way to handle this, without creating new data structures.
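(For reference, two variants of the same “cheap” approach, not from the thread — a broadcast form, and a lazy generator that avoids allocating a new array at all; the toy data here is made up:)

```julia
Xvals = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy data, d = 2

X1 = getindex.(Xvals, 1)         # same result as the comprehension
X1_lazy = (X[1] for X in Xvals)  # lazy generator: no new array allocated
collect(X1_lazy)                 # materialize only when actually needed
```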
It is always better to use a Matrix than a Vector{Vector{T}} for performance reasons. It looks like it would be useful for what you want to do as well; for example, you could do
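(The snippet seems to have been dropped from the post; a minimal sketch of the kind of thing meant, assuming Xvals as above with time slices as columns of a d×N matrix:)

```julia
Xvals = [[1.0, 2.0], [3.0, 4.0]]   # toy data

Xmat = reduce(hcat, Xvals)   # d×N matrix, one column per time slice
X1 = Xmat[1, :]              # the (n, x_1(n)) series, no comprehension needed
```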
Basically a Vector{Vector{T}} is a pointer to a bunch of pointers (one for each inner Vector). The actual data contained by the inner Vectors are therefore not guaranteed to be stored contiguously or even in any sort of regular pattern. A Matrix is much simpler. While Julia stores data column-wise so that elements in adjacent columns are separated, this separation is guaranteed to be regular and the entire Array is contiguous. This makes for much more efficient access of the underlying data.
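(A small sketch of the contiguity point: in a Matrix the address of every element is fixed by its indices, which is exactly what a Vector{Vector{T}} does not guarantee for its inner vectors.)

```julia
m = zeros(Float64, 3, 3)

# Column-major, contiguous: element (1,2) (linear index 4) sits exactly
# three Float64s past element (1,1).
@assert pointer(m, 4) == pointer(m) + 3 * sizeof(Float64)
```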
Since you are doing physics, I suggest that you may be far happier with a simple Matrix than with a dataframe or anything else like that.
That’s true in theory. In practice, you have to work with views, which are not 100% optimized yet, as far as I know.
In any case, if you care about performance, an array of StaticArrays is probably your best bet for fast manipulations (same memory layout as a matrix, but with X[i][j] indexing), so I wouldn’t rush to change the structure of your code just yet.
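(A sketch of that suggestion, assuming the StaticArrays.jl package is installed: because a fixed-size SVector is an isbits value, a Vector of them stores the data inline and contiguously, like a d×N matrix, while keeping the Vector-of-Vectors indexing and broadcasting.)

```julia
using StaticArrays, LinearAlgebra

Xvals = [@SVector(rand(3)) for _ in 1:100]  # contiguous, matrix-like layout

X1 = [x[1] for x in Xvals]   # longitudinal view of component 1, as before
norms = norm.(Xvals)         # dotted calls work unchanged
```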
I might be wrong, but I don’t think this is true in general. If you operate over the rows of a matrix, for example, you will see a big penalty using a Matrix instead of a Vector{Vector{T}}: in the latter case you access each inner vector in its natural memory layout. See the following example:
function test_mat_vec()
    vv = [randn(5000) for i = 1:5000]
    mm = Matrix{Float64}(undef, 5000, 5000)
    for j = 1:5000, i = 1:5000
        mm[i, j] = vv[i][j]
    end
    s0, s1 = 0.0, 0.0
    t0 = @elapsed for i = 1:5000
        @views s0 += sum(mm[i, :])
    end
    t1 = @elapsed for i = 1:5000
        s1 += sum(vv[i])
    end
    println("Matrix: Res = $s0, time = $t0")
    println("Vector: Res = $s1, time = $t1")
end
test_mat_vec()
Matrix: Res = -701.7163654267092, time = 0.247252478
Vector: Res = -701.7163654267131, time = 0.012297937
This is misleading: the point is to compare A[i,j] with A[j][i], but you’re comparing A[i,j] with A[i][j] (so that the “fast” index is the same in both cases). Moreover, your example makes a whole copy of mm[i,:]: a better comparison uses a view (views are bad in 0.6 but are supposed to be good in 0.7).
100%: if you’re working with regular 2d data you want a matrix with the orientation (data in rows or columns) chosen carefully so that it matches your operations well.
Yes, but if I understand this correctly, in the Vector{Vector{T}} case a priori you have no guarantees about the memory layout and pretty much have to “get lucky”. In your example, your Vector{Vector{T}} was faster because you cleanly initialized it in such a way that you wound up with adjacent data and you separated your sum in the “wrong” way for the matrix. If you deal with matrices at least you know what you are going to get. If you had done sum(mm[:,i]) instead, it would be guaranteed to be at least as fast as the Vector{Vector{T}} case.
You are right, they are essentially the same, but Vector{Vector{T}} still has an advantage in allocations, plus cleaner syntax (vv[i] vs. mm[:,i]). A more precise example:
using BenchmarkTools

function test_mat(mm)
    s0 = 0.0
    @inbounds for i = 1:5000
        @views s0 += sum(mm[:, i])
    end
    s0
end

function test_vec(vv)
    s1 = 0.0
    @inbounds for i = 1:5000
        s1 += sum(vv[i])
    end
    s1
end

const vv = [randn(5000) for i = 1:5000]
const mm = Matrix{Float64}(undef, 5000, 5000)
for j = 1:5000, i = 1:5000
    mm[i, j] = vv[i][j]
end

@btime s0 = test_mat(mm)
@btime s1 = test_vec(vv)
12.438 ms (5000 allocations: 234.38 KiB)
12.347 ms (0 allocations: 0 bytes)
I should add that part of the reason I coded this in terms of Vector{Vector{T}} was so that I could use broadcasting (the . syntax) with functions that take Vector{T}-valued input. For instance, in my original example, if I want the time series of the norms at each time slice, I can use norm.(Xvals). When I got started in Julia, there didn’t seem to be as obvious a way to perform such a vectorized operation had the data been stored as a Matrix.
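(For what it’s worth, the Matrix equivalents of norm.(Xvals) are short too — a sketch, assuming time slices are the columns of Xmat; eachcol needs Julia ≥ 1.1:)

```julia
using LinearAlgebra

Xmat = rand(3, 100)   # toy d×N data

norms = norm.(eachcol(Xmat))                          # like norm.(Xvals)
norms2 = [norm(@view Xmat[:, n]) for n in 1:size(Xmat, 2)]  # same, explicit views
```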
Yes, if this isn’t some dedicated algorithm and the rows (or columns) of the matrix represent objects in their own right, what I’d do is use a Vector of Julia structs, which may in turn contain StaticArrays. You may sacrifice a little bit of performance this way, but one of the best things about Julia is that your custom types have performance comparable to built-in types. (This has been my approach when doing physics in Julia. Even though I use them every day in data science, I don’t find DataFrames useful for physics; in fact, I might go as far as to say that they are an artifact of how badly data is usually organized in private industry.)
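(A sketch of that Vector-of-structs idea; the State type and its fields are hypothetical, not from the thread, and StaticArrays.jl is assumed to be installed. Since the struct only holds isbits SVectors, the Vector of them is still stored contiguously:)

```julia
using StaticArrays

struct State
    x::SVector{3,Float64}   # e.g. position
    v::SVector{3,Float64}   # e.g. velocity
end

traj = [State(@SVector(rand(3)), @SVector(rand(3))) for _ in 1:100]
x1 = [s.x[1] for s in traj]   # first coordinate over time
```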