There are also some cases where Vector{Vector{T}} becomes the natural choice. For example, I was once trying to find and store all non-zero elements in each row of a big random binary matrix. The number of non-zero elements is different from one row to another. With Vector{Vector{T}}, we can build irregular arrays like this:
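Something along these lines (the sizes and names are just illustrative):
A = rand(0:1, 1000, 1000)   # a big random binary matrix
# one vector of column indices per row; the lengths differ from row to row
nzcols = [[j for j = 1:size(A, 2) if A[i, j] != 0] for i = 1:size(A, 1)]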
For processing the data, I think it really depends on how much data there is: is it even possible to keep it all in memory? A lot also depends on how long a period of time can be held in memory without going out to disk.
As another data point, LightGraphs uses Vector{Vector{T}} to store vertex adjacency information. (We keep the inner vectors sorted to improve lookup performance, bearing the cost at insertion time.)
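The pattern is roughly this (a sketch with made-up names, not the actual LightGraphs API):
fadj = [Int[] for _ in 1:5]                 # one sorted neighbor list per vertex
function insert_edge!(fadj, u, v)
    nbrs = fadj[u]
    i = searchsortedfirst(nbrs, v)          # pay the sorting cost at insertion time
    (i > length(nbrs) || nbrs[i] != v) && insert!(nbrs, i, v)
    return fadj
end
has_edge(fadj, u, v) = !isempty(searchsorted(fadj[u], v))   # fast lookup in a sorted vector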
There is a lot of overhead with Julia's Vector type.
If the rows are not being modified (in length at least) once added to the vector of vectors,
and rows are just added at the end, it might be good to use two vectors, one of the values, and another of the offsets into that vector.
That will likely keep data closer together in memory, and be smaller.
The offset vector might simply be UInt32 as well.
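A rough sketch of that layout (illustrative names; the offsets mark where each row starts, with one extra sentinel at the end):
values  = Float64[]               # all row data, stored back to back
offsets = UInt32[1]               # offsets[i] is where row i starts in `values`
function push_row!(values, offsets, row)
    append!(values, row)
    push!(offsets, UInt32(length(values) + 1))
end
getrow(values, offsets, i) = view(values, Int(offsets[i]):Int(offsets[i+1]) - 1)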
Your question is too general to be answered concretely. There are various trade-offs, which depend on the data size and structure.
The most general format is usually called “tidy data”, and conceptually maps to relational databases. You can use the latter, or a DataFrame. See the related papers on how to handle that idiomatically.
That said, “tidy data” can be very redundant. If your data is very large, you may have to consider custom solutions. E.g. when dealing with a large dataset, I wrote a thin layer for indexing a ragged array into a vector.
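The idea is roughly this (an illustrative sketch, not the actual code):
# a ragged array stored in one flat vector, indexed through per-row offsets
struct Ragged{T}
    data::Vector{T}
    offsets::Vector{Int}   # length nrows + 1; row i lives at offsets[i]:offsets[i+1]-1
end
Base.getindex(r::Ragged, i::Int) = view(r.data, r.offsets[i]:r.offsets[i+1] - 1)
Base.length(r::Ragged) = length(r.offsets) - 1

r = Ragged(collect(1.0:6.0), [1, 3, 4, 7])   # rows of lengths 2, 1, 3
r[3]                                         # view of [4.0, 5.0, 6.0]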
Some of these solutions, once we get down to bit arrays and pointers, become deep departures into Array internals, far removed from thinking about the physics problem at hand. While they are fun, they are hard to maintain and may not offer any tangible benefit.
For example, in the original @Seif_Shebl example above, with 5000x5000 samples there was no timing difference. The allocation difference in that example would go away with longitudinal access (the subject of this thread).
If the problem at hand is handled simply by a Matrix, and computes nearly as fast as something esoteric, then by all means, I suggest we be kind to future code maintainers and just use one. If it’s ragged data (not the case here) or super-wide or super-long, then adjust strategies. Future Julia versions will keep optimizing the internals, which will break esoteric solutions and make simpler implementations faster.
At least I’ll speak for myself and do that in my own code. I do not want to saddle myself or others, in some future Julia 2.3, with figuring out a workaround built around how the internals were laid out way back in 0.6.
Thinking about the cost of Matrix, I ran some code to get the timing differences. Actually, it appears to me that the original @Seif_Shebl code had the matrix code operating on columns rather than rows. If you compare row access to row access, and column access to column access, the two approaches differ by about 10x. Vector-of-Vector is 10x faster for row access, Matrix is 14x faster for column access.
In terms of absolute speed for a 5000x5000 dataset, we’re still only talking about 0.1-0.2 seconds - so I’ll stand by my original argument for simplicity and Julia growth - but that’s the cost. If any of you want to try a Ptr solution in the same rig, feel free. Most of the time cost comes from non-contiguous memory access.
Timing results:
INFO: creating data
INFO: timing matrix on rows
166.488 ms (5000 allocations: 234.38 KiB)
INFO: timing vector on rows
17.905 ms (0 allocations: 0 bytes)
Method Diff = -1.3642420526593924e-11
INFO: timing matrix on columns
17.333 ms (5000 allocations: 234.38 KiB)
INFO: timing vector on columns
238.989 ms (0 allocations: 0 bytes)
Method Diff = 1.77351466845721e-10
Benchmark code:
using BenchmarkTools
# row access on the column-major Matrix: each row view strides across memory
function test_mat_row(mm)
s0 = 0.0
@inbounds for i = 1:5000
s0 += sum(view(mm, i, :))
end
s0
end
# column access on the Matrix: each column view is contiguous
function test_mat_col(mm)
s0 = 0.0
@inbounds for i = 1:5000
s0 += sum(view(mm, :, i))
end
s0
end
# row access on the Vector of Vectors: each inner vector is contiguous
function test_vec_row(vv)
s1 = 0.0
@inbounds for i = 1:5000
s1 += sum(vv[i])
end
s1
end
# column access on the Vector of Vectors: hops between inner vectors for every element
function test_vec_col(vv)
s1 = 0.0
@inbounds for j = 1:5000, i = 1:5000
s1 += vv[i][j]
end
s1
end
info("creating data")
const vv = [randn(5000) for i=1:5000]
const mm = Array{Float64}(5000,5000)
for j = 1:5000, i = 1:5000
mm[i,j] = vv[i][j]
end
info("timing matrix on rows")
sm1 = @btime test_mat_row(mm)
info("timing vector on rows")
sv1 = @btime test_vec_row(vv)
println("Method Diff = $(sm1 - sv1)")
info("timing matrix on columns")
sm2 = @btime test_mat_col(mm)
info("timing vector on columns")
sv2 = @btime test_vec_col(vv)
println("Method Diff = $(sm2 - sv2)")
I think this conclusion is not precise because you have coded test_vec_col(vv) differently from test_mat_row(mm). IMO, test_vec_col(vv) should be coded in the same way as test_mat_row(mm) like this:
function test_vec_col(vv)
s1 = 0.0
@inbounds for j = 1:5000
@views s1 += sum(vv[:][j])
end
s1
end
In light of this, my conclusion is: Vector-of-Vector is 20x faster for row access, Matrix is 1x faster for column access.
These are the results on my desktop, which again confirm that Vector-of-Vector is the winner here.
INFO: creating data
INFO: timing matrix on rows
246.274 ms (5000 allocations: 234.38 KiB)
INFO: timing vector on rows
12.071 ms (0 allocations: 0 bytes)
Method Diff = -4.547473508864641e-12
INFO: timing matrix on columns
11.806 ms (5000 allocations: 234.38 KiB)
INFO: timing vector on columns
12.248 ms (5000 allocations: 234.38 KiB)
Method Diff = -1.7280399333685637e-11
Edit:
Oops, yes, I was wrong; it does the same thing as sum(vv[j]). Also, I agree with you that in many cases one can construct one’s own data structures that outperform the built-in Matrix; I did in some cases. In the end, the nature of the physical problem we are trying to solve will drive us to favor one approach over another.
Comparing row access for a Matrix vs a Vector of Vectors doesn’t make sense, because if you wanted that optimized you would have transposed the Matrix. A fair comparison would measure the best-case scenario for both and the worst-case scenario for both, i.e. (for the worst cases) row access for the Matrix and accessing the n-th element of each of the inner vectors.
That’s not doing a column-wise sum — it’s identical to sum(vv[j]):
julia> vv = [rand(4) for _ in 1:10];
julia> sum(vv[:][3])
1.664930579622523
julia> sum(vv[3])
1.664930579622523
julia> sum(vv[i][3] for i in 1:10)
3.5584368084828224
This all just comes down to the operations you’re wanting to do and how important it is to eke out that last bit of performance. Yes, in many cases you can construct your own data structures that outperform the built-in Matrix… but as you head off the beaten trail you may find more and more surprises like this. Note that column-wise vs. row-wise sums aren’t inherently many factors slower — Base’s built-in algorithms compensate for the storage structure and work hard to accommodate non-optimal access patterns. Try sum(A, 1) vs. sum(A, 2).
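(In the 0.6 syntax used in this thread, the second argument is the dimension being collapsed, so something like:)
col_sums = sum(mm, 1)   # 1×5000: sums down each column (contiguous access)
row_sums = sum(mm, 2)   # 5000×1: sums across each row, walking memory in column-major order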
Yeah, sorry for my tangent — I was in no way advocating for using these as workarounds. I think I’ll try to split those posts out.
Wow, sum(mm, 1) is faster than any of the other row methods, including vv. This is having our cake (a clear data structure) and eating it (summing it quickly) too.
@mbauman, where is that sum(Array, dim) implemented? I can’t find it in Base/array*.jl. Is there a way to apply that approach to other methods (e.g. norm) which don’t have a dimension argument?
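For example, is hand-rolling something like this (just a sketch) the intended route for a row-wise norm?
function rownorms(A::AbstractMatrix)
    acc = zeros(size(A, 1))
    @inbounds for j = 1:size(A, 2), i = 1:size(A, 1)   # walk memory in column-major order
        acc[i] += A[i, j]^2
    end
    return sqrt.(acc)
end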
And hey, we’re not fighting about Vector-of-Vectors vs Matrix here… it’s the same language, same community; do what you will with it.
Just to thank those who have chimed in and reiterate what I’m getting at here:
It is often sensible to generate time series data in the form of an array of arrays, in that it logically makes sense to record the variables associated with the same time slice as part of the same array.
It can make sense to use arrays of arrays over matrices when there is variability in the size of the arrays at each time slice, or when the data recorded at each time slice is not all of the same type (i.e., a mix of floating point, integer, and categorical values).
It would be convenient, in the post-processing step, to have a concise way of extracting the components in order to look at their longitudinal variation. I have been doing this by creating new arrays and populating them (roughly as sketched below), but I was curious to hear what others thought, or if there should be a better mechanism.
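To make that last point concrete, this is the kind of thing I mean (illustrative data):
states = [rand(3) for t = 1:100]   # states[t] holds the variables recorded at time slice t
x1 = [s[1] for s in states]        # longitudinal trace of the first component
# when every slice has the same length, you can also flatten once and slice the matrix
M = hcat(states...)                # 3×100; M[1, :] is the same trace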