Best way to store a dataset with specific structure

I am new to Julia and I must store and access a weirdly shaped dataset, for which I would like to ask your advice. It’s not very easy to explain what I must do.

I must store in a variable, let's say A, a collection of points with coordinates in 2D, with the following requirements (I know that I am violating basic conventions of arrays below; I am just trying to get my point across):

  • Consider two integers K, J. For each pair of integers (k,j), where k = 1,...,K and j = 1,...,J, A(k,j) contains n_{k,j} points, each having coordinates x and y. A(k,j) can be an array with 2 columns and n_{k,j} rows, but it does not need to be an array, strictly speaking. For instance, A(1,1) may contain 3 points (x1,y1), (x2,y2), (x3,y3) in whatever format is best, while A(1,2) may contain 5 points (x1,y1), ..., (x5,y5).

  • While the dataset is generated, I will sort points by their x coordinates, so that A(k,j) stores its points in increasing order of x: x1 <= x2 <= ...

  • Typically, I will need to access only the ‘topmost’ element of A(k,j), in the sense that for some time I will need access to the point with coordinate x1, and then, from some moment onward, only to the point with coordinate x2, and so on.

  • Finally, the bounds K and J over which the index (k,j) ranges may be moderately large.

I am not after the best or most performant solution: I am just starting to prototype, and for now I would like to know what a good data structure would be in Julia to store A. I will experiment with K and J in the order of hundreds, before potentially scaling up.

Thanks in advance

Hello,

you mean store to disk? For what purpose do you want to store it?

If your own application is the only one accessing it and you don’t need access over a long period of time, you could just serialize the data structure you use in memory.

See Serialization · The Julia Language

However, it depends a lot on your use case.

BR Stefan

Thanks @sschmidhuber.

I must extract these points at the beginning of a simulation, and the coordinates (x,y) determine the occurrence of certain events. During the simulation, I have to monitor whether certain variables meet some conditions depending on (x,y), and I have to then take some actions during simulations.

So yes, I want to store them on disk, at least in this prototyping phase. Later on I can think of a cleverer way to economise, and only retain in memory the topmost one.

I am not sure I understand what is meant by serialisation, sorry.

In that case Serialization · The Julia Language should work fine. It is basically the easiest way to store data structures (arbitrary Julia objects) and read them back (deserialise) later.

The process of bringing the in-memory structure into a form to write to disk is called serialization.
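
As a minimal sketch of what that round trip could look like with the Serialization standard library: the small matrix `A` and the file name `points.jls` below are made up for illustration, and any Julia object could take the place of `A`.

```julia
using Serialization  # standard library, nothing to install

# Hypothetical stand-in for the dataset: a 2×2 matrix whose entries
# are vectors of (x, y) points with different lengths.
A = Matrix{Vector{Tuple{Float64,Float64}}}(undef, 2, 2)
A[1, 1] = [(0.1, 0.9), (0.4, 0.2), (0.7, 0.5)]
A[1, 2] = [(0.2, 0.3)]
A[2, 1] = Tuple{Float64,Float64}[]           # an empty entry is fine too
A[2, 2] = [(0.0, 1.0), (0.5, 0.5)]

path = joinpath(mktempdir(), "points.jls")   # hypothetical file name

serialize(path, A)       # write the object to disk
B = deserialize(path)    # read it back, possibly in a later session
```

The serialized format is tied to the Julia version that wrote it, so this suits short-lived prototype data rather than long-term storage.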

Thanks @sschmidhuber. Maybe I am being thick, but I still don’t understand what basic data structure I have to use. What type of data is A? I don’t see how A can be a multidimensional array, because each A(k,j) has different dimensions. Do you understand my problem, even before I worry about performance? Does serialisation solve my problem?

Serialization is a library that turns Julia objects into a file saved on disk so you can open them up in another session.

It seems you are actually asking how to construct a Julia object with the given characteristics.

I’m on my phone right now, but I can take a crack at it when I get to my computer tomorrow.

How about storing all the (x,y) points in one array, with the points of each A(k,j) contiguous in that array, and then having another 2D array of views into this array:

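using StatsBase  # provides `sample`

W = 3    # number of rows of A
H = 4    # number of columns of A
N = 100  # total number of points
# All points live in one flat vector of (x, y) tuples.
v = collect(zip(rand(N),rand(N)))
# Draw W*H-1 distinct cut positions that split v into W*H contiguous blocks.
q = sort(sample(1:length(v), W*H-1; replace=false))
# Build one view per block and arrange the views in a W×H matrix.
A = reshape(view.(Ref(v),(:).(vcat(1,q.+1), vcat(q,length(v)))),W,H)
# Sort each block in place (tuples compare by x first, then by y).
foreach(A) do V
    sort!(V)
end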

This code generates:

julia> A
3×4 Matrix{SubArray{Tuple{Float64, Float64}, 1, Vector{Tuple{Float64, Float64}}, Tuple{UnitRange{Int64}}, true}}:
 [(0.660423, 0.955508), (0.816744, 0.276735)]                                                                                                                …  [(0.449352, 0.302249), (0.491303, 0.0782995)]
 [(0.0977596, 0.478212), (0.486824, 0.570629), (0.498302, 0.388361), (0.553279, 0.221257)]                                                                      [(0.0689994, 0.397753), (0.118564, 0.495056), (0.361085, 0.823699), (0.607047, 0.144255), (0.647361, 0.0278309), (0.769437, 0.343957)]
 [(0.130696, 0.319694), (0.180348, 0.984854), (0.262889, 0.400248), (0.385384, 0.895865), (0.552991, 0.190508), (0.746759, 0.101692), (0.946899, 0.395204)]     []

which can be accessed:

julia> A[1,1]
2-element view(::Vector{Tuple{Float64, Float64}}, 1:2) with eltype Tuple{Float64, Float64}:
 (0.660422948459961, 0.9555079639677757)
 (0.8167439063221735, 0.27673490902901654)
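
The line that builds `A` above is dense, so here is a sketch that unpacks the same idea step by step. It uses only Base, and the nine points and the cut positions in `cuts` are fixed, made-up values (instead of random ones) so the result is reproducible.

```julia
# Nine made-up points stored in one flat vector.
v = [(0.9, 0.1), (0.3, 0.7), (0.5, 0.5), (0.2, 0.8), (0.6, 0.4),
     (0.1, 0.2), (0.8, 0.3), (0.4, 0.6), (0.7, 0.9)]

W, H = 2, 2
cuts = [2, 5, 6]                  # hypothetical cut positions (W*H - 1 of them)

starts = vcat(1, cuts .+ 1)       # [1, 3, 6, 7]
stops  = vcat(cuts, length(v))    # [2, 5, 6, 9]
ranges = [a:b for (a, b) in zip(starts, stops)]   # 1:2, 3:5, 6:6, 7:9

blocks = [view(v, r) for r in ranges]   # one view per block, no copying
A = reshape(blocks, W, H)               # column-major: A[1,1], A[2,1], A[1,2], A[2,2]

foreach(sort!, A)                 # sort each block in place, by x then by y
```

Because the entries of `A` are views, sorting them reorders the corresponding stretch of `v` itself; no second copy of the points exists.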

Is this the type of data structure you had in mind?
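
If views into one shared array are more than the prototype needs, a plainer variant is a matrix whose entries are independent vectors of points. The sketch below assumes that layout; `Point`, `next_idx`, `has_top`, `top` and `advance!` are hypothetical names introduced here for illustration, not part of any package. The extra matrix `next_idx` tracks which point of each A[k,j] is currently the ‘topmost’ one.

```julia
Point = Tuple{Float64,Float64}   # an (x, y) pair

K, J = 3, 4                      # small, hypothetical sizes

# A[k, j] is its own vector, so each entry can have a different length.
A = [Point[] for k in 1:K, j in 1:J]

push!(A[1, 1], (0.7, 0.2), (0.1, 0.9), (0.4, 0.5))
push!(A[1, 2], (0.3, 0.3), (0.2, 0.8))

foreach(sort!, A)                # tuples sort by x first, then by y

# next_idx[k, j] is the position of the current topmost point in A[k, j].
next_idx = ones(Int, K, J)

# Hypothetical helpers for reading and advancing the topmost point.
has_top(k, j) = next_idx[k, j] <= length(A[k, j])
top(k, j) = A[k, j][next_idx[k, j]]
advance!(k, j) = (next_idx[k, j] += 1)

first_top = top(1, 1)            # point with the smallest x in A[1, 1]
advance!(1, 1)
second_top = top(1, 1)           # point with the next x in A[1, 1]
```

Keeping a cursor leaves the stored points untouched; if consumed points are never needed again, calling `popfirst!(A[k, j])` instead would drop them as the simulation moves on.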

I actually misunderstood your question: I thought you already had your data structure in memory and were looking for a way to persist it on disk.
