Efficiently Store and Access Large 3-Dimensional Arrays

I am working on a simulation that stores and removes values (floats of some kind) at various points in a 3-dimensional space, represented by an array of size ~(2e4 x 4e4 x 1e2). Specifically, it records properties of the sediments (e.g. the average size of the sand grains) that get deposited in, or eroded from, an ocean basin with uneven sea-bed surface elevation (measured from some uniform “base” level, which forms the bottom layer of the 3-D array, with “depth” being the shortest dimension). Aside: there might be a need for additional arrays like this to store other relevant values.

Most of these deposition/erosion events happen in the central (i.e., not near the edges) part of this array, at seemingly random points (depending on the overlying water current).

With an array this size, memory will be overwhelmed if I use a simple 3-D array (right?). I would like to hear your suggestions on what best to do. So far, I have considered:

(i) Using a collection of sparse matrices, each one being a “slice” of the 3-D grid along the length or width. (The slicing is because I could not find a sparse-array implementation in Julia that is not limited to 2-D.) The issue with this is how best to automatically allocate the correct number of slices at the beginning of the simulation, when the length and width of the ocean basin get specified, and how to reference/track them efficiently as the simulation progresses. How to write sparse arrays to file(s) without converting them back into full arrays is another open question. (A rough sketch of what I mean is below, after (ii).)

(ii) Using something like the Zarr implementation (but adapted to Julia, of course). My concern about this is whether dividing the array into compressed “chunks” that are slices of the 3-D array is going to help with the speed of accessing random points in the central part of this 3-D grid, given the additional computational time needed to split and compress these “chunks”.
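For concreteness, here is roughly what I mean by option (i), using the standard-library SparseArrays package; the nx/ny/nz sizes are placeholders, much smaller than the real grid:

using SparseArrays

nx, ny, nz = 2_000, 4_000, 100     # placeholder sizes

# one sparse (ny × nz) slice per x index, allocated once the basin size is known
slices = [spzeros(Float64, ny, nz) for _ in 1:nx]

slices[10][25, 3] = 1.5e-3         # record a grain size at (x = 10, y = 25, z = 3)

# for writing to disk without densifying, findnz extracts the (row, column, value) triplets
I, J, V = findnz(slices[10])

And for option (ii), roughly this, via what is presumably the Zarr.jl package; I have not tested it, and the chunk shape and path are arbitrary guesses on my part:

using Zarr

# a disk-backed, chunked, compressed array; the chunk shape is chosen so that an
# entire z-column (plus an x-y neighbourhood) lives inside a single chunk
grid = zcreate(Float64, 20_000, 40_000, 100;
               path = "sediment.zarr", chunks = (200, 200, 100))

grid[10_000, 20_000, 42] = 1.5e-3   # a random access (de)compresses the enclosing chunk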

I would love to hear any critique and comments on what I have considered above, and any suggestions or ideas that you might have.

Thank you!

The question of what data structure to use depends a lot on what operations you need to do frequently. What operations do you need to be fast?

For example, many types of particle simulations mainly need “closest point” operations to be fast, and hence use some kind of tree (e.g. a k-d tree or octree).


There are no mathematical operations that need to be performed on this array, merely access and assignment of individual elements. There isn’t much predictability about where these accesses and assignments might happen, unfortunately, except that they occur in the layers close to the surface (i.e., the top-most non-zero values of each column when looking down along the shortest dimension of the array). The determination of where these assignments should happen depends on some arithmetic and comparisons on other 2-D arrays (existing and already implemented) that sit immediately above the “surface” of the 3-D array.

I guess I need to make locating and replacing values in the array as fast as possible without blowing up the memory, while still being able to write the array to a file (can be binary) without some heavy conversion.

This certainly seems like an excellent application for HDF5, but I’m unsure of the state of Julia’s HDF5 package.

https://github.com/JuliaIO/HDF5.jl
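If it helps, the basic usage I have in mind looks something like this (untested; the file and dataset names are arbitrary):

using HDF5

A = rand(Float64, 200, 400, 100)   # stand-in for (a block of) the sediment grid

h5write("sediment.h5", "grid", A)                          # write a named dataset
Asub = h5read("sediment.h5", "grid", (1:200, 1:400, 1:10)) # read back only a slab of it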


This is the key question. How do you decide which elements to access? If it is some kind of nearest-point question, you might use a tree structure of some sort.

(I don’t see how HDF5 helps here. It is not really designed for sparsity.)


I see. The way decisions are made is by comparing a set of parameters (already calculated in other parts of the simulation) to their corresponding threshold values. These parameters are also stored element by element in individual 2-D arrays, whose dimensions correspond to the two larger dimensions of the 3-D array (length and width; let’s call them x and y).

If the parameters at a specific x and y cross the “erode” threshold, the top-most (in the depth dimension; let’s call it z) non-zero element (or elements, depending on how strongly the parameters cross the threshold) at that x and y coordinate is set to zero. Where the “top-most non-zero element” sits is neither uniform across the x-y plane nor static through the time steps of the simulation.

Similarly, if the parameters at a specific x and y fall below the “deposit” threshold (which is distinct from, and smaller in value than, the “erode” one), appropriate non-zero value(s) are assigned to the element(s) immediately above (in the z direction) the existing non-zero values.

Every element in the x-y grid of parameters is checked against these thresholds at the end of each time step, and the appropriate removal or assignment of values is then applied to the eligible elements of the 3-D grid. Since this is part of a somewhat random fluid-flow problem, the only thing I can say about the proximity of the elements to be modified is that they tend to cluster as non-straight line segments (pieces of discretised curves when projected onto the x-y plane) of variable width, and they lie in the top-most one or few non-zero elements of the z-columns involved.
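To make the rule concrete, here is a runnable toy version; every name, size, and threshold is a placeholder, the dense 3-D array is only for illustration (memory is exactly what I am trying to avoid), and the real rule can remove or add more than one cell at a time:

nx, ny, nz = 50, 80, 20                       # toy dimensions
grid = zeros(Float64, nx, ny, nz)
grid[:, :, 1:5] .= 1.0e-3                     # a few initial sediment layers
erode_param   = rand(nx, ny)                  # placeholders for the 2-D parameter fields
deposit_param = rand(nx, ny)
erode_threshold, deposit_threshold = 0.9, 0.1
new_grain_size(x, y) = 1.0e-3                 # placeholder deposition value

for y in 1:ny, x in 1:nx
    ztop = findlast(!iszero, view(grid, x, y, :))   # current surface of this column
    ztop === nothing && continue                    # column eroded down to the base
    if erode_param[x, y] > erode_threshold
        grid[x, y, ztop] = 0.0                      # erode: zero the top-most sediment cell
    elseif deposit_param[x, y] < deposit_threshold && ztop < nz
        grid[x, y, ztop + 1] = new_grain_size(x, y) # deposit: fill the cell above the surface
    end
end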

I am unsure if I am helping with the above description. Please let me know if I am not being clear or if I am misunderstanding your question. Thank you!

If you need to store information at each xyz grid point, it’s not a job for a sparse array. With 2e4*4e4*1e2 Float64 elements, you’re looking at 640 GB of memory, or several times more than that if you need to store multiple properties per grid point. If you want to keep everything in memory, you’ll need to run this on an MPI cluster; trying to run this as a (disk-based) memory-mapped array will absolutely kill your performance. Can you provide more detail about the algorithm, the parameters & the (presumably coupled) fluids problem?
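For reference, the arithmetic behind that figure, as a Julia one-liner:

2e4 * 4e4 * 1e2 * sizeof(Float64) / 1e9   # 8e10 elements × 8 bytes = 6.4e11 bytes ≈ 640 GB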

Making a few assumptions about the structure of your problem, it seems like you don’t necessarily need to store the entire z-slice in memory, since it’ll mostly be water or undisturbed sand. What determines the initial distribution of sand particle size & seafloor height? If you’re smart about storing only what’s needed, you can really cut down on resource requirements. Sketching out a solution,

struct Sediment{T}
    ⌀::T    # average grain diameter
    ρ::T    # density
end

struct SedimentColumn{T}
    z::T # location of seafloor
    sand::Vector{Sediment{T}} # sediment storage array
end

# toy initial-condition functions; stand-ins for the real model inputs
seafloorheightinit(x, y) = sin(x) + cos(y)
ρinit(z) = 2.2 + z*randn()
⌀init(z) = 3e-3 + 1e-3*randn()

function SedimentColumn(x, y)
    z = seafloorheightinit(x, y)
    # initialize only the top layer of sediment
    topsand = [Sediment(⌀init(z - h), ρinit(z - h)) for h in 0:3]
    return SedimentColumn(z, topsand)
end

seafloor = [SedimentColumn(x, y) for x in LinRange(0, 40, 400), y in LinRange(0, 80, 1000)]

You can then operate on the 2D array of sediment columns instead of traversing through a mostly-static 3D array. You’d have to add logic for how to re-fill a sediment column if it’s emptied by erosion (re-fill using values from a user-defined function? from a database stored on disk?).


Thank you for the clever sketch! :smiley: I suppose the idea then (re: the memory limit) is to decide which columns are inactive and can be stored on disk; am I understanding correctly?

Indeed, that was what got me worrying about this at first. In practice, only about half (the bottom half, irregularly; higher in z-direction in the central portions of the x-y plane and lower closer to the edges) of the elements will be filled with non-zeros by the end. I guess that fraction of zero elements still does not make sparse arrays very helpful.

The simulation does not actually solve the fluid-dynamics equations, due to the very long time-scales we are looking at. It utilises a much simpler rule-based approach: sending fluid “particles” carrying specific properties through the domain and guiding them with some random-walk-like rules. After thousands and thousands of these particles, the whole picture slowly builds up into some “flow field”, with averaged directional and flow-speed information. These are the values stored in the 2-D arrays and compared to the threshold values at regular intervals, after each batch of particles has passed through the simulation domain.

The particle-size distribution is prescribed at the start of the simulation and does not evolve. It is the size-dependent threshold values that cause the variations expected in the average deposited-particle size. The seafloor height is currently flat at the start of the simulation (i.e., like an uncomfortable bathtub with a rock-slab bottom), but may get some simple tilt in the future.

The idea isn’t to selectively store columns in memory, since the somewhat random read/write pattern will impose heavy IO burdens. Instead, you’d only initialize the top layer of sediment (across the whole domain) in memory, using the pop! and push! functions to erode and deposit sand from each column. That way, you don’t have to waste memory on undisturbed sediment or seawater. You can lazily fill in the lower extent of the columns as they’re eroded below the region you’ve initialized, or offload the lower extent of columns in regions of high deposition if you’re running out of RAM.
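To make that concrete, here is a minimal sketch on top of the SedimentColumn above. Two conventions differ slightly from the earlier snippet and are assumptions on my part: the sand vector is ordered bottom-to-top (so push!/pop! act at the sediment-water interface), and col.z is read as the elevation of the bottom of the stored stack rather than of the seafloor itself. Δz and backfill! are hypothetical:

const Δz = 0.1   # assumed uniform layer thickness

# surface elevation = base of the stored stack + thickness of the stored layers
surfaceheight(col::SedimentColumn) = col.z + Δz * length(col.sand)

erode!(col::SedimentColumn) = pop!(col.sand)                      # remove the surface layer
deposit!(col::SedimentColumn, s::Sediment) = push!(col.sand, s)   # add a new surface layer

# hypothetical lazy back-fill for columns eroded below what was initialized: prepend
# deeper layers from a rule (or load them from disk); note that col.z would then also
# need lowering, which would require making SedimentColumn mutable
backfill!(col::SedimentColumn, n) = prepend!(col.sand, [Sediment(3e-3, 2.2) for _ in 1:n])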


If you are doing a particle-tracking method for fluid simulation, my impression is that you don’t need a 3d array at all. Why not just have something like:

  1. an array data of coordinates (probably use SVector{3,Float64} for the coordinate vectors data[i] via StaticArrays), one element per particle that you are tracking

  2. a tree (e.g. via the NearestNeighbors package) so that you can quickly look up particles in a given region of space (e.g. near the boundaries), or possibly a 3d array of coarse-grained “boxes” in space that each contain a list of the particles in each box.
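A rough sketch of those two pieces together; the particle positions, query point, radius, and k are all arbitrary here:

using StaticArrays, NearestNeighbors

# one SVector{3,Float64} per tracked particle
particles = [SVector(rand(), rand(), rand()) for _ in 1:100_000]

# NearestNeighbors accepts a vector of static vectors directly
tree = KDTree(particles)

query = SVector(0.5, 0.5, 0.5)
nearby = inrange(tree, query, 0.05)   # indices of all particles within radius 0.05
idxs, dists = knn(tree, query, 10)    # the 10 nearest particles and their distances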


It seems that the particle advection step is not the bottleneck here, but rather the supposed need for at least 640 GB of memory to store a single scalar property at every point within the massive 3D domain (assuming every grid point is stored as the OP describes). Given that most grid points are completely static at a given timestep, and at most a few colocated points at the sediment-seawater interface might be read/written within a given z-column, the question is how best to represent and cache that portion of the data, and thereby boil it down to a simpler, cheaper, quasi-2D problem.


Thank you for your helpful suggestions and discussions! I learned a lot and will proceed to apply what I learned. :blush:

@Dr.Merkwuedigliebe, I recently came across the AIBECS.jl ocean circulation/transport modeling package, which seems relevant to your work.
