Memory-cheap way to calculate a 1D-formatted, non-zero-only histogram of sparse data?

question

#1

Hello,

I’ve noticed that I run out of memory extremely quickly when calculating a high-resolution histogram of three-dimensional data (using StatsBase). I guess the reason is that the size of the output is proportional to the product of the lengths of the individual edge vectors (I provide the edges), which skyrockets very quickly at the resolution I require.
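A quick back-of-envelope calculation shows why the dense output blows up. The bin count of 1000 per axis below is purely illustrative, not from the original post:

```julia
# A dense 3D histogram stores one count per grid cell, zero or not.
# With 1000 bins per axis (an assumed, illustrative resolution) and
# Int64 counts, the output array alone needs:
nbins = 1000                      # bins per axis (assumption)
bytes = nbins^3 * sizeof(Int64)   # total memory of the dense 3D array
println(bytes / 1e9, " GB")       # prints: 8.0 GB
```

Doubling the per-axis resolution multiplies this by 8, which matches the "skyrockets very quickly" behaviour.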

I am sure I can formulate a better method, because the output I need is very specialized: I only need the non-zero elements of the histogram, and in no particular order. Therefore my output only needs to be a 1D Vector (instead of a 3D Array), and I do not need to save the edges, or which point in the grid each histogram entry corresponds to, etc. In addition, the data is sparse in the sense that more than half the entries of the 3D histogram are zero. (I am NOT working with a SparseMatrix.)

I now want to create such an algorithm/approach, but I am wondering what the best way to do it is:

One approach I thought of is to create a dictionary keyed by the triplet of bin indices (determined by each point’s position relative to the edges). At each step I check whether that key exists; if it does, I add 1 to its value, otherwise I insert it with value 1. Afterwards, I simply collect all the values into a 1D array.
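The Dict approach above can be sketched in a few lines. This is a minimal illustration, not tested against the actual data; the function name and the assumption that each edge vector is sorted and covers the data range are mine:

```julia
# Sketch of the Dict-based sparse histogram (names are illustrative).
# Assumes each edges vector is sorted and spans the data range.
function sparse_hist(xs, ys, zs, xedges, yedges, zedges)
    counts = Dict{NTuple{3,Int},Int}()
    for (x, y, z) in zip(xs, ys, zs)
        # searchsortedlast returns the index of the largest edge <= the
        # coordinate, i.e. the bin the point falls into
        key = (searchsortedlast(xedges, x),
               searchsortedlast(yedges, y),
               searchsortedlast(zedges, z))
        # add 1, defaulting to 0 if the key is new
        counts[key] = get(counts, key, 0) + 1
    end
    return collect(values(counts))   # 1D Vector of the non-zero counts
end
```

Since a `Dict` lookup is amortized O(1), this should be roughly linear in the number of datapoints, and memory scales with the number of *occupied* bins rather than the full grid.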

The thing is, I have a hunch that this will be very slow. I am also looking for other opinions before I implement it; maybe somebody has a smarter approach.


#2

If your edges are suitably distributed, you could start by building a fixed-depth octree. These are fast, and used a lot in computational physics, so you should be able to find instructive code (sorry, I don’t have a useful example to show).


#3

Ah, I wish I knew how to actually use these structures. I am searching for a tutorial, but I cannot seem to understand them yet. There is a Julia package that implements the basic datatype, RegionTrees.jl, but I have not yet figured out how to apply it to my case.
Thanks for the reply!


#4

It looks like you could build a “refinery” for the RegionTrees API where cell data is an array of datapoint locations, and a cell is split only if it is not empty and still too big. While splitting, redistribute datapoints to appropriate children. After the octree is built, walk the tree to build your histogram vectors. I’ll defer to @rdeits as to whether this is a good idea.
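If I’m reading the RegionTrees API right, the refinery described above might look roughly like this. This is an untested sketch: the `HistRefinery` type, the `minwidth` parameter, and the point-in-box predicate are my own inventions, and points landing exactly on an upper boundary are not handled:

```julia
using RegionTrees, StaticArrays
import RegionTrees: AbstractRefinery, needs_refinement, refine_data

# Hypothetical refinery: split a cell only if it is non-empty and
# still wider than the target bin size `minwidth`.
struct HistRefinery <: AbstractRefinery
    minwidth::Float64
end

function needs_refinement(r::HistRefinery, cell)
    !isempty(cell.data) && maximum(cell.boundary.widths) > r.minwidth
end

# On a split, hand each child the datapoints that fall inside it.
function refine_data(r::HistRefinery, cell, indices)
    b = child_boundary(cell, indices)
    lo, hi = b.origin, b.origin .+ b.widths
    filter(p -> all(lo .<= p .< hi), cell.data)
end

# Cell data is the array of datapoint locations, as suggested above.
points = [SVector(rand(), rand(), rand()) for _ in 1:1000]
root = Cell(SVector(0.0, 0.0, 0.0), SVector(1.0, 1.0, 1.0), points)
adaptivesampling!(root, HistRefinery(1 / 16))

# Walking the tree: the histogram vector is just the counts
# in the non-empty leaves, in no particular order.
hist = [length(leaf.data) for leaf in allleaves(root) if !isempty(leaf.data)]
```

Because empty cells are never split, the tree (and the final walk) only pays for the occupied regions of the grid.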


#5

I think that’s exactly the right idea. I’ve only used RegionTrees for one application, so the API may not be quite right yet, but I think it should work.


#6

@rdeits @Ralph_Smith Care to share some hints or guidelines on how to do it? I don’t see a “good way”. For every single point, do I first have to check whether a fully refined partition into the smallest-size cubes already exists? If not, I create it; if yes, I increment the value where I need it.

Or are you saying to make the data of the leaves be a vector of 3-tuples? But then, where do I store the values of the histogram? Remember that each 3-tuple must map to an integer giving the number of datapoints in the corresponding bin (the one equivalent to that 3-tuple).