Memory-cheap way to calculate a 1D-formatted, non-zero-only histogram of sparse data?

question

#1

Hello,

I’ve noticed that I run out of memory extremely quickly when calculating a high-resolution histogram of three-dimensional data (using StatsBase). I guess the reason is that the size of the output is proportional to the product of the lengths of the individual edge vectors (I provide the edges), which skyrockets very quickly at the resolution I require.
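A quick back-of-envelope calculation shows why the dense output blows up. The bin count of 1000 per axis below is purely illustrative, not from the original post:

```julia
# A dense 3D histogram stores one count per grid cell, zero or not.
# With 1000 bins per axis (an assumed, illustrative resolution) and
# Int64 counts, the output array alone needs:
nbins = 1000                      # bins per axis (assumption)
bytes = nbins^3 * sizeof(Int64)   # total memory of the dense 3D array
println(bytes / 1e9, " GB")       # prints: 8.0 GB
```

Doubling the per-axis resolution multiplies this by 8, which matches the "skyrockets very quickly" behaviour.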

I am sure I can formulate a better method, because the output I need is very specialized: I only need the non-zero elements of the histogram, and in no particular order. Therefore my output only needs to be a 1D Vector (instead of a 3D Array), and I do not need to save the edges, or which point in the grid each histogram entry corresponds to, etc. In addition, the data is sparse in the sense that more than half the entries of the 3D histogram are zero. (I am NOT working with a SparseMatrix.)

I now want to create such an algorithm/approach, but I am wondering what the best way to do it is:

One approach I thought of is to create a dictionary keyed by the triplet of bin indices (determined by each point’s position relative to the edges). At each step I check whether that key exists; if it does, I add 1 to its value, otherwise I insert it with value 1. Afterwards, I simply collect all the values into a 1D array.
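The Dict approach above can be sketched in a few lines. This is a minimal illustration, not tested against the actual data; the function name and the assumption that each edge vector is sorted and covers the data range are mine:

```julia
# Sketch of the Dict-based sparse histogram (names are illustrative).
# Assumes each edges vector is sorted and spans the data range.
function sparse_hist(xs, ys, zs, xedges, yedges, zedges)
    counts = Dict{NTuple{3,Int},Int}()
    for (x, y, z) in zip(xs, ys, zs)
        # searchsortedlast returns the index of the largest edge <= the
        # coordinate, i.e. the bin the point falls into
        key = (searchsortedlast(xedges, x),
               searchsortedlast(yedges, y),
               searchsortedlast(zedges, z))
        # add 1, defaulting to 0 if the key is new
        counts[key] = get(counts, key, 0) + 1
    end
    return collect(values(counts))   # 1D Vector of the non-zero counts
end
```

Since a `Dict` lookup is amortized O(1), this should be roughly linear in the number of datapoints, and memory scales with the number of *occupied* bins rather than the full grid.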

The thing is, I have a hunch that this will be very slow. I am also looking for other opinions before I implement it; maybe somebody has a smarter approach.


#2

If your edges are suitably distributed, you could start by building a fixed-depth octree. These are fast, and used a lot in computational physics, so you should be able to find instructive code (sorry, I don’t have a useful example to show).


#3

Ah, I wish I knew how to actually use these structures. I am searching for a tutorial, but I cannot seem to understand them yet. There is a Julia package that implements the basic datatype, RegionTrees.jl, but I have not yet figured out how to apply it to my case.
Thanks for the reply!


#4

It looks like you could build a “refinery” for the RegionTrees API where cell data is an array of datapoint locations, and a cell is split only if it is not empty and still too big. While splitting, redistribute datapoints to appropriate children. After the octree is built, walk the tree to build your histogram vectors. I’ll defer to @rdeits as to whether this is a good idea.
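If I’m reading the RegionTrees API right, the refinery described above might look roughly like this. This is an untested sketch: the `HistRefinery` type, the `minwidth` parameter, and the point-in-box predicate are my own inventions, and points landing exactly on an upper boundary are not handled:

```julia
using RegionTrees, StaticArrays
import RegionTrees: AbstractRefinery, needs_refinement, refine_data

# Hypothetical refinery: split a cell only if it is non-empty and
# still wider than the target bin size `minwidth`.
struct HistRefinery <: AbstractRefinery
    minwidth::Float64
end

function needs_refinement(r::HistRefinery, cell)
    !isempty(cell.data) && maximum(cell.boundary.widths) > r.minwidth
end

# On a split, hand each child the datapoints that fall inside it.
function refine_data(r::HistRefinery, cell, indices)
    b = child_boundary(cell, indices)
    lo, hi = b.origin, b.origin .+ b.widths
    filter(p -> all(lo .<= p .< hi), cell.data)
end

# Cell data is the array of datapoint locations, as suggested above.
points = [SVector(rand(), rand(), rand()) for _ in 1:1000]
root = Cell(SVector(0.0, 0.0, 0.0), SVector(1.0, 1.0, 1.0), points)
adaptivesampling!(root, HistRefinery(1 / 16))

# Walking the tree: the histogram vector is just the counts
# in the non-empty leaves, in no particular order.
hist = [length(leaf.data) for leaf in allleaves(root) if !isempty(leaf.data)]
```

Because empty cells are never split, the tree (and the final walk) only pays for the occupied regions of the grid.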


#5

I think that’s exactly the right idea. I’ve only used RegionTrees for one application, so the API may not be quite right yet, but I think it should work.


#6

@rdeits @Ralph_Smith Care to share some hints or guidelines on how to do it? I don’t see a “good way”. For every single point, do I first have to check whether a fully refined partition into the smallest-size cubes already exists? If not, I create it; if yes, I increment the value where I need it.

Or are you saying to make the data of the leaves be a vector of 3-tuples? But then, where do I store the values of the histogram? Remember that each 3-tuple must map to an integer giving the number of datapoints in the corresponding bin (the one equivalent to that 3-tuple).