I’ve noticed that I run out of memory extremely quickly when calculating a high-resolution histogram of three-dimensional data (using StatsBase). I guess the reason is that the size of the output is proportional to the product of the lengths of the individual edge vectors (I provide the edges), which skyrockets very quickly at the high resolution I need.
I am sure I can formulate a better method, because the output I need is very specialized: I only need the non-zero elements of the histogram, and in no particular order. Therefore my output only needs to be a 1D Vector (instead of a 3D Array), and I do not need to save the edges, or which point of the grid each histogram entry corresponds to, etc. In addition, the data is sparse in the sense that more than half the entries of the 3D histogram are zero. (I do NOT have
I now want to create such an algorithm, but I am wondering what the best way to do it is:
One approach I thought of is to create a dictionary, with the triplet of bin indices (determined by where the point falls between the edges) as the key. In each step I check whether this key exists; if it does, I add +1 to its value, otherwise I initialize it to 1. At the end, I simply collect all the values into a 1D vector.
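A minimal sketch of what I have in mind (function name and argument layout are just illustrative, not from any package; `searchsortedlast` is used to locate the bin for each coordinate, assuming sorted edges):

```julia
# Dictionary-based sparse 3D histogram: count points per occupied bin only.
# xs, ys, zs are the point coordinates; xedges, yedges, zedges are sorted bin edges.
function sparse_hist3d(xs, ys, zs, xedges, yedges, zedges)
    counts = Dict{NTuple{3,Int},Int}()
    for (x, y, z) in zip(xs, ys, zs)
        # searchsortedlast returns the index of the last edge <= value,
        # i.e. the bin index for a value inside the edge range
        i = searchsortedlast(xedges, x)
        j = searchsortedlast(yedges, y)
        k = searchsortedlast(zedges, z)
        # skip points that fall outside the edges
        (1 <= i < length(xedges) && 1 <= j < length(yedges) &&
         1 <= k < length(zedges)) || continue
        key = (i, j, k)
        counts[key] = get(counts, key, 0) + 1
    end
    # only the non-zero counts, in no particular order
    return collect(values(counts))
end
```

Memory then scales with the number of *occupied* bins rather than the full grid, which is exactly the sparsity I have.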
The thing is, I have a hunch that this will be very slow. So before I implement it, I am looking for other opinions; maybe somebody has a smarter approach.