I have a big map-reduce operation on a large radio telescope FFT dataset, which gathers counts to multiple histograms for maybe 11-12 metrics. Now that this is working, I want to see about capturing the intersection of these patterns better.

Just thinking out loud - it seems that my options are either to capture if flat like[bin1ID, bin2ID, bin3ID, ...] and aggregate later, or increment the weights directly into some kind of 12-dimensional tensor. I have clear bins defined by the underlying theories (Poisson point process, etc), so there’s no need to dynamically bin. The map-reduce is parallel, so a remote worker’s data must be merged with a master copy; this works great so far with StatsBase.Histogram.

Any bright ideas for an efficient data structure for this? Thanks for your thoughts.

A combination of JuliaDB and OnlineStats could be useful here: http://juliadb.org/latest/manual/onlinestats.html, in particular the Hist() series from OnlineStats, but I’m not too familiar with OnlineStats, so I can’t be more specific.

If you want to have an object with all possible combination and their counts, you could first add 12 columns to your data to represent the various bin identities and maybe for commodity a columns of ones and then:

should do the trick. As a final data structure, you could choose between a table with bin identities as primary keys or an NDSparse which is probably more suitable for you. You would then access the result with df[bin1, bin2, ...] (you can also use : for slicing). Using reducedim(+,...) you could then get the count for interactions with less terms. Note that, if you start with DNDSparse instead of NDSparse, all these operations will be done in a distributed way.

Thanks @piever - I think I should have mentioned that the original data are very large, about a hundred million rows for six hours of sampling/observation. They’re derivations of the FFT which explode in size.

So I really don’t want it in memory and wonder about whether it’s wise to manipulate whole columns like that. My approach so far is to have a mapper job run through one sample at a time, do the transformations, and save off the histograms to send to reduce. Maybe though your point can be done for individual map worker jobs, getting those counts and sending them back to the reduce mothership.

Even though the syntax makes it look like I’m working with columns, for operations that do not require the whole column (such as reduce and company) this should work out-of-core as well (without loading the whole thing in memory) more or less automatically (though I don’t have much experience with it as I tend to work with smaller datasets).

I’d actually be very curious if you gave it a shot, at least groupreduce(+, t, :bin1ID, select = :ones) should just work out-of-core: http://juliadb.org/latest/manual/out-of-core.html#Processing-scheme-1, though things may get more complicated if the number of groups explodes with all the possible combinations of bin identities for the 12 types of binning. Same goes for adding columns that are a “row-wise” function of other columns (I believe). If it does not work out-of-the box, maybe worth trying @applychunked from JuliaDBMeta

Regarding the storage of the result of the binning, if the outcome is spare (meaning only a very small fraction of the N^12 possible binning combinations is actually used), DNDSparse could be a good way of saving that, even for large datasets.

I think a trie-like structure might be possible, there’s just quite a few possible orderings to choose from. Doing histograms of pairs of histograms 2d bin counts(?) for random samples may show some common pairs, so splitting or grouping in a randomly ordered 12 char trie sense may work. You could also get a sort of binary-pair feature that seems interesting, a bit like a selective two-hot encoding, that could represent 64 independent common-pair true/false values as a Uint64 per row.

Thanks @y4lu - so how would this 12 char trie play out? Say I have a single sample that is represented in 12 variables; I find the bin for all 12, but then where do I increment this 12-d histogram counter? Are you saying there is a 2d counter array in each trie node, and the trie is used to find that array?

How would that compare in speed to a tensor, Array{Int, 12} ?

I think 12 would be too deep, but three or four tries to cover a few dimensions each, maybe, it’s a little bit sketchy. So with a 12 bin sample, it would run a 2d counter (single increment on a bin 1 width x bin 2 width matrix) over each pair of bins, for ~60 increments @ 1/matrix per sample. And then after a number of samples, it should have collected some weighted-graph-like statistics to perhaps base/inform a trie or other structure.

-Added, perhaps the ordering is not as important as I thought, and if so that should make it a bit easier, it could possibly get away with 12 2d count matrices instead of 60. Eg if a1 links to b2 and b2 to c3, both with some number of counts, a1 to c3 may be roughly approximated, though it would need all a1 to bx, bx to c3 counts, not just b2 (possibly less relevent for tries). For a 4x4x4x4 set, memory requirement would be between 3x4x4 and 6x4x4, so there’s probably quite a saving there as the dimensions grow

This is a bit of a hybrid bloom-filter / multi-trie like base structure
Has a fast-ish? / short-circuit test of membership per set of bloom tables
It should be similar to a proper bloom filter minus the hashing when the tables are overlayed (sparseness
requirements?). May suffer from a lack of precise recall given a set of trie/bloom-table memberships, will probably tell you it’s one of x value combos
Testing with windmill data from https://challengedata.ens.fr/en/home