A histogram that averages metadata instead of counts

I have a bunch of coordinates and corresponding values:

n = 100
xy = [rand(2) for _ in 1:n]
v = rand(n)

I want to 2D-bin the coordinates and calculate, for each bin, the mean of all the values v whose coordinates fall into that bin.

For example, if only these two coordinate-value pairs fell into the same bin (e.g. the bin with edges ([0, 0.2), [0, 0.2))):

xy1 = (0.1, 0.1) 
v1 = 0.0

xy2 = (0.01, 0.19)
v2 = 1.0

then I expect the bin containing them to hold the mean of their values: (0 + 1)/2 = 0.5.

Do you know of any “packaged” operation that can accomplish this?

The only way I can think of is this:
In the standard form of a histogram each bin contains the count of the points that fall into it (ignoring the values), and in the weighted form of a histogram (see docs here), each bin contains the sum of the values. So I could calculate both the weighted and the unweighted histograms, divide the weighted by the unweighted, and get what I want:

using StatsBase
x = first.(xy)
y = last.(xy)
edges = (0:0.2:1, 0:0.2:1)
uw = fit(Histogram, (x, y), edges)             # counts per bin
w = fit(Histogram, (x, y), weights(v), edges)  # sums of v per bin
h = w.weights ./ uw.weights                    # means of v per bin
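One wrinkle with this approach: bins that receive no samples have a count of zero, so the division produces NaN there (0/0). If NaNs are unwanted, they can be replaced, e.g.:

h = replace(w.weights ./ uw.weights, NaN => missing)  # empty bins become missing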

Hi @yakir12, you seem to be describing something along the lines of the EmpiricalHistogram in GeoStats.jl. It takes the coordinates of the samples into account to correct for clustering effects, and it is defined for N-dimensional spatial domains, not only 2D. Suppose you have some data in a vector z and coordinates in a 2D matrix (rows are dimensions (x, y, z, …) and columns are samples). The first thing you need to do is georeference the data:

using GeoStats

# data and coordinates
z = rand(100)
X = rand(2, 100)

# spatial data
𝒟 = georef((z=z,), X)
100 PointSet{Float64,2}
  variables
    └─z (Float64)

Now GeoStats.jl is aware of where the samples are in a given spatial domain and can do very fancy stuff with this dataset that wouldn’t be possible without the coordinates. You can compute the histogram of the z variable with:

h = EmpiricalHistogram(𝒟, :z)

And optionally pass the size of “bins”:

h = EmpiricalHistogram(𝒟, :z, 0.1)

There is a lot going on in this procedure. If you prefer to do things by hand, you can leverage the partitioning algorithms in the framework. For example, you can partition your spatial dataset with blocks (or bins) of given size:

ℬ = partition(𝒟, BlockPartitioner(0.1,0.2))
50 SpatialPartition
  N° points
  └─2
  └─2
  └─1
  └─2
  └─2
  ⋮
  └─1
  └─1
  └─1
  └─1
  └─1
  metadata: neighbors

And each block is a spatial dataset:

ℬ[1]
2 DomainView{Float64,2}
  variables
    └─z (Float64)

That means you can apply normal functions on a per-block basis:

μ₁ = mean(ℬ[1][:z])
0.49119514111824636

μ₂ = mean(ℬ[2][:z])
0.39853597367487636
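Or, to apply the function to all blocks at once, a simple comprehension does it:

μs = [mean(b[:z]) for b in ℬ]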

For more information about these procedures, you can watch the tutorial on spatial declustering.


Wow, looks perfect! Thanks!


Sorry @juliohm, but I’m not able to get the results I’m after using your excellent GeoStats.

In my implementation:

using StatsBase
n = 100
v = rand(n)
x = rand(n)
y = rand(n)
edges = (0:0.2:1, 0:0.2:1)
uw = fit(Histogram, (x, y), edges)
w = fit(Histogram, (x, y), weights(v), edges)
h1 = w.weights./uw.weights

the end result is a matrix with the means of the binned values:

julia> h1 = w.weights./uw.weights
5×5 Array{Float64,2}:
 0.660777  0.614281  0.559278  0.578434  0.506532
 0.508253  0.551131  0.605211  0.459509  0.419349
 0.551241  0.441285  0.374708  0.30565   0.495972
 0.535858  0.613469  0.479945  0.609695  0.569991
 0.404576  0.703201  0.696381  0.596493  0.519485

but all I’m getting from EmpiricalHistogram:

using GeoStats
X = hcat(x, y)'
𝒟 = georef((v=v,), X)
h = EmpiricalHistogram(𝒟, :v)
h2 = h.weights

is a vector:

julia> h2 = h.weights
5-element Array{Float64,1}:
 0.6048951048951048
 0.7972027972027969
 1.0104895104895102
 0.8146853146853144
 0.7727272727272726

I’m sure I’m missing something major…

The EmpiricalHistogram is just a weighted histogram (of counts), with weights computed to adjust for clustering effects. Reading your post again, I noticed that you are interested in a histogram where the heights of the bins are the averages of the measurements, as opposed to the counts. In this case, the approach with the BlockPartitioner would be appropriate. However, if your solution with built-in histograms works fine, I don’t see a reason to add GeoStats.jl as a dependency, especially if the coordinates in your use case are not geospatial coordinates.

I don’t know what application you have in mind, but since you are computing density ratios (w.weights ./ uw.weights), I can also share another package that I contributed, which is a dependency of GeoStats.jl: DensityRatioEstimation.jl

Thank you for replying so quickly!

I tried that as well without success. Can you please show me how I’d get the same matrix (using my MWE)?

It would be wonderful to at least check that we get the same results. Adding another dependency is not a major concern of mine in this specific application. And the fact of the matter is that the coordinates are from trajectories in real-world units (cm), so I might find other functionality in GeoStats useful, if not now then in the future. The v in my case is speed: how fast the animal moves along these trajectories.

Seems like overkill; why would I need to optimize anything here? At the end of the day this is just a for loop that checks which bin a datum falls into and updates that bin’s content accordingly.
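Something like this sketch, using the variables from my MWE above (points landing exactly on a bin edge may be assigned differently than by StatsBase due to floating-point rounding):

sums   = zeros(5, 5)
counts = zeros(Int, 5, 5)
for k in 1:n
    # map a coordinate in [0, 1) to its bin index for edges 0:0.2:1
    i = clamp(floor(Int, x[k] / 0.2) + 1, 1, 5)
    j = clamp(floor(Int, y[k] / 0.2) + 1, 1, 5)
    sums[i, j]   += v[k]
    counts[i, j] += 1
end
h_manual = sums ./ counts  # NaN where a bin is empty, same as 0/0 above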


It seems that your bins are fixed, like in a regular grid, as opposed to moving with the data, correct? The BlockPartitioner finds the extent of the spatial dataset and then splits it into blocks. If the extent changes (e.g. because the particles are moving), then the binning would also change. I don’t know if that is acceptable or appropriate for your use case.

From your first post I understood that you wanted to compute a mean per block. Based on your last snippet of code, I would do it as follows:

using GeoStats

n = 100
v = rand(n)
x = rand(n)
y = rand(n)

𝒟 = georef((v=v,), hcat(x,y)')

ℬ = partition(𝒟, BlockPartitioner(0.1,0.1))

μs = [mean(b[:v]) for b in ℬ]
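Note that ℬ is a flat collection of blocks, so μs is a vector whose length depends on how many blocks the extent of the data produces; it will not line up one-to-one with the fixed 5×5 matrix from your StatsBase approach.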

You could also reassign the velocity of each particle in a block with:

for b in ℬ
  b[:v] = mean(b[:v])
end

In case the binning needs to be fixed for all iterations of the loop, you could fix the spatial domain (like in an image) and activate the pixels of the image whenever a particle is at a given location. That is, you could discretize the problem. I am not sure it is a good solution for your application, though. If it sounds useful, I can point you to some code that does it.
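To make the idea concrete, here is a rough sketch of that discretization (the 5×5 grid over the unit square and the binidx helper are placeholders for illustration):

binidx(c) = clamp(floor(Int, c / 0.2) + 1, 1, 5)  # hypothetical bin-index helper
occupied = falses(5, 5)            # fixed 5×5 "image" over the unit square
for (xk, yk) in zip(x, y)          # particle positions at one iteration
    occupied[binidx(xk), binidx(yk)] = true   # activate the pixel
end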