A histogram that averages metadata instead of counts

I have a bunch of coordinates and corresponding values:

n = 100
xy = [rand(2) for _ in 1:n]
v = rand(n)

I want to 2D-bin the coordinates and calculate, for each bin, the mean of all the values v whose coordinates fall into that bin.

For example, if only these two coordinate-value pairs fell into the same bin (e.g. the bin with edges ([0, 0.2), [0, 0.2))):

xy1 = (0.1, 0.1) 
v1 = 0.0

xy2 = (0.01, 0.19)
v2 = 1.0

then I expect the bin containing them to hold the mean of their values: (0 + 1)/2 = 0.5.

Do you know of any “packaged” operation that can accomplish this?

The only way I can think of is this:
In the standard form of a histogram each bin contains the count of the points that fall into it (ignoring the values), and in the weighted form of a histogram (see docs here), each bin contains the sum of the values. So I could calculate both the weighted and the unweighted histograms, divide the weighted by the unweighted, and get what I want:

using StatsBase
x = first.(xy)
y = last.(xy)
edges = (0:0.2:1, 0:0.2:1)
uw = fit(Histogram, (x, y), edges)             # counts per bin
w = fit(Histogram, (x, y), weights(v), edges)  # sums of v per bin
h = w.weights ./ uw.weights                    # means of v per bin
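One wrinkle with this approach: bins that receive no samples have a count of zero, so the division produces NaN there (0/0). If NaNs are unwanted, they can be replaced, e.g.:

h = replace(w.weights ./ uw.weights, NaN => missing)  # empty bins become missing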

Hi @yakir12, you seem to be describing something along the lines of the EmpiricalHistogram in GeoStats.jl. It takes the coordinates of the samples into account to correct for clustering effects, and it is defined for N-dimensional spatial domains, not only 2D. Suppose you have some data in a vector z and coordinates in a 2D matrix (rows are dimensions (x, y, z, …) and columns are samples). The first thing you need to do is georeference the data:

using GeoStats

# data and coordinates
z = rand(100)
X = rand(2, 100)

# spatial data
𝒟 = georef((z=z,), X)
100 PointSet{Float64,2}
  variables
    └─z (Float64)

Now GeoStats.jl is aware of where the samples are in a given spatial domain and can do very fancy stuff with this dataset that wouldn’t be possible without the coordinates. You can compute the histogram of the z variable with:

h = EmpiricalHistogram(𝒟, :z)

And optionally pass the size of “bins”:

h = EmpiricalHistogram(𝒟, :z, 0.1)

There is a lot going on in this procedure. If you prefer to do things by hand, you can leverage the partitioning algorithms in the framework. For example, you can partition your spatial dataset with blocks (or bins) of given size:

ℬ = partition(𝒟, BlockPartitioner(0.1,0.2))
50 SpatialPartition
  N° points
  └─2
  └─2
  └─1
  └─2
  └─2
  ⋮
  └─1
  └─1
  └─1
  └─1
  └─1
  metadata: neighbors

And each block is a spatial dataset:

ℬ[1]
2 DomainView{Float64,2}
  variables
    └─z (Float64)

That means you can apply normal functions on a per-block basis:

μ₁ = mean(ℬ[1][:z])
0.49119514111824636

μ₂ = mean(ℬ[2][:z])
0.39853597367487636
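Or, to apply the function to all blocks at once, a simple comprehension does it:

μs = [mean(b[:z]) for b in ℬ]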

For more information about these procedures, you can watch the tutorial on spatial declustering.


Wow, looks perfect! Thanks!


Sorry @juliohm, but I’m not able to get the results I’m after using your excellent GeoStats.

In my implementation:

using StatsBase
n = 100
v = rand(n)
x = rand(n)
y = rand(n)
edges = (0:0.2:1, 0:0.2:1)
uw = fit(Histogram, (x, y), edges)
w = fit(Histogram, (x, y), weights(v), edges)
h1 = w.weights./uw.weights

the end result is a matrix with the means of the binned values:

julia> h1 = w.weights./uw.weights
5×5 Array{Float64,2}:
 0.660777  0.614281  0.559278  0.578434  0.506532
 0.508253  0.551131  0.605211  0.459509  0.419349
 0.551241  0.441285  0.374708  0.30565   0.495972
 0.535858  0.613469  0.479945  0.609695  0.569991
 0.404576  0.703201  0.696381  0.596493  0.519485

but all I’m getting from EmpiricalHistogram:

using GeoStats
X = hcat(x, y)'
𝒟 = georef((v=v,), X)
h = EmpiricalHistogram(𝒟, :v)
h2 = h.weights

is a vector:

julia> h2 = h.weights
5-element Array{Float64,1}:
 0.6048951048951048
 0.7972027972027969
 1.0104895104895102
 0.8146853146853144
 0.7727272727272726

I’m sure I’m missing something major…

The EmpiricalHistogram is just a weighted histogram (of counts), with weights computed to adjust for clustering effects. Reading your post again, I noticed that you are interested in a histogram where the heights of the bins are the averages of the measurements, as opposed to the counts. In this case, the approach with the BlockPartitioner would be appropriate. However, if your solution with built-in histograms works fine, I don’t see a reason to add GeoStats.jl as a dependency, especially if the coordinates in your use case are not geospatial coordinates.

I don’t know what application you have in mind, but since you are computing density ratios (w.weights ./ uw.weights), I can also share another package that I contributed, which is a dependency of GeoStats.jl: DensityRatioEstimation.jl

Thank you for replying so quickly!

I tried that as well without success. Can you please show me how I’d get the same matrix (using my MWE)?

It would be wonderful to at least check that we get the same results. Adding another dependency is not a major concern of mine in this specific application. And the fact of the matter is that the coordinates are from trajectories in real-world units (cm), so I might find other functionality in GeoStats useful, if not now then in the future. The v in my case is speed: how fast the animal moves along these trajectories.

Seems like overkill; why would I need to optimize anything here? At the end of the day this is just a for loop that checks which bin a datum falls into and updates that bin’s content accordingly.
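Something like this sketch, using the variables from my MWE above (points landing exactly on a bin edge may be assigned differently than by StatsBase due to floating-point rounding):

sums   = zeros(5, 5)
counts = zeros(Int, 5, 5)
for k in 1:n
    # map a coordinate in [0, 1) to its bin index for edges 0:0.2:1
    i = clamp(floor(Int, x[k] / 0.2) + 1, 1, 5)
    j = clamp(floor(Int, y[k] / 0.2) + 1, 1, 5)
    sums[i, j]   += v[k]
    counts[i, j] += 1
end
h_manual = sums ./ counts  # NaN where a bin is empty, same as 0/0 above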


It seems that your bins are fixed, like in a regular grid, as opposed to moving with the data, correct? The BlockPartitioner finds the extent of the spatial dataset and then splits it into blocks. If the extent changes (e.g. because the particles are moving), then the binning would also change. I don’t know if that is acceptable or appropriate for your use case.

From your first post I understood that you wanted to compute a mean per block. Based on your last snippet of code, I would do it as follows:

using GeoStats

n = 100
v = rand(n)
x = rand(n)
y = rand(n)

𝒟 = georef((v=v,), hcat(x,y)')

ℬ = partition(𝒟, BlockPartitioner(0.1,0.1))

μs = [mean(b[:v]) for b in ℬ]
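Note that ℬ is a flat collection of blocks, so μs is a vector whose length depends on how many blocks the extent of the data produces; it will not line up one-to-one with the fixed 5×5 matrix from your StatsBase approach.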

You could also reassign the velocity of each particle in a block with:

for b in ℬ
  b[:v] = mean(b[:v])
end

In case the binning needs to be fixed for all iterations of the loop, you could fix the spatial domain (like in an image) and activate the pixels of the image whenever a particle is at a given location. That is, you could discretize the problem. I am not sure it is a good solution for your application, though. If it sounds useful, I can point you to some code that does it.
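To make the idea concrete, here is a rough sketch of that discretization (the 5×5 grid over the unit square and the binidx helper are placeholders for illustration):

binidx(c) = clamp(floor(Int, c / 0.2) + 1, 1, 5)  # hypothetical bin-index helper
occupied = falses(5, 5)            # fixed 5×5 "image" over the unit square
for (xk, yk) in zip(x, y)          # particle positions at one iteration
    occupied[binidx(xk), binidx(yk)] = true   # activate the pixel
end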