Plots: How to create a histogram such that sum of bar heights =1


#1

I want to use the Plots package to create a histogram such that the heights of the bars sum to one (preferably for each series). I was hopeful that normalize=true would do the trick, but it instead makes the areas of the bars sum to 1, i.e. it produces the equivalent of a PDF.

 histogram([[1,2,2,3,3,3],[6,6,6,6,7,7,7,7,7,8]] something)

should thus produce a histogram with two series. In the first series, there should be a bar at 1 with height 1/6, a bar at 2 with height 1/3 and a bar at 3 with height 1/2. Similarly, for the second series there should be a bar at 6 with height 0.4, a bar at 7 with height 0.5 and a bar at 8 with height 0.1.
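
For reference, the target bar heights can be worked out by hand in plain Julia. This is only a sketch: `proportions_by_value` is a hypothetical helper name (it is not part of Plots or StatsBase), and it assumes every distinct value gets its own bar.

```julia
# Hypothetical helper: fraction of samples at each distinct value,
# so the returned heights sum to 1 within a series.
function proportions_by_value(x)
    vals = sort(unique(x))
    heights = [count(y -> y == v, x) / length(x) for v in vals]
    return vals, heights
end

series = [[1, 2, 2, 3, 3, 3], [6, 6, 6, 6, 7, 7, 7, 7, 7, 8]]
for s in series
    vals, heights = proportions_by_value(s)
    println(vals, " => ", heights)
end
```

The resulting pairs could then be handed to a bar plot, for example `bar(vals, heights)` in Plots, which is a workaround rather than a true histogram.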

I’m hoping there’s some simple way to do this. I looked at StatPlots but there did not seem to be something that exactly matched what I want.


#2

With PyPlot, you can use the weights keyword to accomplish this:

using PyPlot
weighted_hist(x; kws...) = PyPlot.plt[:hist](x; weights=ones(length(x))/length(x), kws...)

weighted_hist([1,2,2,3,3,3])
weighted_hist([6,6,6,6,7,7,7,7,7,8])

The next release of Matplotlib apparently will have a density=true keyword to accomplish this, but I’m not sure when it will be released.


#3

Very nice. Plots does not have a weights keyword for a histogram - it essentially supports the normalization modes that StatsBase does. I’ve never heard of the heights-sum-to-1 normalization before, but if it’s statistically valid I don’t see why we couldn’t add the option to normalize (@oschulz , what’s your opinion on this?).

@stevengj , wouldn’t the density=true keyword lead to the areas of bars, rather than the heights, equalling the samples in each bin? (like the Plots option :density)


#4

This seems to work well as of now:

PyPlot.plt[:hist](x,normed=true)

which the matplotlib manual says should accomplish

…the integral of the histogram will sum to 1.

/Paul S


#5

Thanks for the help here. I would love it if Plots could incorporate a weights factor. As additional inducement, one of your proprietary colleagues (Mathematica) allows a variety of “hspec” arguments that modify the heights of the bars. These include Count (which is the Julia Plots default), PDF (which I believe corresponds to Julia Plots normalize=true), Probability (which is what I want), and a bunch of others, including an arbitrary transformation of bar heights via a function. You can look at the Mathematica documentation at http://reference.wolfram.com/language/ref/Histogram.html. Not, of course, that Julia Plots has to copy the approach, but it does suggest that some people have found a broader set of options to be useful.


#6

Thanks. The problem is that I don’t want the integral to sum to 1. I want the bar heights to sum to one. Unless the bins happen to have a width of exactly 1, the two things will not be the same. See my earlier comments about the options offered in Mathematica. What I basically want is for someone to look at the height of a bar and say, “Oh, I see that 37% of the entities in that group had x-values around whatever.” As it stands, in Plots, when normalize=false, you get an absolute count. That’s fine sometimes, but other times, particularly when you have grouped data, you want a fraction rather than an absolute count. And in Plots, when normalize=true, it looks like you get bar heights such that the integral of the bars is equal to 1, which will not help the person who wants to know what percentage of the entities in a particular group had some x-value.

I’m going to try to attach some PNGs from a Mathematica notebook that try to explain the issue better.
[Image: mathematica histogram 1]
[Image: mathematica histogram 2]
[Image: mathematica histogram 3]


#7

Perhaps the easiest way right now is to code up the histogram calculations (an isInBin function…) and then use a bar plot (scaled in whatever way you want)?

/Paul S
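
A minimal sketch of that hand-rolled binning, in plain Julia with no packages. `bin_proportions` is a hypothetical name, the bins are assumed to be left-closed intervals `[edges[i], edges[i+1])`, and the edges are assumed to be sorted.

```julia
# Hypothetical helper: count samples per bin, then divide by the total
# number of samples. Samples outside the edges are not counted, so the
# heights sum to 1 only when every sample falls inside the edges.
function bin_proportions(x, edges)
    counts = zeros(Int, length(edges) - 1)
    for v in x
        i = searchsortedlast(edges, v)   # index of the last edge <= v
        if 1 <= i <= length(counts)
            counts[i] += 1
        end
    end
    return counts ./ length(x)
end

edges = 5.5:1.0:8.5
heights = bin_proportions([6, 6, 6, 6, 7, 7, 7, 7, 7, 8], edges)
println(heights)
```

The heights could then be drawn with a bar plot positioned at the bin midpoints, scaled in whatever way is wanted.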


#8

The easiest at the moment is to just use PyPlot, which has the option :) But of course there is always the approach of fitting a StatsBase.Histogram, extracting the weights field and transforming it. In fact, that is what Plots does behind the scenes.


#9

I should add that in addition to true and false, normalize takes :pdf, :density and :none. So the option is open to add :percentage or whatever would be a good name.


#10

Presumably someone already wrote up the histogram binning computations as part of Plots or some package it relies on. Anyone know where that code is located so I can avoid wheel reinvention?


#11

Yes, @oschulz did that. I’d like to hear his opinion before committing to adding this, as he spent a very long time thinking about the right options. The normalization code is here: https://github.com/JuliaPlots/Plots.jl/blob/master/src/recipes.jl#L539-L549


#12

If, like in the example, the histogram has equal width bars, then the only change necessary is to the y-axis. The easiest way would be to take the y-axis and change it manually to a rescaled one (the factor is simply the width of a bar in the histogram, since the histogram is normalized area and height = area / width-of-bar).
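
The rescaling argument can be checked with a few lines of arithmetic. This sketch uses the counts of the second series from the first post and an assumed common bin width of 0.5, chosen purely for illustration.

```julia
# Equal-width bins: a PDF-normalized height is count / (n * width),
# so multiplying by the bin width recovers the fraction of samples per bin.
counts = [4, 5, 1]     # samples per bin
width = 0.5            # assumed common bin width (illustrative)
n = sum(counts)

pdf_heights = counts ./ (n * width)    # bar areas sum to 1
proportions = pdf_heights .* width     # bar heights sum to 1

println(pdf_heights)
println(proportions)
```

With unequal bin widths there is no single factor, so relabeling the y-axis would no longer be enough.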


#13

Presumably someone already wrote up the histogram binning computations

Do you mean something like StatsBase.fit(Histogram,…)?


#14

@Seth_Chandler: As @mkborregaard already mentioned, Plots now relies on StatsBase for histogram building, including normalization, and therefore supports the normalization modes :none, :pdf and :density.

What you’re describing is probably not very meaningful, statistically, for a “true” histogram (meaning a categorization based on bins of equal or even different, and possibly automatically chosen, size); it fits better with a typical “bar plot”, where the categories might not even be numerical. I don’t think a sum-heights-to-one normalization, as you describe, would be accepted for StatsBase.Histogram: the StatsBase maintainers put a strong emphasis on mathematical rigor, and for good reason. A normalization that does not take bin width into account is not very meaningful for a “true” histogram, as the bin size is somewhat arbitrary (usually chosen depending on the number of data points and other criteria).

We had a long discussion about different kinds of “histogram-like” plots in the past (see https://github.com/JuliaPlots/Plots.jl/issues/223#issuecomment-232096408). Basically, “true” bin-width based histograms and other categorizations are different things, semantically. Either can be visualized as step-like or bar-like, but in most cases the step-like style fits the first better and the bar-like style the second.

Currently, Plots doesn’t really have great support for histogram-like categorizations that are not based on bin sizes, though maybe it should. IMHO this would require careful consideration and a clean concept behind it. @Seth_Chandler, could you tell us a bit more about your typical application(s) and the kind of categories you’re dealing with?


#15

We could consider taking the discussion to a Plots issue?

Anyway, until a new keyword is in place, the easiest approach is either to use @stevengj’s PyPlot method or to do a Plots workaround, either

  1. by fitting a StatsBase histogram, modifying the weights field and passing it to plot (there is a recipe for that),
  2. by plotting the histogram as normal and specifying different yticks, or
  3. by doing something hacky with modifying the plot object:
function prophist(x; kw...)
    s = length(x)
    h = histogram(x; kw...)
    h[1][1][:y] ./= s
    h[1][:yaxis].d[:extrema].emax /= s   #also change the ylimits
    h
end

prophist(randn(10000), bins = :scott)

#16

Happy to do so. First, let me thank everyone for their thoughtful responses to my query. I’m new to Julia and so a welcoming community is important to me. I come from 20+ years of working with Mathematica, some work with R, and am trying to diversify my language portfolio.

Let me tell you the context in which my request arose.

I do a lot of research on the Affordable Care Act. I have a dataframe in which we have columns such as (1) the year, (2) the geographic rating area, (3) the age of a person living in that geographic rating area, (4) the income of a person living in that geographic rating area and (5) the percentage of that person’s income they would need to contribute in order to purchase the second cheapest “Silver” plan sold in that geographic rating area.

I want to produce a graphic that basically compares the distribution of Column 5 among different values of Column 1. For what it’s worth, I particularly want to compare the distribution in 2014 and the distribution in 2017. I want to do so for various age-income subsets of the population. An example would be people who are 60 years old and whose income is 4.5 times the federal poverty level.

I’m now going to show you the graphic as Mathematica produces it. I do so not because I think that Julia has to replicate every feature of Mathematica but to suggest that some serious people think what I want to do is legitimate. (I’ll also show the code just because I think it’s interesting to see a lot of similarities between Julia Plots and Mathematica.)

 Histogram[
 Query[Select[#age == 60 && #fpl == 4.5 &] /* GroupBy[#year &], 
 All, #"contribution_pct" &][df], Automatic, "Probability", 
ChartLegends -> Automatic,
Frame -> True,
FrameLabel -> {"gross premium\nas fraction of income", 
"fraction of rating areas"}, 
PlotLabel -> "Fay: A 60 year old earning 450% FPL",
PlotTheme -> "Detailed"]

[Image: fayhistogram]

The problem with using absolute counts on the y-axis is that there were fewer plans sold in 2017 than in 2014. Therefore, in my opinion, a graphic that uses absolute counts would confuse the extent to which there has been a rightwards shift in the distribution in question between 2014 and 2017.

I also don’t think normalize=true will be particularly communicative to my audience, in that the y-axis values will not have great meaning to them. The values on the y-axis will be things like 15 and 20 because the x-domain is small. Those values don’t really have any clear meaning in this context to economists or policy makers.

What I do think makes sense is the graphic above. One can read it to see that about 42% of rating areas in 2014 required the 60 year old in question to pay about 13% of their income to purchase the second lowest silver plan and that in about 10% of the rating areas in 2017 that same 60 year old is required to pay about 23% of their income to do the same thing. Regardless of what one may think about the politics of it all, those numbers communicate the point I am trying to make.

To generalize, the situation in which I believe a sum of the bars = 1 for each distribution would make sense is one in which one is comparing two discrete distributions, particularly ones that have different sized domains and particularly ones in which the number of values from which each distribution is derived differ. Also, a traditional bar graph may not work because the x-axis should be numeric, not categorical. (Maybe there is some option to the bar graph plotting routines that I did not notice??)

As I said, I am new to Julia and there may well already be a way of doing what I want. If not, though, it seems (to me) a sensible thing to desire. And, given my admiration for the Plots package and its cousins, that would seem a helpful place in which to put the functionality.

Thanks.


#17

I find that argument convincing. It is intuitively just the proportion of values in each bin, rather than the count. Another important difference to :pdf is that these wouldn’t be controlled for bin width for uneven bins (larger bins would have a bigger proportion).
I’d suggest trying to open a PR on StatsBase with this argumentation (the code is here: https://github.com/JuliaStats/StatsBase.jl/blob/master/src/hist.jl#L387-L433). If they are against it, I’d suggest we could put a workaround in Plots, but the other seems like the right way to go.


#18

That sounds sensible, but I am not familiar enough with the Julia development process to do this. (For example, I don’t know what a PR is.) Might someone else take the lead?


#19

A PR is a pull request - here it is: https://github.com/JuliaStats/StatsBase.jl/pull/293


#20

It’s indeed something people may want to plot. Personally, I would use PDF normalization in @Seth_Chandler’s example above, so that the scale of the y-axis wouldn’t depend on the bin width chosen for “gross premium” (as this is an arbitrary choice). I guess it depends on the target audience though. In cases like this, said audience might include at least some people with a less, ah, statistics-oriented mindset. ;)

I’m skeptical about whether a normalization like “sum(bin_values) == 1” would be accepted into StatsBase, because it has little mathematical meaning (@nalimilan, @ararslan, what’s your take on this?). Personally, I wouldn’t be opposed to it, as I don’t think it would hurt to have it available.

If not in StatsBase, we could offer it as an additional normalization in Plots - but it’s a bit of a hacky solution, since we now have this nice streamlined integration with StatsBase.Histogram. We can certainly think about it, though.

In any case, I can offer a quick solution that’ll work right now:

using StatsBase, Plots
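data = 0.2 .+ randn(1000)/20   # broadcast the scalar addition over the array
h = fit(Histogram{Float64}, data, 0:0.025:0.45, closed = :left)   # Float64 weights can hold fractions
h.weights ./= sum(h.weights)   # rescale so the bin heights sum to 1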
plot(h)

It’s not a one-liner, but maybe it would be good enough for the moment, Seth? It has the additional advantage that you’ll have the histogram object available, to extract numerical values (i.e. weights / bin-heights) from.
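
On StatsBase versions whose `normalize` for histograms accepts a `:probability` mode (an assumption worth checking against the installed version), the manual division above can be sketched as follows. The data and bin edges are the same illustrative ones as above.

```julia
using StatsBase
using LinearAlgebra: normalize   # StatsBase adds a Histogram method to this function

data = 0.2 .+ randn(1000) ./ 20
h = fit(Histogram, data, 0:0.025:0.45, closed = :left)

# Assumes mode = :probability is supported: weights are divided by their sum.
hp = normalize(h, mode = :probability)
println(sum(hp.weights))
```

This returns a new histogram and leaves `h` with its raw counts, which may be convenient when both the counts and the proportions are needed.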