# Plots: How to create a histogram such that sum of bar heights =1

I want to use the Plots package to create a histogram such that the heights of the bars sum to one (preferably for each series). I was hoping that `normalize=true` would do the trick, but its goal is to make the areas of the bars sum to 1, i.e. to produce the equivalent of a PDF.

``````histogram([[1,2,2,3,3,3],[6,6,6,6,7,7,7,7,7,8]], something)
``````

should thus produce a histogram with two series. In the first series, there should be a bar at 1 with height 1/6, a bar at 2 with height 1/3 and a bar at 3 with height 1/2. Similarly, for the second series there should be a bar at 6 with height 0.4, a bar at 7 with height 0.5 and a bar at 8 with height 0.1.

I'm hoping there's some simple way to do this. I looked at StatPlots, but nothing there seemed to match exactly what I want.


With PyPlot, you can use the weights keyword to accomplish this:

``````using PyPlot
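# Each sample gets weight 1/N, so the bar heights sum to 1 (recent PyPlot.jl uses dot syntax: plt.hist)
weighted_hist(x; kws...) = PyPlot.plt.hist(x; weights=ones(length(x))/length(x), kws...)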

weighted_hist([1,2,2,3,3,3])
weighted_hist([6,6,6,6,7,7,7,7,7,8])
``````

The next release of Matplotlib apparently will have a `density=true` keyword to accomplish this, but I'm not sure when it will be released.


Very nice. Plots does not have a `weights` keyword for a histogram - it essentially supports the normalization modes that StatsBase does. I've never heard of the heights-sum-to-1 normalization before, but if it's statistically valid I don't see why we couldn't add the option to `normalize` (@oschulz, what's your opinion on this?).

@stevengj, wouldn't the `density=true` keyword lead to the areas of bars, rather than the heights, equalling the samples in each bin? (like the Plots option `:density`)

This seems to work well as of now:

``````PyPlot.plt[:hist](x,normed=true)
``````

which the matplotlib manual says should accomplish

> …the integral of the histogram will sum to 1.

/Paul S

Thanks for the help here. I would love it if Plots could incorporate a weights factor. As additional inducement, one of your proprietary colleagues (Mathematica) allows a variety of "hspec" arguments that modify the heights of the bars. These include Count (which is the Julia Plots default), PDF (which I believe corresponds to Julia Plots `normalize=true`), Probability (which is what I want), and a bunch of others, including an arbitrary transformation of bar heights via a function. You can look at the Mathematica documentation at http://reference.wolfram.com/language/ref/Histogram.html. Not, of course, that Julia Plots has to copy the approach, but it does suggest that some people have found a broader set of options to be useful.

Thanks. The problem is that I don't want the integral to sum to 1. I want the bar heights to sum to one. Unless every bar has width 1, the two things will not be the same. See my earlier comments about the options offered in Mathematica. What I basically want is for someone to look at the height of a bar and say, "Oh, I see that 37% of the entities in that group had x-values around whatever." As it stands, in Plots, when `normalize=false`, you get an absolute count. That's fine sometimes, but other times, particularly when you have grouped data, you want a fraction rather than an absolute count. And in Plots, when `normalize=true`, it looks like you get a bar height such that the integral of the bars is equal to 1, which will not help the person who wants to know what percentage of the entities in a particular group had some x-value.
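To make the distinction concrete, here is a minimal sketch in plain Julia. The counts, bin edges and variable names are made up for illustration and are not taken from any data discussed in this thread. It computes both normalizations from the same counts:

```julia
# Illustrative (made-up) bin counts and edges; each bin is about 0.02 wide.
counts = [2, 5, 3]
edges  = [0.10, 0.12, 0.14, 0.16]

widths = diff(edges)            # width of each bin
n      = sum(counts)            # total number of observations

prob = counts ./ n              # "probability" heights: they sum to 1
pdf  = counts ./ (n .* widths)  # "pdf" heights: the bar areas sum to 1

println("heights sum to ", sum(prob))
println("areas sum to   ", sum(pdf .* widths))
```

With narrow bins like these, the pdf heights come out at roughly 10, 25 and 15, while the probability heights are 0.2, 0.5 and 0.3. The two sets of heights differ by exactly the bin width.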

I'm going to try to post some PNGs from a Mathematica notebook that tries to explain the issue better.


Perhaps the easiest way right now is to code up the histogram calculations (an isInBin function…) and then use a bar plot (scaled in whatever way you want)?

Paul S
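A rough sketch of that do-it-yourself route, in plain Julia, might look like the following. The helper name `bin_proportions` and the choice of bin edges are assumptions made for illustration; bins are treated as left-closed, and values outside the edges are simply not counted (so the heights would then sum to less than 1).

```julia
# Hypothetical helper: fraction of `xs` falling in each bin [edges[i], edges[i+1]).
function bin_proportions(xs, edges)
    counts = zeros(Int, length(edges) - 1)
    for x in xs
        i = searchsortedlast(edges, x)   # last edge that is <= x
        if 1 <= i <= length(counts)      # skip values outside the edges
            counts[i] += 1
        end
    end
    return counts ./ length(xs)
end

# The two series from the original question, binned on shared unit-width edges.
edges = 0.5:1.0:8.5
p1 = bin_proportions([1, 2, 2, 3, 3, 3], edges)
p2 = bin_proportions([6, 6, 6, 6, 7, 7, 7, 7, 7, 8], edges)

# Bin centers, usable as numeric x positions for a bar plot.
centers = (edges[1:end-1] .+ edges[2:end]) ./ 2
```

The resulting vectors could then be handed to a bar plot, for example something like `bar(centers, p1)` in Plots, which keeps the x-axis numeric.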


The easiest at the moment is to just use PyPlot, which has the options. But of course there is always the approach of fitting a `StatsBase.Histogram`, extracting the `weights` field and transforming it. In fact, that is what Plots does behind the scenes.

I should add that in addition to `true` and `false`, normalize takes `:pdf`, `:density` and `:none`. So the option is open to add `:percentage` or whatever would be a good name.

Presumably someone already wrote up the histogram binning computations as part of Plots or some package it relies on. Anyone know where that code is located so I can avoid wheel reinvention?

Yes, @oschulz did that. I'd like to hear his opinion before committing to adding this, as he spent a very long time thinking about the right options. The normalization code is here: https://github.com/JuliaPlots/Plots.jl/blob/master/src/recipes.jl#L539-L549

If, like in the example, the histogram has equal width bars, then the only change necessary is to the y-axis. The easiest way would be to take the y-axis and change it manually to a rescaled one (the factor is simply the width of a bar in the histogram, since the histogram is normalized area and height = area / width-of-bar).
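As a minimal sketch of that relabelling, assuming equal-width bins and an illustrative bin width of 0.025 (both the value and the variable names are assumptions), one can compute where the "proportion" ticks fall on a pdf-scaled y-axis:

```julia
binwidth = 0.025                 # assumed common width of all bins
props    = 0:0.1:0.5             # proportions we want to show on the axis

# height = area / width, so a proportion p sits at pdf height p / binwidth
tickpos    = collect(props ./ binwidth)
ticklabels = string.(props)
```

To my understanding, Plots accepts a `(positions, labels)` tuple for `yticks`, so these could be passed along the lines of `histogram(x, normalize = true, yticks = (tickpos, ticklabels))`. This only works when every bin has the same width.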

> Presumably someone already wrote up the histogram binning computations

Do you mean something like `StatsBase.fit(Histogram, …)`?

@Seth_Chandler: As @mkborregaard already mentioned, Plots now relies on StatsBase for histogram building, including normalization, and therefore supports the normalization modes `:none`, `:pdf` and `:density`.

What you're describing is probably not very meaningful, statistically, for a "true" histogram (meaning a categorization based on bins of equal or even different, and possibly automatically chosen, size); it fits better with a typical "bar plot", where the categories might not even be numerical. I don't think a sum-heights-to-one normalization, as you describe, would be accepted for `StatsBase.Histogram`: the StatsBase maintainers put a strong emphasis on mathematical rigor, and for good reason. A normalization that does not take bin width into account is not very meaningful for a "true" histogram, as the bin size is somewhat arbitrary (usually chosen depending on the number of data points and other criteria).

We had a long discussion about different kinds of "histogram-like" plots in the past (see https://github.com/JuliaPlots/Plots.jl/issues/223#issuecomment-232096408). Basically, "true" bin-width based histograms and other categorizations are different things, semantically. Either can be visualized as step-like or bar-like, but in most cases the step-like style suits the former and the bar-like style the latter.

Currently, Plots doesn't really have great support for histogram-like categorizations that are not based on bin sizes - though maybe it should. IMHO this would require careful consideration and a clean concept behind it. @Seth_Chandler, could you tell us a bit more about your typical application(s) and the kind of categories you're dealing with?


We could consider taking the discussion to a Plots issue?

Anyway, until a new keyword is in place, the easiest approach is either to use @stevengj's PyPlot method or to do a Plots workaround, one of:

1. by fitting a StatsBase histogram, modifying the `weights` field and passing to `plot` (there is a recipe for that)
2. plotting the histogram as normal and specifying different `yticks` or
3. doing something hacky with modifying the plot object:
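``````function prophist(x; kw...)
    s = length(x)
    h = histogram(x; kw...)
    # Hacky: this reaches into Plots internals, which may differ between versions
    h[1][1][:y] ./= s                       # rescale bar heights from counts to proportions
    h[1][:yaxis].d[:extrema].emax /= s      # also change the ylimits
    h
end

prophist(randn(10000), bins = :scott)
``````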

Happy to do so. First, let me thank everyone for their thoughtful responses to my query. I'm new to Julia and so a welcoming community is important to me. I come from 20+ years of working with Mathematica, some work with R, and am trying to diversify my language portfolio.

Let me tell you the context in which my request arose.

I do a lot of research on the Affordable Care Act. I have a dataframe in which we have columns such as (1) the year, (2) the geographic rating area, (3) the age of a person living in that geographic rating area, (4) the income of a person living in that geographic rating area and (5) the percentage of that person's income they would need to contribute in order to purchase the second cheapest "Silver" plan sold in that geographic rating area.

I want to produce a graphic that basically compares the distribution of Column 5 among different values of Column 1. For what itâ€™s worth, I particularly want to compare the distribution in 2014 and the distribution in 2017. I want to do so for various age-income subsets of the population. An example would be people who are 60 years old and whose income is 4.5 times the federal poverty level.

I'm now going to show you the graphic as Mathematica produces it. I do so not because I think that Julia has to replicate every feature of Mathematica but to suggest that some serious people think what I want to do is legitimate. (I'll also show the code just because I think it's interesting to see a lot of similarities between Julia Plots and Mathematica.)

`````` Histogram[
Query[Select[#age == 60 && #fpl == 4.5 &] /* GroupBy[#year &],
All, #"contribution_pct" &][df], Automatic, "Probability",
ChartLegends -> Automatic,
Frame -> True,
FrameLabel -> {"gross premium\nas fraction of income",
"fraction of rating areas"},
PlotLabel -> "Fay: A 60 year old earning 450% FPL",
PlotTheme -> "Detailed"]
``````

The problem with using absolute counts on the y-axis is that there were fewer plans sold in 2017 than in 2014. Therefore, in my opinion, a graphic that uses absolute counts would confuse the extent to which there has been a rightwards shift in the distribution in question between 2014 and 2017.

I also don't think `normalize=true` will be particularly communicative to my audience, in that the y-axis values will not have great meaning to them. The values on the y-axis will be things like 15 and 20 because the x-domain is small. Those values don't really have any clear meaning in this context to economists/policy makers.

What I do think makes sense is the graphic above. One can read it to see that about 42% of rating areas in 2014 required the 60 year old in question to pay about 13% of their income to purchase the second lowest silver plan and that in about 10% of the rating areas in 2017 that same 60 year old is required to pay about 23% of their income to do the same thing. Regardless of what one may think about the politics of it all, those numbers communicate the point I am trying to make.

To generalize, the situation in which I believe a sum of the bars = 1 for each distribution would make sense is one in which one is comparing two discrete distributions, particularly ones that have different sized domains and particularly ones in which the number of values from which each distribution is derived differ. Also, a traditional bar graph may not work because the x-axis should be numeric, not categorical. (Maybe there is some option to the bar graph plotting routines that I did not notice??)

As I said, I am new to Julia and there may well already be a way of doing what I want. If not, though, it seems (to me) a sensible thing to desire. And, given my admiration for the Plots package and its cousins, that would seem a helpful place in which to put the functionality.

Thanks.


I find that argument convincing. It is intuitively just the proportion of values in each bin, rather than the count. Another important difference from `:pdf` is that these wouldn't be controlled for bin width for uneven bins (larger bins would have a bigger proportion).
I'd suggest trying to open a PR on StatsBase with this argumentation (the code is here: https://github.com/JuliaStats/StatsBase.jl/blob/master/src/hist.jl#L387-L433) - if they are against it, I'd suggest we could put a workaround in Plots, but the other seems like the right way to go.
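The point about uneven bins can be illustrated with a tiny sketch in plain Julia. The data and edges below are made up: four evenly spread values and two bins, one three times as wide as the other.

```julia
xs    = [0.5, 1.5, 2.5, 3.5]     # evenly spread, made-up data
edges = [0.0, 1.0, 4.0]          # first bin has width 1, second has width 3

widths = diff(edges)
# Count the values in each left-closed bin [lo, hi)
counts = [count(x -> lo <= x < hi, xs)
          for (lo, hi) in zip(edges[1:end-1], edges[2:end])]

prop = counts ./ length(xs)      # proportion per bin: the wide bin gets more
pdf  = prop ./ widths            # density per bin: corrected for bin width
```

Here the proportions are 0.25 and 0.75, so the wider bin looks three times as tall even though the data are spread evenly, whereas both pdf heights are 0.25.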

That sounds sensible, but I am not familiar enough with the Julia development process to do this. (For instance, I don't know what a PR is.) Might someone else take the lead?


A PR is a pull request - here it is: https://github.com/JuliaStats/StatsBase.jl/pull/293

It's indeed something people may want to plot. Personally, I would use PDF normalization in @Seth_Chandler's example above, so that the scale of the y-axis wouldn't depend on the bin width chosen for "gross premium" (as this is an arbitrary choice). I guess it depends on the target audience though - in cases like this, said audience might include at least some people with a less, ah, statistics-oriented mindset.

I'm skeptical about whether a normalization like "sum(bin_values) == 1" would be accepted into StatsBase, because it has little mathematical meaning (@nalimilan, @ararslan, what's your take on this?). Personally, I wouldn't be opposed to it, as I don't think it would hurt to have it available.

If not in StatsBase, we could offer it as an additional normalization in Plots - but it's a bit of a hacky solution, since we now have this nice streamlined integration with `StatsBase.Histogram`. We can certainly think about it, though.

In any case, I can offer a quick solution that'll work right now:

``````using StatsBase, Plots
data = 0.2 .+ randn(1000) ./ 20   # scalar + array needs broadcasting (.+) in current Julia
h = fit(Histogram{Float64}, data, 0:0.025:0.45, closed = :left)
h.weights ./= sum(h.weights)
plot(h)
``````

It's not a one-liner, but maybe it would be good enough for the moment, Seth? It has the additional advantage that you'll have the histogram object available, to extract numerical values (i.e. weights / bin-heights) from.
