Plots.jl - Combined Histogram and Distribution

I’m trying to convert some plots from R based examples to Julia.

The R examples are found here:

Essentially the single plot is a histogram of a probability distribution, with the corresponding Beta distribution plotted on top.

I’ve managed to do something similar in Plots.jl, but not quite…

My data is horse racing related, rather than baseball, and looks like this:

julia> trainers_sr_hundred_runs
380×4 DataFrame
│ Row │ trainer           │ runs  │ wins  │ sr        │
│     │ String            │ Int64 │ Int64 │ Float64   │
├─────┼───────────────────┼───────┼───────┼───────────┤
│ 1   │ A P OBrien        │ 394   │ 69    │ 0.175127  │
│ 2   │ Ed Dunlop         │ 1711  │ 198   │ 0.115722  │
│ 3   │ Sir Henry Cecil   │ 651   │ 128   │ 0.196621  │
⋮
│ 377 │ Mark Walford      │ 190   │ 14    │ 0.0736842 │
│ 378 │ Rebecca Bastiman  │ 113   │ 8     │ 0.0707965 │
│ 379 │ Heather Dalton    │ 146   │ 11    │ 0.0753425 │
│ 380 │ Sarah Hollinshead │ 109   │ 6     │ 0.0550459 │

Code is:

using DataFrames
using Distributions
using Plots
using StatsPlots

# Plot histogram of trainers strike rate, more than 100 runs
gr()
histogram(trainers_sr_hundred_runs.sr, label = "Strike Rate")

# Fit Beta distirbution to find alpha and beta
alpha_beta = fit(Beta, trainers_sr_hundred_runs.sr)
# Store alpha and beta in a named tuple
alpha_beta_tup = Distributions.params(alpha_beta)

# Plot the Beta distribution over the histogram
plot!(Beta(alpha_beta_tup[1], alpha_beta_tup[2]))

Created plot looks like:

julia_hist_beta

The problem looks like the Beta distribution is continuing out to 1, which makes some sense, but there’s no histogram values out there. So, perhaps the question is, how do I limit the Beta distribution plot to end at the largest value in the histogram/distribution.

Secondly, in the R examples, you’ll notice the y-axis of the combined plots is density rather than counts. I tried in Plots.jl starting with the Beta distribution and then the histogram, but it didn’t seem to make much difference. How does one set the y-axis to correspond with the second plot?

2 Likes

I think you probably have to build the histogram with

data_hist = fit(Histogram, data)

and then get the lowest and highest bins from it.

On the histogram, you need to pass the kwarg normalize = true to get it onto the same scale as the pmf plotted by StatsPlots.

On the truncation, you can just pass xlim to your plot, or you can actually truncate the distribution.

Below an example where I used the Lahmann data mentioned in the SO post, downloaded from here:

julia> using CSV, DataFrames, Distributions, StatsPlots

julia> batting = CSV.read("Batting.csv", DataFrame);

julia> batting = batting[batting.AB .> 500, :];

julia> data = combine(groupby(batting, :playerID), [:H, :AB] => ((h, ab) -> sum(h)/sum(ab)) => :average);

julia> fit_result = fit(Beta, data.average);

julia> histogram(data.average, label = "Data", linecolor = "white", normalize = true);

julia> plot!(fit_result, label = "Beta(91.8, 237.4)", xlim = (minimum(data.average)-0.02, maximum(data.average)+0.02))

which gives:

image

alternatively, you could have plotted Trunacted(fit_result, minimum(data.average), maximum(data.average))

6 Likes

That looks pretty amazing and exactly what I was aiming for. I bet it took you all of ten minutes, as opposed to the several hours I scratched my head over this. Thank you.

1 Like

Tbf, the only trick to make it look good is to set linecolor = "white" in the histogram :slight_smile: (although I think it isn’t totally exact anymore at that point, as the bar height is calculated taking the linewidth into account iirc)

Oh and maybe linewidth = 2 on the density plot would look better as well!

1 Like