Bug in StatsPlots' call to groupedhist with normalize=:true?

Hi there,
Consider the following data.csv file:

Class_Name	Grade
Turma 6	4.2
Turma 4	3.5
Turma 6	0.2
Turma 3 Especial	1.6
Turma 2 Piloto	7.8
Turma 4	1.4
Turma 5	1.6
Turma 6	3.8000000000000003
Turma 6	1.5
Turma 6	5.800000000000001
Turma 6	7.8
Turma 2 Piloto	3.3
Turma 2 Piloto	0.8
Turma 3 Especial	0.0
Turma 6	8.9
Turma 4	3.0
Turma 2 Piloto	5.0
Turma 6	1.1
Turma 5	5.2
Turma 6	4.2
Turma 6	7.1
Turma 2 Piloto	5.7
Turma 2 Piloto	0.8
Turma 5	4.0
Turma 6	3.5999999999999996
Turma 6	0.1
Turma 6	3.8000000000000003
Turma 1	3.3
Turma 6	4.0
Turma 2 Piloto	1.6
Turma 4	8.5
Turma 3 Especial	0.9
Turma 6	2.5
Turma 1	3.5
Turma 4	4.1
Turma 4	0.8
Turma 6	2.2
Turma 2 Piloto	1.7000000000000002
Turma 5	2.4
Turma 6	3.6
Turma 6	3.0
Turma 5	0.8
Turma 1	2.2
Turma 2 Piloto	1.6
Turma 4	1.6
Turma 5	2.1
Turma 3 Especial	1.7000000000000002
Turma 6	8.2
Turma 5	2.6
Turma 6	3.4000000000000004
Turma 4	2.7
Turma 6	4.800000000000001
Turma 2 Piloto	3.0999999999999996
Turma 5	2.9
Turma 6	3.5
Turma 5	1.8
Turma 6	1.6
Turma 1	8.4
Turma 2 Piloto	4.4
Turma 1	1.9000000000000001
Turma 6	2.5
Turma 6	0.0
Turma 6	4.9
Turma 4	3.6
Turma 6	3.9
Turma 4	0.8
Turma 6	1.2000000000000002
Turma 6	3.0
Turma 6	6.3
Turma 2 Piloto	5.4
Turma 3 Especial	0.0
Turma 6	1.6
Turma 1	1.7
Turma 5	2.1
Turma 1	5.4
Turma 2 Piloto	1.6

I tried to call the function groupedhist, from the package StatsPlots, grouping by the column :Class_Name and normalizing so that I could compare the grades of the students in the different classes, which have different numbers of students. To that end, I thought the parameter normalize would be appropriate (as suggested in the help for the function histogram), since it would seem to ensure that the total area of the bars for each group sums to unity. After reading the csv file into a DataFrame df_aux, I ran:

using StatsPlots, DataFramesMeta
group_hist = @with df_aux groupedhist(:Grade, group=:Class_Name, 
                   title="Histograms", bins=11, xticks=0:1:10, normalize=:true)

To my surprise, the resulting output plot is given by


which sure is weird: visually we notice that, for instance, the total area of the red bars (corresponding to “Turma 2 Piloto”) is manifestly less than that of the lighter blue bars (corresponding to “Turma 6”)! Could it be that, for the groupedhist function, the parameter normalize is incorrectly implemented, if at all? The heights of the bars look like the counts in the bins, despite the numeric labels along the vertical axis, which suggest some normalization…
Any help is gratefully appreciated!

I’m eyeballing here, but to me it looks like it could be that all bars (so aggregated over all colors) sum to 1, so perhaps the problem is it is not being normalized per group?

@JADekker You possibly nailed it (even by eyeballing!): I have roughly checked it by estimating the height of each bar… However, the question is: is this in fact the behavior we would want? I for one would rather have it behave as I described in my original post; what about you? Should I submit this as an issue/bug on the (Stats)Plots GitHub site?

Looks to me like the normalization should happen somewhere here, and it uses the total count as the denominator instead of the group count? I’d maybe expect normalization within groups, but I’m not sure how obvious it is here.
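To make the two denominators concrete, here is a minimal sketch that uses StatsBase directly rather than the StatsPlots internals (which I have not verified). The vectors a and b are made-up toy data, not the grades from the original post, and the helper area is just for illustration.

```julia
using StatsBase, LinearAlgebra

# Toy data (hypothetical): two groups of different sizes
a = [0.5, 1.5, 1.7, 2.2, 3.1, 3.4]   # 6 values
b = [0.8, 2.6]                        # 2 values
edges = 0:1:4                         # shared bin edges, bin width 1

ha = fit(Histogram, a, edges)
hb = fit(Histogram, b, edges)

binwidth = step(edges)
ntotal = length(a) + length(b)

# Common normalization: every group is divided by the TOTAL count
common_a = ha.weights ./ (ntotal * binwidth)
common_b = hb.weights ./ (ntotal * binwidth)

# Per-group normalization: each group is divided by its OWN count
pdf_a = normalize(ha, mode=:pdf).weights
pdf_b = normalize(hb, mode=:pdf).weights

# Area under a set of bars with equal bin widths
area(w) = sum(w) * binwidth

@show area(common_a) + area(common_b)   # all groups together integrate to 1
@show area(pdf_a), area(pdf_b)          # each group integrates to 1 on its own
```

With the common denominator, the smaller group necessarily gets a smaller area, which would match what the plot seems to show.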

Here is a link to a related issue on GitHub.

Hi there,
As I see it, Python Seaborn’s displot function, with its parameters hue, stat and, particularly, common_norm (cf. Visualizing distributions of data — seaborn 0.13.2 documentation, in the section entitled “Normalized histogram statistics”) offers all the options I would like to see implemented in StatsPlots’ groupedhist. Is that feasible? Again, should I redirect this discussion and open an issue at the (Stats)Plots GitHub site?
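In the meantime, one possible workaround is to call the plain Plots histogram! once per group, so that normalize=:pdf only ever sees a single group's data. This is a sketch, not what groupedhist does internally, and it only roughly emulates common_norm=False. The small df_aux below is a hypothetical stand-in for the DataFrame read from data.csv, and the bars are overlaid with transparency rather than placed side by side.

```julia
using DataFrames, Plots

# Hypothetical stand-in for df_aux; replace with the DataFrame read from data.csv
df_aux = DataFrame(Class_Name = ["Turma 1", "Turma 1", "Turma 6", "Turma 6", "Turma 6"],
                   Grade      = [3.3, 3.5, 4.2, 0.2, 7.8])

edges = 0:1:10   # shared bin edges so the groups stay comparable
plt = plot(title="Histograms (normalized per group)", xticks=0:1:10)

for g in groupby(df_aux, :Class_Name)
    # Each call receives one group's grades only, so the density is
    # computed from that group's own count
    histogram!(plt, g.Grade; bins=edges, normalize=:pdf, alpha=0.5,
               label=first(g.Class_Name))
end

plt
```

Passing the same explicit edges to every call matters here: letting each group pick its own bins would make the bar heights hard to compare.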
