AlgebraOfGraphics documentation frustrations

After undertaking one more attempt to wrap my head around AlgebraOfGraphics, I started to suspect the problem is not only me being stupid, lazy, and incapable of abstract logical thinking.

Ok, please help me to make some simple plots. First let’s produce some sample data

using AlgebraOfGraphics, CairoMakie, DataFrames  # a Makie backend such as CairoMakie is needed to render the plots

# sample data
function make_y(v, n)
    m = Matrix{Float64}(undef, length(v), n)
    for i in 1:n
        e = 0.4 + i * 0.1
        m[:, i] = v .^ e
    end
    return m
end

xs = 0.0:10
m = hcat(xs, make_y(xs, 5))
nms = vcat("x", ["y$n" for n in 1:5])
df = DataFrame(m, nms)

I’d like to have

  • Lines of the same color
  • Lines differentiated in color using default palette
  • Lines differentiated in color using given palette from ColorSchemes.jl (if supported)
  • Lines differentiated in style
  • Scatter differentiated in color
  • Scatter differentiated marker form

OK, the first is easy and kind of logical:

plt = data(df) * mapping(:x, names(df)[2:end] .=> :y) * visual(Lines) |> draw
3 Likes

You’re starting out with a wide-format dataframe which is not the most convenient format for AlgebraOfGraphics. It does have some convenience built in for that case as well, but generally it’s a bit easier to start with long-format because then you don’t have to wrangle the multi-dimensional mappings.

Let me try to put it into high-level perspective: In AlgebraOfGraphics you use a tabular data source and specify columns you want to plot. These columns are split into groups by specifying categorical columns in mapping. Each group then becomes one “trace” or separate plot, maybe split across facets even if you use layout, row or col. But there’s another higher level of grouping and that’s the multidimensional or “wide” case. This means that you basically define a “tensor” of mappings and for each element in this tensor you do the whole pipeline of grouping by categorical columns etc. So here it’s just a one-dimensional tensor (the vector of y columns) but it can go to arbitrarily many dimensions in principle. The dimensions don’t matter much except for the dims mapping helper which is special as it makes a faux categorical variable along one (or more, but usually one) dimension of the mapping tensor. This is cool and all, but the zero-dimensional case (where all mappings are just symbols) or long-format is the simplest and should be what you start with.
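
To make the wide/long distinction concrete, here is a minimal sketch that uses only DataFrames.jl. The toy table and its column names (`wide`, `long`, `series`) are made up for illustration and are not part of the dataset in this thread.

```julia
using DataFrames

# Wide format: one column per series, one row per x value.
wide = DataFrame(x = [1.0, 2.0], y1 = [10.0, 20.0], y2 = [100.0, 200.0])

# Long ("tidy") format: one row per observation, with the series name
# stored in its own column instead of in the column headers.
long = stack(wide, [:y1, :y2]; variable_name = :series, value_name = :y)

# The 2 rows x 2 series of the wide table become 4 rows in the long table.
@show size(wide) size(long)
@show names(long)
```

In the long table the `series` column plays the role of the categorical column that a mapping can group or color by.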

So here are the examples you wanted in wide format. Note the need for => renamer(ys), because the dimensions of the multidimensional input don’t automatically have names (maybe they could have them in simple cases, but generally different mappings could contribute to the same dimensions). I factored out the ys variable to keep it less verbose.

xs = 0.0:10
m = hcat(xs, make_y(xs, 5))
ys = ["y$n" for n in 1:5]
nms = vcat("x", ys)
df = DataFrame(m, nms)

# lines, all in the same color
data(df) * mapping(:x, ys) * visual(Lines) |> draw
# lines differentiated by color, default palette
data(df) * mapping(:x, ys, color = dims(1) => renamer(ys)) * visual(Lines) |> draw
# lines differentiated by color, palette from ColorSchemes.jl
data(df) * mapping(:x, ys, color = dims(1) => renamer(ys)) * visual(Lines) |> draw(scales(Color = (; palette = :Set1_5)))
# lines differentiated by line style
data(df) * mapping(:x, ys, linestyle = dims(1) => renamer(ys)) * visual(Lines) |> draw
# scatter differentiated by color
data(df) * mapping(:x, ys, color = dims(1) => renamer(ys)) * visual(Scatter) |> draw
# scatter differentiated by marker shape
data(df) * mapping(:x, ys, marker = dims(1) => renamer(ys)) * visual(Scatter) |> draw

And here are the same ones in long format. Most are simpler; only the first one needs the additional group mapping, because otherwise there’s just one zigzagging line:

dfl = stack(df, ys)
rename!(dfl, :value => :y, :variable => :group)

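# lines, all in the same color (group keeps the series as separate lines)
data(dfl) * mapping(:x, :y, group = :group) * visual(Lines) |> draw
# lines differentiated by color, default palette
data(dfl) * mapping(:x, :y, color = :group) * visual(Lines) |> draw
# lines differentiated by color, palette from ColorSchemes.jl
data(dfl) * mapping(:x, :y, color = :group) * visual(Lines) |> draw(scales(Color = (; palette = :Set1_5)))
# lines differentiated by line style
data(dfl) * mapping(:x, :y, linestyle = :group) * visual(Lines) |> draw
# scatter differentiated by color
data(dfl) * mapping(:x, :y, color = :group) * visual(Scatter) |> draw
# scatter differentiated by marker shape
data(dfl) * mapping(:x, :y, marker = :group) * visual(Scatter) |> draw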
9 Likes

@jules, thank you for your explanations.

My questions were just one side of my post.

Another side is the package documentation, which, while looking beautiful, is apparently of little help for a user like me.

I could probably help to improve it by continuing to ask naive questions, and maybe by providing some specific suggestions, but surely I can’t rewrite the docs on my own.

Now some specific notes.


Knowledge of long and wide formats is assumed as given. At the very beginning, the docs just say:

…“tidy” (long format) tables as input … “Tidy” tables are the most common input type, but wide data, pregrouped arrays, and other input types are also supported.

As of this evening, I understand what long vs wide format means, but actually it’s the first time I’m confronted with it, despite decades of experience in sciences and engineering. I’m just not a statistician.

As I was not clear about the data format expected by the package, I couldn’t really proceed any further.


The mapping reference doesn’t really explain what the positional arguments are, which named arguments are accepted, what data types they take, and how they are processed further.


Should we continue? I understand that reworking the documentation is a substantial effort, assuming “somebody” has the capacity and desire.

Or maybe I’m just the wrong type of user, had wrong expectations, and AoG is not for me. Then just please replace the phrase in the introduction

No familiarity with plotting or data analysis in Julia is required

by

You are expected to have some prior knowledge in R, tidy, you name it…

P.S. I definitely highly value the work of Makie creators. Sorry for sounding frustrated - it is just because I am :frowning:

2 Likes

Perhaps I may quote the first paragraph of the Conclusion to my recent article about Julia 1.12 in LWN (Julia 1.12 brings progress on standalone binaries and more [LWN.net]):

Some of the new features described above are essentially undocumented. As has been the case in the past, and as is the case with far too many Julia packages in the public General Registry, I had to find out how they work by perusing GitHub issues, forum discussions, and source code, but mainly by extensive and time-consuming experimentation. This is a blind spot widely afflicting developers in general; in the case of Julia it is an obstacle to wider adoption of the language.

9 Likes

Sure you may quote it, but does it apply here? I put a lot of work into AoG’s documentation, it even has a tutorial series now that guides you through a lot of the functionality. So I’m not sure why that is comparable to Julia’s own tendency to merge features without much documentation and makes you mention the term “essentially undocumented”.

The problems with the docs stated above refer to its use of terms like “tidy” or “long format dataframes”. It’s good feedback to hear that these may not be familiar to some users, for me they’re certainly so ubiquitous (because they dominate the R world for example) that I wouldn’t have thought of explaining them much further. But anyone is welcome to file a PR to adjust wording, add references, etc. so that is easily fixable.

Also the reference for mapping is mentioned and that it’s not easy to understand what the positional and named arguments are for. I reread Mapping | AlgebraOfGraphics and to me it feels like all the information is there, but I cannot simulate what it’s like to read this without my own background. So more helpful to me would be concrete suggestions for what to add, ideally in the form of a PR, which is easier to make progress on.

Edit:

Maybe some of my own frustration leaks through in the paragraphs above, I’m not sure. But especially given the quote above I want to remind everyone that all this is to a large part the result of free volunteer work, just done out of intrinsic motivation to create something nice and useful for me and other people. “Adoption” doesn’t really gain me anything except the feeling of doing something worthwhile, it mostly adds more pressure by increasing the number of people who want my time for support. That’s very different from a company that wants to sell you things to make money, where you can certainly say “for that amount X they should really offer more Y”. If you apply that same attitude in the open source volunteer world it just sounds weirdly entitled and doesn’t capture the collaborative spirit of it at all.

33 Likes

I think there are good points here. It’s true that documentation being insufficient for learning from scratch is a general problem, and it’s also true that it’s REALLY hard to write something like that even for very dedicated and thorough developers. “Scratch” isn’t even a consistent reference point; one person may have a passing familiarity with programming languages and the application’s topic, another may not even know what a subroutine is. It’s hard to fault developers sometimes for not explaining something that is likely or should be explained elsewhere. Even for very established languages like R and Python, I learned maybe 20% at most from great documentation; there were so many Q&A posts and forum discussions that do a faster and better job at telling a beginner what they need to know. Wider language adoption and good online tutorials are really a chicken-and-egg situation. FWIW, AlgebraOfGraphics has relatively good documentation, and plotting is just one of the harder topics to learn and teach; graphics can’t get away with a simple README or an API reference with a few REPL lines.

3 Likes

Jules, not only your efforts but the efforts of all who work on Makie and AoG are deeply appreciated. There is, of course, still a learning curve. As the author of A Tour of the Calculus notes

Nothing in the appendices is beyond the grasp of the ordinary reader, but there is no avoiding the fact that confrontation with proof is quite often a humbling experience. The eye slows; a feeling of helplessness steals over the soul. At first, it seems as if the confident language of mathematical assertion constitutes a subtle form of mockery. There is no help for any of this save the ancient remedies of practice and a willingness to put pencil to paper.

When I initially encountered Makie, I was bewildered and the documentation was in far poorer shape. Nevertheless, from what I could see of demos of functionality, layout, etc., there was “gold in them thar hills,” for which I persisted. The docs are now far better.

Naming issues remain, some inherited, and now legacy, which cannot be changed. In R, axis meant a singular axis. In Makie, the Axis object, perhaps following the matplotlib / matlab “Axes” object, meant both axes. In fact in matplotlib the meaning is broader: “axes is not the plural form of axis, it actually denotes the plotting area, including all axis.” This caused commenter Jan M. to swear under his breath.

Of course, such issues are not limited to programming. Has anyone tried to change the clock setting on a Samsung stove or an older model Subaru?

The remedy surely is not to tax overworked contributors to the packages. We can ask here, make PRs or open issues, write more (as Lee Phillips has done in articles and a book), and generally exercise patience. I love programming in Julia and hope it thrives. Recently, my plugging along with Makie paid dividends in displaying a multiplot layout of an involved data analysis. There is indeed gold here.

5 Likes

It seems a bit strange to me to single out AlgebraOfGraphics.jl, which in my experience is one of the best-documented Julia packages - the tutorials for example are an incredible service.

Regarding the specific point of wide vs. long format, maybe it would be enough to include a reference? It is a quite general concept and it is arguably not up to a plotting package to introduce it in detail.

10 Likes

@LeePhillips may I ask whether you have taken any time to study specifically the docs we are speaking about, before applying your sweeping generalization to this specific case?

It is singled out for a very obvious reason - I’m trying to learn this specific package. This is the context BTW.

I am very well aware this is FOSS, and I am not entitled to anything. And, let me repeat, I do highly appreciate your work. Definitely my intention was not bashing on the developers.


I have more to add, but right now have things to do. See you later :slightly_smiling_face:

2 Likes

That’s totally possible! I personally tried ggplot and aog, but this paradigm hasn’t really clicked for me.
They are fine for a set of basic plots that perfectly align with the “happy path”, but for basically anything else I found that I fight the “grammar” more than it helps me. Makie itself is quite well designed, so you might want to just continue using it without the grammar layer on top.

1 Like

OK, my technical takeaway from the explanations and a bit of experimentation:

  • Put your data into a long-format table before feeding them into AoG. Don’t try anything else. Full stop.
  • mapping will happily accept any number of positional arguments and keyword arguments with any names. It is the downstream functions that make sense (or not) of these arguments, and that is probably where to look for explanations. For XY plots, the first two positional arguments are X and Y, respectively; color, marker, and linestyle are among the accepted keywords.
  • The actual values of the data passed to color etc. are irrelevant, especially considering that they are e.g. the series name or some other categorical data. The sorting is in the order of the first occurrence of each value in the dataset.
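
As a rough illustration of the last two points, here is a minimal sketch with a made-up long-format table (the column names and values are hypothetical). It passes x and y positionally, maps a categorical column to color, and uses AlgebraOfGraphics’ sorter helper to pin the category order explicitly instead of relying on whatever the default ordering is.

```julia
using AlgebraOfGraphics, CairoMakie, DataFrames

# Toy long-format table (made-up values): two series over three x values.
df = DataFrame(
    x = [1, 2, 3, 1, 2, 3],
    y = [1.0, 2.0, 3.0, 1.0, 4.0, 9.0],
    series = ["linear", "linear", "linear", "square", "square", "square"],
)

# The first two positional arguments are x and y; keyword arguments such as
# color are aesthetics. `sorter` fixes the order of the categories, so
# "square" comes before "linear" in the color scale and legend.
spec = data(df) *
    mapping(:x, :y, color = :series => sorter(["square", "linear"])) *
    visual(Lines)

fig = draw(spec)
```

Swapping color for linestyle, or visual(Lines) for visual(Scatter) with marker, follows the same pattern as the examples earlier in the thread.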

On general things - more to continue later on.

I made a PR to add this info and examples to the docs: expand docs for wide vs long & mapping arguments by ericphanson · Pull Request #698 · MakieOrg/AlgebraOfGraphics.jl · GitHub

7 Likes

AoG is one of the best-documented packages I’ve used. It was largely on the basis of the quality of its documentation that I decided to try using it in my work to see how it fared. I’m glad I did, because this wonderful package made graphical exploration of, in my case, simulation results easier, more convenient, and more fun.

But I did find some concepts difficult to absorb. While that was mostly not a fault of the documentation but, rather, a consequence of the flexibility and newness (to me) of the AoG approach, @Eben60’s complaints are examples of a frequently encountered pattern in the way that documentation falls short. This is the “blind spot” that I referred to in my article. It is supremely difficult for software authors to “forget” what they know about their own creations and put themselves in the places of their readers — so we get terms and concepts used before they are explained.

So I’m suggesting that it may be worthwhile to take @Eben60’s comments seriously and search out such blind spots in the documentation. In this way documentation that’s already very good can become even more useful.

I certainly didn’t mean my comment about some Julia features being “essentially undocumented” to apply to the AoG documentation; that came at the end of an article devoted to some new features of Julia 1.12.

3 Likes

There are at least four kinds of documentation! About | Divio Documentation
Some developers only pay attention to one. Which may prove unfortunate when the other kind is required.

@PetrKryslUCSD may I ask you the same question?

1 Like

No, why? My comment was not specific to that package.

Because the discussion is specifically about this package, and it is natural to assume it is your opinion about the docs and the developers of this very package.

You know, generally there are different kinds of comments. Some are…

1 Like

I now have taken a look. Nice looking docs. But they do mix the different kinds of documentation. Perhaps that may be why some people find the docs less than optimal…

I find some comments in this thread highly unfair to the plotting package we use as an open source tool (AlgebraOfGraphics), which receives frequent, immediate support from its leading developers, and we do not have to pay a penny for all the functionality at our disposal.

In my work, I have done a lot of plotting. A long time ago, I used MATLAB, and the current AoG documentation is much better than that of the previous private software (e.g., even today in MATLAB, we can only access some plot images if we log in to our account). In Julia, I have extensively used PlotlyJS (or PlutoPlotly), and the AoG documentation is not worse than that of Plotly.py (the original plotting package developed for Python and used also in Julia). We may find fewer examples in the AoG docs than in Plotly, but that is easily explained by the number of years both packages have been with us (and the number of users is also a relevant issue). I have also used Plots and PGFPlotsX, and their documentation is no better than the AoG one.

In Python, I have used Altair, Bokeh, and Plotly. The documentation is very similar across those three plotting packages, and once again, what we find is a large number of examples, much larger than those in the AoG. However, I do not see the AoG documentation itself as worse than that of the Python packages mentioned above. In fact, I find it better organized, more appealing, and modern. Given the massive amount of plotting cases that we may have, it is almost impossible to have all the small details that may be relevant for implementing a particular type of plot in the documentation of any plotting package. And this happens in Python, in Julia, and in any other language ecosystem we know.

For example, consider the classical case of the Palmer Penguins, which involves the distinction between wide- and long-formatted data and is often used as a nice example of what a good package can accomplish. This data set is in long-format and is widely used in the AoG as an example of what the package can achieve with a single dataset. The treatment of this data set in the AoG is second to none, as far as any plotting package is concerned, to the best of my knowledge. I give four examples of how the documentation usually handles small details, e.g., data formats: Vega-Altair, Seaborn, Bokeh, and Plotly. In all of those packages, there is no entry at all about the nature of the data. Please see the image below and the three links to make the case as straightforward as possible: Bokeh, Seaborn. Plotly.py does not have an entry for the Palmer Penguins in the docs, but there is a Kaggle presentation Plotly For Palmer Penguins. Notice that in all these cases there is not a single line about the long-format of the original data set. In all these examples, people show how to read the data and do a particular plot. That is all! Better, it is impossible in open-source software (and the same is true in private software as well).

If we want to learn more about what a particular package can do, we have to look somewhere else. Two excellent examples of Makie and AoG can be found in 6 The AlgebraOfGraphics package and Puma Tutorials. However, notice that even in the excellent chapter by @jverzani, there is not a single line about the nature of the long-form data. People infer that we know the difference between the two formats and their implications. If we want to learn more about the different data formats and their use in Makie and AoG, we can look at the second link above, as the document was conceived with that in mind from the beginning.

Finally, a disclaimer. I am not a contributor to Makie or the AoG packages, not even a regular user of those packages. I play around with them from time to time. I want to thank their leading developers for the enormous amount of work they put into their development, and congratulate them on two excellent software packages that I can use free of charge, with technical support, and the possibility to request improvements that I may be the most immediate beneficiary of.

13 Likes

@VivMendes may I ask which comments, and what exactly, you find unfair? Without naming the wrong things it is difficult to correct them.