Building some Data analysis Tutorials

dlakelan · May 30, 2020, 3:48am

For several years I’ve wanted to jump into Julia. I tried last summer but wound up with no time because my wife was teaching a course and I had the kids full time. Now I have got a couple projects where I’m using Julia and would like to improve my own knowledge of the various tool-sets while contributing back by producing some Tutorials for loading data, and doing graphical data exploration and analysis / model fitting etc.

In the past I would always do this sort of thing in R, but I’ve become increasingly frustrated with the baroqueness of the modern R universe. Everything in R is a list of lists of vectors of arrays with named dimensions that’s secretly an S4 object or maybe an S3 object or whatever. The Hadleyverse is full of unusual evaluation, some things that look like variables are actually unevaluated symbols, quasiquotation, quosures, UGH!!! When it comes to computing, at heart, I’m a LISP hacker and it drives me crazy. And R is slow as a dog if you want to actually code anything that isn’t a call to a C function. Julia is what I want to use!

So anyway. I’d like to build a series of data analysis Jupyter notebooks using Queryverse, VegaLite, Distributions, GLM, Stan.jl, DifferentialEquations.jl and so forth.

Some of you may know me as a frequent commenter at https://statmodeling.stat.columbia.edu/ where I try hard to help teach people about data analysis ideas and open science and especially Bayesian thinking. So now, I’m going to try hard to teach people about data analysis practice, in Julia. Especially myself!

In some ways I’m carving out this little thread to discuss progress and get feedback. If the admins prefer it in some other section/category feel free to move it.

So, that’s my plan. At the moment I’m thinking of just opening a GitHub repo for the notebooks. I’m not super knowledgeable about Jupyter. Does anyone have any alternative suggestions about publishing these notebooks?

EDIT: also I recently watched the talk https://documentation.divio.com/ and his point about Tutorials being the hardest and least commonly done thing rang true. So I’m hoping to rectify that here. But there may also be some HowTo and Commentary documents as well. I’ll try to keep them separated!

robsmith11 · May 30, 2020, 4:14am

I’ve read some great posts on Bayesian methods on that Columbia site. Posting updates here would be a great way to engage the Julia community I think.

Regarding Jupyter notebooks, I don’t use them for my work, but I find GitHub’s notebook viewer to be a convenient way to view the notebooks others have shared. Perhaps pull requests could be a way to get feedback as well.

tlienart · May 30, 2020, 5:07am

Hello! It could be cool to coordinate: I help maintain what is now DataScienceTutorials.jl (https://alan-turing-institute.github.io/DataScienceTutorials.jl/). It’s mostly got tutorials using MLJ at the moment (that’s what it started with) but we’re hoping to also include tutorials from other packages and frameworks in Julia & reorganise the content to be more inclusive & accessible.

So if you intend to do this anyway, I’d be happy to add your tutorials on the website (with creds to you of course).

nilshg · May 30, 2020, 7:30am

I would also encourage you to look at Turing.jl, Gen.jl and Soss.jl as alternatives to just calling Stan from Julia!

dlakelan · May 30, 2020, 12:41pm

Thanks for the suggested packages. Turing was on my radar but Gen and Soss weren’t.

@tlienart will take a look at your tutorials. Certainly would be happy to have you link to my materials.

cpfiffer · May 30, 2020, 4:24pm

+1 for Turing.jl. If you’re interested, there’s some good tutorials here: Tutorials

dlakelan · May 30, 2020, 5:33pm

Awesome. I will check those out as well.

The intent here is to use the Julia ecosystem, but to teach analysis of data. It won’t be about “here’s how you use x package” but rather “Let’s answer some questions about several datasets using whatever in Julia seems to aid that goal”.

In line with the separation between the quadrants of that documentation talk, there will probably be fairly simple step-by-step documents: grabbing data, looking at it, coming up with questions, answering them using whatever seems appropriate in Julia, and then providing some workspace and exercise suggestions at the end. Those will be the “tutorials”. Then I may write separate documents that are discussions: why did we choose to do things the way we did in the tutorial, what are some other options, etc. That would be where I might compare and contrast other packages you could use.

If that sounds awesome, then I can tell you I’m looking forward to reading them as much as the rest of you

kevbonham · May 30, 2020, 7:07pm

I’d encourage you to look at Weave.jl (which is kinda like R markdown) or Literate.jl, both of which enable you to edit in plain text but can export as Jupyter notebooks, scripts, or HTML.

Notebooks are fine for working in and displaying results, but I find the experience of version controlling them quite tedious.

dlakelan · May 30, 2020, 7:34pm

I definitely wanted to provide the results as Notebooks so that people can interact with them, but I agree that writing stuff in a static literate language seems better. If they can compile to notebooks I’m up for using Weave or Literate. What are the relevant differences between the two? Which one do you tend to use?

kevbonham · May 30, 2020, 8:03pm

Weave.jl is like R markdown. So the document is markdown, and code is fenced, like:

# Here's a Header

Some text with **bold**

```julia
f(x) = x^2 + 4x + 2
f(4)
```

With Literate, the file is a julia file, and the explanations are in comments.

# # Here's a header
# 
# Some text with **bold**

f(x) = x^2 + 4x + 2
f(4)

I really like that in Literate, you can specify certain lines to only show up in Notebook exports, or only show up in Markdown exports, and I like the fact that the file is a runnable Julia script (though with Weave, you can export as a script). It’s also designed by one of the main contributors to Documenter, and so has a lot of nice features allowing the markdown export of Literate to play really nice with Documenter.
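
A minimal sketch of such a Literate source file, to make the export filters concrete. It assumes the `#md` and `#nb` line prefixes work as described in the Literate documentation; the wording of the filtered lines and the file name `tutorial.jl` mentioned below are made up for illustration.

```julia
# # Here's a header
#
# Some text with **bold**

#md # This sentence should appear only in the Markdown export.
#nb # This sentence should appear only in the notebook export.

f(x) = x^2 + 4x + 2
f(4)  # evaluates to 34
```

Saved as `tutorial.jl` with Literate.jl installed, `Literate.notebook("tutorial.jl", "out")` and `Literate.markdown("tutorial.jl", "out")` should write the two exports into an `out` directory, while the file itself still runs as an ordinary script, since the filtered lines are plain comments to Julia.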

The major downside of Literate IMO is that there isn’t a great deal of tooling for things like Atom or VS Code. So the markdown isn’t syntax highlighted, and when you write markdown with a lot of linebreaks as I do, it’s annoying to have to add the comment marker on every line (or the #nb # if you want a notebook-filtered line) etc.

One of the benefits of Weave in its own right is that there are a lot of options for code blocks, like hiding the output of a cell (or hiding the code and only showing the output).
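
A minimal sketch of those chunk options in a Weave markdown document. It assumes the `echo` and `results` options behave as described in the Weave documentation (`echo=false` hides the code, `results="hidden"` hides the output); the chunk contents and the file name `tutorial.jmd` mentioned below are made up for illustration.

````markdown
Only the output of this chunk should be shown:

```julia; echo=false
f(x) = x^2 + 4x + 2
f(4)
```

Only the code of this chunk should be shown:

```julia; results="hidden"
f(4)
```
````

Saved as `tutorial.jmd` with Weave.jl installed, `weave("tutorial.jmd")` should render the document, and Weave's `convert_doc("tutorial.jmd", "tutorial.ipynb")` should turn it into a Jupyter notebook.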

I tend to use Literate when my thing is code-heavy, when I want to run it as a script, or I want to use it with Documenter. I use Weave when there’s a lot of explanatory stuff or when I need more control over my code fences.

For what it sounds like you want to do, I’d probably recommend Weave, but only like 65/35. Hope this helps!

dlakelan · May 30, 2020, 8:38pm

Your reasoning is sound, and makes sense in my use case. My documents will probably be at least 50% explanation, and commenting everything would be irritating I think. Plus I’m familiar with Rmd so I’ll probably go with Weave. Thanks!

mthelm85 · June 1, 2020, 6:47pm

Three things:

1. Check out the Julia for Data Science YouTube series that’s on the official Julia channel. A link to the first video in the series is here.
2. Check out mybinder.org for making your notebooks fully executable in the browser, without the user having to download anything. I created a very basic, intro to Julia notebook (specifically for colleagues of mine) that I have running on Binder so you can see what that looks like here: Binder
3. I’ve been very frustrated at the lack of good data analysis/data science content on the web that relies on Julia. There are loads of great courses on a variety of online learning sites that make use of R/Python but almost nothing for Julia. I would be happy to contribute to this project and would be interested in linking up with you to share thoughts/organize an outline for topics to cover.

dlakelan · June 1, 2020, 6:57pm

Yes, it’s very frustrating for someone who knows a bunch about data analysis, in say R or Python, but wants to move to Julia and get up to speed at their former level of knowledge. So I’m hoping to alleviate that and also teach a bit about data analysis.

I am very happy to partner on this. I am really just getting started on the project though. How about I PM you on the forum here, and we can discuss some ideas there, and then feed the more fully formed ones back into this thread?

mthelm85 · June 1, 2020, 6:58pm

Sounds great!

kevbonham · June 2, 2020, 3:33pm

Feel free to loop me in on this too. I’m developing a course right now (starting next week) so may not be super available. But I’m still coming up with assignments, so there may be some mutually beneficial work to be done. My stuff will largely be biology focused, but I was planning to work with some covid datasets, so there may be broader interest.

dlakelan · June 2, 2020, 5:03pm

Nice. I have worked with biologists quite a bit over the years. What sort of topics are you working on?

I am in the process of writing the first of these tutorials. It basically downloads a public Census dataset, munges it, and makes a variety of plots to answer very basic questions about the data. Once that’s in a viable form I’ll put a git repo up on GitHub and mention it here, and we can discuss how to build on that foundation in different directions.

I think the “learn by doing” with not too much excess explaining is powerful. I do like to explain, so I’m thinking of having a companion to each tutorial that’s a discussion of why things were done, and why other things weren’t done etc.

alejandromerchan · June 2, 2020, 9:03pm

I’m also interested in this process. I work for the state of California and I use Julia for some basic data analysis and manipulation. I have some scripts that access a pretty comprehensive database about pesticide use in the state and have been meaning to learn more about the process, but also share some of the stuff I know. @mthelm85, for example, helped me in the past to do some mapping using VegaLite and I did a scientific presentation with that.

Please do not hesitate to ping me or message me.

kevbonham · June 3, 2020, 11:15am

I currently study the human microbiome (in kids, looking at relationships with cognitive development). The course will include sequence analysis, using web APIs for biological datasets, phylogenetics and a bunch of other stuff.

dlakelan · June 3, 2020, 9:23pm

Do you do sequence analysis in Julia btw? What tools are there for this kind of thing?

kevbonham · June 3, 2020, 9:34pm

BioSequences.jl and other stuff in BioJulia, mostly. I don’t do so much of this at the moment, and for this course I plan to do very basic things with strings mostly, or have them implement stuff themselves.
