Seeking advice on learning ML

I’ve done plenty of dabbling in ML and written some models that work (albeit not very well), but I’m finding it difficult to 1) navigate the many ML-related topics and decide which ones to spend more time on and which ones to gloss over, and 2) understand how to fine-tune a model once it’s working.

I’m hoping there is enough charity here in the Julia community to help guide me in the right direction :blush:. The idea I have is to take some data from the Census Bureau’s American Community Survey (ACS) and explore different ways to predict a person’s income. There are millions of rows of data in the 2013 - 2017 ACS file that include tons of measurements about the individual survey respondents (age, educational attainment, occupation, the industry in which they work, race, gender, etc.).

It seems like there should be enough information in this dataset to make fairly accurate predictions. I’m thinking I should start out simply trying to predict whether or not a person earns above or below some threshold amount by building a logistic regression model, a random forest, a neural network, and maybe some other model that’s good for this kind of problem.
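For concreteness, the thresholding idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain Python (standard library only) so it runs anywhere, the data are synthetic rather than real ACS records, and the names `THRESHOLD`, `make_synthetic_people`, `fit_logistic`, and the income formula are all made up for the example.

```python
import math
import random

THRESHOLD = 50_000  # hypothetical income cutoff, not taken from the ACS


def make_synthetic_people(n, seed=0):
    """Return made-up (age, years_of_education, income) rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(18, 65)
        edu = rng.uniform(8, 20)
        # Invented relationship plus noise, purely for illustration.
        income = 1000 * (0.5 * age + 4.0 * edu - 30) + rng.gauss(0, 8000)
        rows.append((age, edu, income))
    return rows


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


rows = make_synthetic_people(1000)
train, test = rows[:600], rows[600:]

# Standardize features using statistics from the training rows only.
means = [sum(r[j] for r in train) / len(train) for j in range(2)]
stds = [
    math.sqrt(sum((r[j] - means[j]) ** 2 for r in train) / len(train))
    for j in range(2)
]


def to_xy(data):
    X = [tuple((r[j] - means[j]) / stds[j] for j in range(2)) for r in data]
    # Binary target: 1 if income is above the threshold, else 0.
    y = [1 if r[2] > THRESHOLD else 0 for r in data]
    return X, y


X_train, y_train = to_xy(train)
X_test, y_test = to_xy(test)
w, b = fit_logistic(X_train, y_train)

predictions = [
    1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0
    for x in X_test
]
test_acc = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(f"held-out accuracy: {test_acc:.3f}")
```

The same split, fit, and held-out evaluation would apply to a random forest or a neural network; only the fitting step changes, which makes the models easy to compare on equal footing.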

Does this sound like a good start? Is there any reason I should start with one type of algorithm over another? Any tips/advice you can give? Does this sound like a decent ‘beginner’ problem to solve, or is it too complex?

Lastly, if anyone is interested in learning these topics, please reach out as I’d love to collaborate and learn together.

Thanks!!!

I can highly recommend The Elements of Statistical Learning or An Introduction to Statistical Learning. These books come with many rather small data sets that I find useful as examples to develop an intuition. I would not necessarily start with a huge dataset, because I don’t want to wait for my computer to finish a fit; I prefer when the results arrive within a fraction of a second when I am learning.

But the ACS data looks nice too, as does any competition or dataset on Kaggle.


Thank you!

It would probably be best to start from some more standard datasets. The Ames, Boston and California Housing datasets are easier to approach and have fewer layers of complexity. For classification, MNIST is quite standard. You can find them on https://www.kaggle.com.

Depending on your interests and background, you could focus on different topics. In addition to the books mentioned by @jbrea, you can also read Goodfellow, Bengio, Courville (2016) on deep learning and any of Vladimir Vapnik’s books on statistical learning. For a real classic in time series (not an easy book), take a look at Priestley (1981).


If you’re in the process of learning, I would reach for an ML framework that allows you to test many different things. I’ve experimented with MLJ and believe it has great potential. It’s similar in spirit to scikit-learn, but has been rethought to solve some common pain points of scikit-learn. The kind of data you’re talking about lends itself well to prediction using tree and forest models, unless you specifically want to experiment with deep learning.
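To give a feel for what a tree model does with tabular data like this, here is a minimal sketch of a single decision stump, the one-split building block of trees and forests. It is plain Python rather than MLJ, the eight rows are invented toy data, and the names `gini` and `best_stump` are hypothetical helpers for this example only.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_stump(X, y):
    """Search every feature and midpoint threshold for the purest split.

    Returns (weighted_gini, feature_index, threshold).
    """
    best = None
    n = len(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


# Invented toy rows: (age, years_of_education); label 1 = above threshold.
X = [(25, 12), (30, 16), (45, 12), (50, 18),
     (35, 10), (40, 20), (28, 14), (60, 11)]
y = [0, 1, 0, 1, 0, 1, 0, 0]

score, feature, threshold = best_stump(X, y)
print(score, feature, threshold)
```

On this toy data the stump picks the education column, since no age cutoff separates the labels cleanly. A full tree repeats the same search recursively on each side of the split, and a forest averages many such trees grown on resampled data.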


@baggepinnen @fipelle Thank you!

Thanks for raising the question! There is this recent text co-authored by @yoninazarathy and made available in draft form here: https://people.smp.uq.edu.au/YoniNazarathy/julia-stats/StatisticsWithJulia.pdf


(caught that in this post: Multivariate Normal Distribution - #8 by yoninazarathy)
This seems relevant to your question in particular:

9 Machine Learning Basics - DRAFT (p. 311)
9.1 Training, Validation and Testing (p. 311)
9.2 Bias, Variance and Regularization (p. 312)
9.3 Supervised Learning Methods (p. 315)
9.4 Unsupervised Learning Methods (p. 324)
9.5 Reinforcement Learning and MDP (p. 333)
9.6 A Taste of Generative Adversarial Networks (p. 340)

It’s a beautifully typeset text, and you can copy the Julia examples right out of the pdf and generate the same output and plots to validate.
There’s another good stats ref to pass along … thinking, or unthinking, let me dig it up…


Ah, found it! More on the statistics fundamentals. REthinking was the keyword… This may be more on the Bayesian side than what you’re looking for.
I knew it came from @Tamas_Papp; it was in a post he made as part of his “ANN: DynamicHMC 2.0” thread here: ANN: DynamicHMC 2.0 - #15 by Tamas_Papp
It’s “StatisticalRethinkingJulia” associated with the book Statistical Rethinking by Richard McElreath.
Dr. McElreath is Director of the Department of Human Behavior, Ecology, and Culture at the Max Planck Institute for Evolutionary Anthropology. He’s “an evolutionary ecologist who studies humans”, and as such, his lectures are engaging and fascinating!!
There’s a plethora of videos on his YouTube channel here:
https://www.youtube.com/channel/UCNJK6_DZvcMqNSzQdEkzvzA/videos
I’ve spent quite a few hours listening to those thanks to Mr. Papp! :face_with_raised_eyebrow: Seriously, watch one, they’re great!


Sorry for the rapid-fire replies, but this just popped to mind: did you see the announcement the other day that Julia Academy is now free and available online?! There are courses specific to ML.

And you might consider spending your lunch hour next Tue with Dr. @mbauman:
Webinar : Machine Learning with Julia
Julia Computing Webinar
Tuesday November 19 2019, 12:00 pm - 1:00 pm US Eastern Time
Mode : Online
Presenter : Dr. Matt Bauman, Senior Research Scientist, Julia Computing
This may be a personal link, as it came via email to the address I have linked to Julia Computing, but the form looks fine (if this should be pulled down, please alert me). I signed up but haven’t heard back yet:
