Where to begin with Julia ML (Flux/MLJ/MLJFlux) on custom datasets?

Hello again, friends,

I thought I’d ask this as ML and data science are somewhat outside my forte.

I’d like to use Julia for analyzing different datasets, not just for work. So to give you an idea, let me give a simple example:

Say we have an S3 bucket somewhere that holds Parquet files. We can read them in via DuckDB with DuckDB.jl, load them into DataFrames.jl to view and plot them, and slice up the data far more quickly than someone working with a CSV file in Excel. Lovely. But say we wanted to use ML with Julia to make some predictions. For the sake of the example, because I’m not typing out a Parquet file, we have the following data as a CSV:

Quarter,Year,"Total Sales","Total Business Expenses","CAPEX","Sales Percent Change from Previous Quarter","Percent Market Inflation Change","Insurance Paid"
1,2010,4502,1030,1000,0.12,0.1,2001
etc.

Let’s say this data goes up to the current quarter and year. We want to get a decent estimate of numbers like total sales, adjusted for inflation, for the next 12 quarters. There are likely lots of other columns that could be in this example dataset (in fact there would have to be for the estimate to be accurate), but for the sake of it that’s all I’m typing out, haha.

My question is: where does one start when using data like this with Flux and/or MLJ? I still have to learn more about things like loss functions and the different kinds of regression, but the guides for Flux and MLJ that I have looked at use existing datasets or arrays populated with random data, not custom ones.

I could be asking this and the answer is right in front of me, but I could also just not be too bright.

What would you use if not an already existing data set? Look at the tutorials here:

https://juliaai.github.io/DataScienceTutorials.jl/

Note, though, that time series forecasting is an area where “traditional” machine learning models that work well for predicting cross-sectional tabular data (like XGBoost) have struggled. I’m not sure what the state of the art currently is, but I think some newer specialist models have managed to win the M4 competition in recent years.

Where I work we maintain our own data that I think would be worthwhile analyzing with ML.

That’s great, although I’m not 100% sure on everything you just said. :sweat_smile:

While I’m not necessarily trying to implement some kind of market prediction based on past data, I do see Julia as a good tool for stuff like that. Obviously our own datasets might not be the only ones we would use, or even the best ones (although given our niche field, I doubt it), but they’d still be the primary ones to use.

On a personal level I would also like to use Julia to run analysis on economic data, like the US M1, M2 and M3 money supplies relative to reporting on inflation from the Federal Reserve and their relation to things like costs of certain consumer goods, etc. That data is out there and easily downloadable, though.

If you already know how to get your data into a DataFrame, you can use it with Flux or MLJ. It’s not really clear what the issue is…?

I guess my point is when you say

we maintain our own data

that means this data exists, right? There really isn’t a difference between doing data = CSV.read("fred_m1.csv", DataFrame) and data = CSV.read("my_internal_data.csv", DataFrame).
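To make that concrete, here is a minimal sketch of going from the CSV layout in the original post to a target vector and a feature matrix, which is roughly the shape most modelling packages expect. It assumes CSV.jl and DataFrames.jl are installed. Only the first data row comes from the thread; the other two rows are invented placeholders, and the choice of "Total Sales" as the target is just an assumption based on the question.

```julia
using CSV, DataFrames

# Stand-in for a real file or query result.
# Only the first data row is from the thread; the next two are made-up placeholders.
raw = """
Quarter,Year,"Total Sales","Total Business Expenses","CAPEX","Sales Percent Change from Previous Quarter","Percent Market Inflation Change","Insurance Paid"
1,2010,4502,1030,1000,0.12,0.1,2001
2,2010,4710,1055,900,0.05,0.1,2001
3,2010,4655,1100,950,-0.01,0.2,2001
"""

# Reading from an in-memory buffer works the same way as reading from a file path.
df = CSV.read(IOBuffer(raw), DataFrame)

# Target: the column to predict (assumed here to be "Total Sales").
y = Float64.(df[!, "Total Sales"])

# Features: every other column.
X = select(df, Not("Total Sales"))

# Many models ultimately want a plain numeric matrix (rows = observations).
Xmat = Matrix{Float64}(X)
```

Swapping the IOBuffer for a file path, or for a DataFrame built from a DuckDB query, leaves everything after the CSV.read line unchanged.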

If the issue is about the format that your data is in then it would probably be helpful to explain this in more detail.

Admittedly, I may have hurt myself in confusion trying to take in everything new about Julia, ML/AI and all the other tools my team and I would like to use. It’s a lot to take in.

I think in my mind there initially was a difference: looking through a lot of the examples out there, the data felt so foreign. But, as I mentioned to mthelm85, I might’ve just gotten overwhelmed and confused myself.

This is a pretty new field of computing for me, as I do more software development and sysadmin than data science and analysis for work and in my own time.

We’ve all been there - that just means you’re learning :slightly_smiling_face:. You might consider loading your data into a DataFrame and then starting with some simple time series forecasting methods. I really like StateSpaceModels.jl, but there are other options out there as well.
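As a rough illustration of what "simple" can mean here, below is a sketch of two standard baseline forecasts (naive and drift) written in plain Julia. This is not the StateSpaceModels.jl API, just hand-rolled helpers with hypothetical names (naive_forecast, drift_forecast). The sales numbers are invented apart from the first value, which is taken from the example row in the original post.

```julia
# Hypothetical quarterly sales; only the first value comes from the thread.
sales = [4502.0, 4600.0, 4710.0, 4800.0, 4905.0, 5010.0, 5098.0, 5200.0]

# Naive baseline: repeat the last observed value h steps ahead.
naive_forecast(y, h) = fill(y[end], h)

# Drift baseline: extend the average per-step change seen over the history.
function drift_forecast(y, h)
    slope = (y[end] - y[1]) / (length(y) - 1)
    return [y[end] + k * slope for k in 1:h]
end

horizon = 12  # the next 12 quarters

naive = naive_forecast(sales, horizon)
drift = drift_forecast(sales, horizon)
```

Baselines like these are mostly useful as a yardstick: if a fancier model cannot beat them on held-out quarters, it is probably not worth the extra complexity.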

That’s fair enough. I would encourage you to just give it a go though, probably following some of the MLJ tutorials I linked above.

Just see if you can find one which uses a data set that is roughly comparable to yours (e.g. if your outcome variable is the price of something then maybe the Boston Housing Data examples might be useful). I don’t think Flux is relevant to you. It’s more useful when you try to build your own model, and you should probably start with some existing, well-tested models.

I should also add that if all you are really after is a time series forecast, you are probably better off using the forecast package in R. There isn’t really anything in Julia that can compete for time series forecasting in terms of depth and breadth of functionality.

I’m guessing TimeSeries.jl doesn’t always fit the bill?

EDIT: I also found the Data Science in Julia for Hackers page (still can’t post links).

I’ll consider that. I think part of why I considered DuckDB.jl is because it’s cool.