I’m hoping to get some guidance on best practices for handling a large dataset comprised of SPY NYSE TAQ data (times & sales). All comments & suggestions are appreciated. Please be gentle, Julia rookie here.
My “coding” experience is… limited. I learned Stata in a grad stats class years ago and even used it somewhat regularly in a past career for basic data wrangling and analysis. I’m a few years removed from that position now, but recently decided to finally tackle mastering a programming language and landed on Julia, which brings me here. I’m all in, so I figured I’d start off the learning process by bugging smart people like you with basic questions (before you banish me to the books: I’m halfway through Think Julia, which I’m starting to think I might need to do another lap on, and I did google everything I’m about to ask but couldn’t find a clear answer).
Alright, so where I’m stuck.
1. Installed Julia: check
2. VS Code setup: check
3. Installed some packages: check
4. hello world: check
5. Attempting to understand the big picture (arrays, functions, tuples, dataframes, etc.): IN PROGRESS
5a. Learn by doing ← I’m here
So I have a SPY Times & Sales dataset with the variables date, time, trade condition, trade size, and trade price. The dataset is large: >4 million rows.
I’d like to:
Import dataset
Drop rows with specific trade conditions
Change datatypes (int64 → date, string → time)
Reduce the data (aggregate trade size and price by second; the data is currently timestamped to the nanosecond)
Learn Julia best practices, protips, magic tricks so that I can do this in the most efficient way
I played around with some code and was able to import the data into DataFrames and JuliaDB, drop some columns, and rename the vars, but I’m stuck on how to proceed.
The time data is separate from the date. From what I can see, most posted examples use a single DateTime type to represent date/time. Is that the best approach, and if so, how do I get there?
Should I parse the date/time on import? CSV.jl only allows one datetime parse, and the date and time are separate vars.
Should I trim the trailing nanoseconds from the time string before concatenating it with the date?
Once I concatenate them (if that is the consensus), how do I go about reducing/aggregating?
How do I optimize the code so that it runs fast and efficiently? When I run df[!, :date] = Date.(df.date, formats_dates), my computer just slowly chugs along; I let it sit for 20 minutes, figured I was doing something wrong, and halted the process. I’m on an M1 MacBook Air with 8GB of RAM; do I need more horsepower?
First, you will pretty likely want more than 8GB of RAM for this. The file itself is probably almost 1GB, right? When it gets loaded into memory it blows up in size, perhaps tripling (or more). When I work with TAQ data (you are not alone here, welcome!) I do so on a computing cluster, and I reserve at minimum 32GB of memory.
Second, whether to drop the nanoseconds is up to you. If that level of resolution is needed in your analysis, you can’t drop it! I’m guessing that you are working separately at the daily level. If so, you can drop the date column entirely to save space; just extract the date into its own variable and add it back later to any smaller/aggregated result tables. You can then parse the time column as a Time.
To aggregate within seconds (the easy way), I’d probably define a new timestamp that is truncated to the second, e.g. sectime = trunc(time, Second). You can then use groupby(df, "sectime") to group within seconds and then add up volumes and trades within groups.
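To make that concrete, here’s a minimal sketch of the truncate-and-group approach. It assumes the DataFrame has columns named :time (already parsed as a Dates.Time), :size, and :price; those names are just placeholders for whatever the TAQ file actually uses.

```julia
using DataFrames, Dates, Statistics

# Truncate each trade timestamp to its containing second
df.sectime = trunc.(df.time, Second)

# Group by second and aggregate size/price within each group
bars = combine(groupby(df, :sectime),
    :size  => sum  => :volume,     # total shares traded in that second
    :price => mean => :avg_price,  # simple average; swap in a size-weighted mean if you prefer
    nrow   => :ntrades)            # number of trades in that second
```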
What is your file format? Are you getting them as CSV? Depending on everything you are doing with the files you may not need to materialize the entire file at once anyway.
EDIT: Forgot you already showed us they are CSV files. Don’t forget to start Julia with julia -t X where X>1 to get multiple threads. Both CSV.jl and DataFrames.jl can natively multithread operations now.
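For example, a rough sketch (the thread count and file name are placeholders):

```julia
# Launch with e.g. `julia -t 4`, then:
using CSV, DataFrames

Threads.nthreads()                        # confirm how many threads Julia actually started with
df = CSV.read("spy_taq.csv", DataFrame)   # CSV.jl will parse a large file on multiple threads by default
```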
Kinda debatable whether that is large. IMO, I would say no; it’s easily processable by a laptop nowadays.
OK. In general, working with strings is more expensive, so if possible do not convert the column to strings. Try this: int_to_date.(df.date), which is about 1000x faster, where the `int_to_date` function is defined below in the details.
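The details block isn’t reproduced here, but a minimal sketch of such a function, assuming the TAQ date column stores integers like 20230117 (i.e. YYYYMMDD), could look like this:

```julia
using Dates

# Hypothetical helper: convert a YYYYMMDD integer directly to a Date,
# skipping string parsing entirely
function int_to_date(x::Integer)
    y, md = divrem(x, 10_000)   # 20230117 -> (2023, 117)
    m, d  = divrem(md, 100)     # 117 -> (1, 17)
    return Date(y, m, d)
end

df.date = int_to_date.(df.date)
```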
You are correct, the .csv weighs in at 1.82 GB. Although, I will say, the M1 8GB punches well above its weight class. It just feels like more vs. other 8GB systems I’ve used; not once did any other program even stutter while working with that file. Alas, more RAM is called for.
I want to construct different variations of intraday “bars” i.e. volume-based, $-based, information-based (Advances in Financial ML by Marcos Lopez de Prado) and, eventually, use it as a dataset for training a DRL model for futures trading. The ns granularity is most likely not needed as this entire project is just for fun.
I also want to figure out how to use the TD Ameritrade API to parse and store equities, options, futures tick level data in real time (I know, I know. One thing at a time. The ink is still wet on my hello world). Side note: if I ever run for president it will be on the platform of providing free, open source, granular, historical & real-time market data - the price of data is too damn high.
Will give your aggregate and multithread suggestions a try. Again, thank you!
That might be getting a little big for 8GB of RAM, depending on how many columns.
I think you can try other techniques like using ShortStrings.jl and PooledArrays.jl to compress any string columns you have. Can’t recall if there was a blog post about that.
Also, you may want to consider JDF.jl and use its type_compress! function to reduce the dataset size, if applicable.
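A rough sketch of what that could look like (the column name :cond is a placeholder for the trade-condition column):

```julia
using PooledArrays

# Trade-condition codes repeat constantly, so pooling stores each unique code only once
df.cond = PooledArray(df.cond)

# JDF.jl's type_compress! can additionally downcast numeric columns to smaller types:
# using JDF
# type_compress!(df)
```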
The SparkSQL.jl Julia package is designed for your use-case. Big data processing engines like Apache Spark exist to process petabytes of data. The SparkSQL.jl Julia package allows you to wrangle your data on Apache Spark from Julia. It then returns that data to a Julia DataFrame where you can then take advantage of the wonderful Julia data science packages.
To use the SparkSQL.jl package you will need to learn a language called SQL (Structured Query Language). SQL is the language most commonly used in data engineering, and database work so it is worth knowing alongside Julia.
The SparkSQL.jl package is designed for tabular data and supports important data types like decimal, timestamp, and date. There is a Pluto tutorial notebook for SparkSQL to help people get started:
One additional note as you are working with this data: just load 100k rows or something to test. Your functionality should be agnostic to whether you are working with the file or a subset of it. You can pass in limit=100000 to CSV.File to do this.
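Something like this (the file name is a placeholder):

```julia
using CSV, DataFrames

# Prototype on the first 100k rows; the same code should later run unchanged on the full file
df = CSV.read("spy_taq.csv", DataFrame; limit=100_000)
```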
skmcgov’s dataset is 45 million rows of data and in my experience Spark works well with data sets that size too. Additionally, Spark has very advanced SQL standard support so you get a lot of expressive power (see window functions for moving averages etc). SparkSQL and Julia used together is a very powerful data science toolkit.
I am sure. I just have a better experience with data.table instead. Anyway, 45 million rows with not many columns isn’t that bad. I can process 2 billion rows of data on a laptop with disk.frame. So yeah, do I need the headache of Java for this? IMO, no. I know some reach for Spark as soon as they can, but it’s not necessary, IMO.
For the type of workflow you describe, everything should be fast from a practical perspective. The biggest cost in all cases will be setup/compilation, not execution. Since you want to transform/aggregate the data, the relevant performance comparison for your data size is the Database-like ops benchmark, in the groupby/5GB data section. As you can see, any solution will handle your data in several seconds (assuming you have enough RAM, of course).
Thank you for your interest in the SparkSQL.jl package. The project is actively maintained.
Iceberg and Delta Lake are both supported. There is an example Pluto notebook on GitHub that shows how to use Delta Lake. Iceberg support works similarly.
An iceberg tutorial may be developed in the future.