First off, thanks to whoever changed the title.
I’ll try to decode the mystery here without violating NDAs, privacy policies, clearances, or whatever else the lawyers dream up. I work in a research org focused on non-standard problem solving, with lots of lateral thinking.
The data is a mess. It comes from building systems (keycard access, lighting management, facilities management, etc.), medical instruments, unknown devices, and more. There are 18 different formats in the raw log files, though all are semi-structured (keyword- or otherwise-delimited). Some timestamps are missing years, and some are written in a sort of HTTP-like format: “Feb 09@13:34:43.493Z”. I volunteered to explore the data as a side project.
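To make the timestamp problem concrete, here is a minimal Ruby sketch of normalizing the “Feb 09@13:34:43.493Z” style stamps into ISO 8601. The `assumed_year` parameter is hypothetical, standing in for however the missing year actually gets recovered (file metadata, surrounding lines, etc.):

```ruby
require "time"

# Illustrative normalizer for year-less stamps like "Feb 09@13:34:43.493Z".
# The real logs have 18 formats; this handles only this one as an example.
def normalize_timestamp(raw, assumed_year: 1970)
  m = raw.match(/\A(\w{3}) (\d{2})@(\d{2}):(\d{2}):(\d{2})\.(\d{3})Z\z/)
  return nil unless m # not this format; some other parser's job

  month, day, hh, mm, ss, ms = m.captures
  # Fill in the missing year, then round-trip through Time for a clean ISO 8601 string.
  Time.strptime(
    "#{assumed_year} #{month} #{day} #{hh}:#{mm}:#{ss}.#{ms} +0000",
    "%Y %b %d %H:%M:%S.%L %z"
  ).utc.iso8601(3)
end
```

For example, `normalize_timestamp("Feb 09@13:34:43.493Z", assumed_year: 2024)` yields `"2024-02-09T13:34:43.493Z"`, which sorts lexicographically and merges cleanly into a single timeline.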
Most of the progress so far has been writing regexes to “normalize” the data. It’s then put into CSV format for lack of a better option. I also wrote a parser that writes JSON files, thinking that might be better. I think Arrow, HDF5, or Parquet make a lot of sense as destination formats.
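The CSV/JSON emitting step looks roughly like this sketch in Ruby. The record shape and field names (`ts`, `source`, `event`) are purely illustrative, not the real schema:

```ruby
require "csv"
require "json"

# Hypothetical records as they look after regex normalization.
records = [
  { ts: "2024-02-09T13:34:43.493Z", source: "keycard", event: "door_open" },
  { ts: "2024-02-09T13:34:44.001Z", source: "hvac",    event: "setpoint"  }
]

# Emit CSV with a header row derived from the record keys.
csv_text = CSV.generate(headers: records.first.keys.map(&:to_s), write_headers: true) do |csv|
  records.each { |r| csv << r.values }
end

# Emit newline-delimited JSON (one document per line) from the same records,
# so the two outputs stay in sync.
jsonl_text = records.map { |r| JSON.generate(r) }.join("\n")
```

One line of JSON per record (JSONL) rather than a single giant array keeps the files streamable, which matters at billions of rows.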
Originally, I wrote everything in plain old Ruby because it was fast to prototype the code and get results. Again, it’s a side project on top of my 40+ hour a week normal workload. I started down the path of Crystal, but found it a bit immature at this point, though on the other hand it is basically compiled Ruby. While looking into R, Python, etc., I ran into Julia. After two weeks, I find I can prototype code almost as fast as I can with Ruby.
The raw data set is around 1 trillion log lines. After throwing out what I don’t think is needed, we’re down to 100 billion lines. Parsing that down to CSV/JSON formats yields 33 billion rows or JSON documents. When ordering everything into a giant timeline, we find there are about 258 million log lines per 24-hour period.
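Taken together, those last two figures imply the timeline covers roughly four months of activity; a quick back-of-the-envelope check:

```ruby
# Sanity check on the figures above: how many days does the timeline span?
total_rows   = 33_000_000_000          # normalized rows / JSON documents
rows_per_day = 258_000_000             # log lines per 24-hour period
days_covered = total_rows.fdiv(rows_per_day)  # fractional days of coverage
```

That works out to about 128 days, assuming the per-day rate is roughly uniform (in practice, building systems and instruments probably have strong day/night cycles).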
With the sheer performance of Julia, I was trying to find a way to use just Julia for all of the analysis without having to resort to “enterprise” databases and all their overhead. I think the “Julia way”, from what I’ve learned in this helpful thread, is to use Arrow/Parquet/HDF5 files instead of CSV/JSON or other text-based storage formats, and to spend my time on more data cleaning and analysis rather than writing my own in-memory database in Julia.
I hope this thread helps others too. I’ve certainly learned a great deal from it.