TLDR: Skip to the 3rd paragraph for the questions.
backstory
I’m working with a decent-sized dataset: around 100 billion log lines of sensor data at nanosecond resolution, growing by 258 million lines per day. Originally, I quickly wrote some prototype code in Ruby to analyze the data from the logfiles in place. After the disk I/O became too great and burnt out disks, we tried a series of databases: SQLite, MariaDB, PostgreSQL, Lucene (Solr and Elasticsearch), Cassandra, and now MongoDB. What I learned is that database administration is a full-time job that distracts from getting results from our data, and that it requires hardware costing the equivalent of 3-5 grad students. And none of the databases performed as expected or advertised without massive hardware clusters.
We just received a grant for another six months of progress on this research. Last month, I started to rethink wholesale what we’re doing, which led me to Julia on the recommendation of friends at various other organizations. Compared to Ruby, Julia is vastly faster at everything. I’ve been able to write new code far more productively, and the code performs far better, than in the first prototype period.
the questions
Working with a statistically valid subset of the data, around 33 billion log lines, what is the “Julia way” to work with it? DataFrames.jl looks nice, but our subset is around 10TB and that doesn’t fit in RAM (we have 1TB). JuliaDB seems abandoned? Even with this smaller dataset, SQL databases struggle. I wrote a parser in Julia to convert the raw logs to CSV, which reduces the total data size to around 4TB. We did upgrade the servers to pure NVMe disks, so filesystem I/O is vastly faster now, but still 1000x slower than memory.
Could I treat the CSV files on the filesystem as a “data base” and write code as queries against them? Is there some other way I’m missing? Do we really have to suffer the slings and arrows of outrageous databases? Could we hire Julia Computing or some Julia consultant/company to help figure this out, versus making a grad student suffer through databases and being a code monkey?
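To make the first question concrete, here is a minimal sketch of what I mean by a “query” against a CSV file, using only Base Julia. The column layout (timestamp_ns, sensor_id, value), the function name query_mean, and the sample rows are all hypothetical and only for illustration; they are not our real schema.

```julia
# Hypothetical layout: timestamp_ns,sensor_id,value
# Streams the file one line at a time, so memory use stays constant
# regardless of file size.
function query_mean(path::AbstractString, sensor::AbstractString,
                    t0::Integer, t1::Integer)
    n = 0
    total = 0.0
    for line in eachline(path)
        fields = split(line, ',')
        length(fields) == 3 || continue      # skip malformed rows
        ts = tryparse(Int64, fields[1])
        ts === nothing && continue           # also skips a header row
        fields[2] == sensor || continue      # filter on sensor id
        t0 <= ts <= t1 || continue           # filter on time window
        v = tryparse(Float64, fields[3])
        v === nothing && continue
        n += 1
        total += v
    end
    return n, (n == 0 ? NaN : total / n)
end

# Tiny made-up file to exercise the function.
path = tempname() * ".csv"
write(path, """
timestamp_ns,sensor_id,value
1000,a,1.5
2000,b,9.0
3000,a,2.5
4000,a,100.0
""")

n, avg = query_mean(path, "a", 0, 3500)
println("rows matched: ", n, ", mean: ", avg)
```

Every call is a full sequential scan of the file, so I assume this would only be workable if the files were partitioned (say, by day or by sensor) so that a query can skip most of them.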
Happy to read whatever I’m missing while we fight with importing BSON into MongoDB this week.
I’m truly impressed with all things Julia at this point. Julia is a joy compared to writing C/asm for performance.
Thanks!