Hi everyone,
I really like this idea and some of the comments here are very helpful, thanks a ton! But first things first:
I have always considered grep, sed, awk scripts very difficult to read.
Let me briefly introduce you to Miller - you will like it a lot. In my head it’s like a tidyverse version of awk and the Unix toolbox, written in modern Go (originally in C). It’s fast and easy to read, with verbs for common operations, but also a custom scripting language for more complicated stuff. And it’s a super-cool project! Miller alone was worth reviving this topic (sorry about that btw).
Even though Miller is nice, I need more. I work in human population genomics, with mostly TSV data on the order of hundreds of GB. Even after subsetting to particular features of interest I still typically have files of several to several dozen GB. Just today I reduced my data to a 27GB file for the analysis itself.
Now, I write code in R and awk. Even with HPC clusters, nobody wants to put a 27GB file into their R session, especially when it’s for a living. So my awk-fu is getting better, now juggling faster mawk (must be POSIX) and more expressive GNU awk (aka gawk). But I’m really starting to push the limits on these languages. I may try playing with dtplyr or dbplyr. Or…
So, obviously, I’m looking at Julia. It looks awesome! But most of the modules and approaches to data circle around in-memory data frames and such. I hardly found anything about processing files line-by-line, or streams like stdin and stdout (this is a good start btw, line parsing is next). You know, like awk, or Miller. Parsing, incidentally, was a no-brainer in awk and Miller, but the main reason I didn’t get far with R. I’ve looked at some approaches in Python and I expect Julia to be similar here - if anyone has some Julia-specific resource on that, throw it my way, please.
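To make the line-by-line idea concrete, here is a minimal sketch of the kind of Python approach I mean, not a finished tool. It streams a TSV with a header row and keeps the rows where one numeric column is above a cutoff, never holding more than one line in memory. The function name filter_tsv, the column name "af" and the cutoff 0.05 are all made up for illustration.

```python
import sys
from typing import Iterable, Iterator


def filter_tsv(lines: Iterable[str], column: str, threshold: float) -> Iterator[str]:
    """Yield the header, then every row whose `column` value exceeds `threshold`.

    `lines` can be any iterable of text lines (an open file, sys.stdin, a list),
    so only one line is held in memory at a time.
    """
    it = iter(lines)
    header = next(it, None)
    if header is None:  # empty input: nothing to emit
        return

    # Look up the column position by name, like Miller does with field names.
    names = header.rstrip("\n").split("\t")
    idx = names.index(column)  # raises ValueError if the column is missing
    yield header if header.endswith("\n") else header + "\n"

    for line in it:
        fields = line.rstrip("\n").split("\t")
        if float(fields[idx]) > threshold:
            # Normalise the line ending so the last row is terminated too.
            yield line if line.endswith("\n") else line + "\n"


# Hypothetical usage as a stdin -> stdout filter ("af" and 0.05 are assumptions):
#     sys.stdout.writelines(filter_tsv(sys.stdin, "af", 0.05))
```

Because it is a generator over an iterable of lines, the same function works on a file handle, a pipe, or a plain list in a test. I would expect a Julia version to have the same shape, but I have not written one yet.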
To summarise - I can totally see a use for a tool in Julia that would make replacing awk & the Unix toolbox easy(ier). Be it a module or two for easier scripting, or a full-blown data-munging tool like Miller. The author of Miller even considered a similar approach:
When I was first developing Miller I made a survey of several languages. Using low-level implementation languages like C, Go, Rust, and Nim, I’d need to create my own domain-specific language (DSL) which would always be less featured than a full programming language, but I’d get better performance. Using high-level interpreted languages such as Perl/Python/Ruby I’d get the language’s eval for free and I wouldn’t need a DSL; Miller would have mainly been a set of format-specific I/O hooks. If I’d gotten good enough performance from the latter I’d have done it without question and Miller would be far more flexible. But low-level languages win the performance criteria by a landslide so we have Miller in Go with a custom DSL.
But he didn’t know Julia. Actually, it looks like he still doesn’t know. Maybe I will introduce him, it could be a fun Xmas project…