Replacing grep, sed, awk scripts with Julia

Hello,

I have always considered grep, sed, and awk scripts very difficult to read.
I wonder if someone has tried to build a tutorial and tools to move from these kinds of tools to a proper scripting language such as Julia.

Python has https://github.com/ksamuel/Pyped
see also Replace one-liner sed/awk with python - Code Review Stack Exchange

but I’m pretty sure that Julia could have similar tools and become a nice tool for data cleaning / data preprocessing…

Best regards

2 Likes

Not exactly an answer to your question, but WordTokenizers.jl includes functionality to generate a Julia AST out of sed scripts.

https://github.com/JuliaText/WordTokenizers.jl/blob/5fad6ffb3678bda8e46bc87d9aeafa65bc69d439/src/words/sedbased.jl#L1-L43

2 Likes

Julia has built-in perl-like regular-expression search and replace, so it is pretty straightforward to implement anything you would have done with a sed script.
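For instance, a global sed substitution like s/cat/dog/g maps directly onto replace with a Regex; a minimal sketch (the strings are made up for illustration):

```julia
# Equivalent of: sed 's/cat/dog/g' (replace substitutes all matches)
println(replace("the cat sat on the cat mat", r"cat" => "dog"))

# Capture groups work like sed's \1: turn "key: value" into "value = key"
println(replace("name: Julia", r"^(\w+): (\w+)$" => s"\2 = \1"))
```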

2 Likes

As @stevengj said, it is usually very straightforward. E.g., see an example of replacing multiple strings in my package generator skeleton.jl.

Usually, Julia code is more verbose than sed, awk, and other specialized tools, but I consider that an advantage: whenever I am using the latter, I search the net furiously for examples and end up with something brittle and cryptic.

2 Likes

I wanted for example to do something similar to this Python script.

script.py

import sys
for line in sys.stdin:
    line = line.upper()
    cur_name, cur_symb, _, value = line.split()
    cur_symb = cur_symb.strip('()')
    print(f"{cur_name:>10} {value:>3} {cur_symb}")
    #print(line, end='')

with data.txt

bitcoin (btc) : 5
euros (€) : 100
dollars ($) : 80

which can be called using:

cat data.txt | python script.py

and output:

   BITCOIN   5 BTC
     EUROS 100 €
   DOLLARS  80 $

and finally did this Julia script:

script.jl

for line in readlines(stdin)
    line = uppercase(line)
    cur_name, cur_symb, _, value = split(line)
    cur_symb = strip(cur_symb, ['(', ')'])
    println("$(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)")
end

which can be called using

cat data.txt | julia script.jl

Maybe we should write a kind of tutorial about data cleaning with Julia, taking inspiration from the large number of data cleaning tutorials that teach iconv, head, tail, tr, wc, split, sort, uniq, cut, paste, join, grep, sed, awk, …, and show that we can do all these tasks with ONE tool: Julia (I know it’s not the Unix philosophy).
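As a taste of what such a tutorial could cover, here is a hedged sketch of how a few of those tools map onto plain Julia (the sample data is made up):

```julia
# Made-up sample data standing in for the lines of a file
lines = ["banana", "apple", "banana", "cherry"]

println(filter(l -> occursin(r"an", l), lines))  # grep 'an'
println(sort(lines))                             # sort
println(unique(sort(lines)))                     # sort | uniq
println(length(lines))                           # wc -l
```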

We should also probably make a command-line tool for that purpose, which could be used to write one-liners like

cat data.txt | jsed "println(uppercase(line))"

or more complex processing on a stream of lines.

Commands could be given like

cat data.txt | jsed "line = uppercase(line)" "cur_name, cur_symb, _, value = split(line)" "cur_symb = strip(cur_symb, ['(', ')'])" "println(\"$(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)\")"

or using multiline

cat data.txt | jsed """
line = uppercase(line)
cur_name, cur_symb, _, value = split(line)
cur_symb = strip(cur_symb, ['(', ')'])
println("$(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)")
"""

A “specialised” Julia template could be set using a parameter passed to jsed, but in most cases it shouldn’t be necessary.

What is your opinion about such a tool?

PS: maybe we should have something like

for (count, line) in enumerate(readlines(stdin))
    line = uppercase(line)
    cur_name, cur_symb, _, value = split(line)
    cur_symb = strip(cur_symb, ['(', ')'])
    println("$(count) $(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)")
end

so we can process headers differently from data, or skip headers using count, so the template will be

for (count, line) in enumerate(readlines(stdin))
    ...
end
1 Like

I don’t see the advantage compared to just writing a script.

For the same reasons people use sed, awk, and other Bash tools instead of a “complete” language.

1 Like

I’m trying

cat data.txt | julia -e "
for (count, line) in enumerate(readlines(stdin))
    line = uppercase(line)
    cur_name, cur_symb, _, value = split(line)
    cur_symb = strip(cur_symb, ['(', ')'])
    println(\"$(count) $(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)\")
end
"

but it doesn’t work.

-bash: count: command not found
-bash: command substitution: line 1: syntax error near unexpected token `cur_name,'
-bash: command substitution: line 1: `lpad(cur_name, 10)'
-bash: command substitution: line 1: syntax error near unexpected token `value,'
-bash: command substitution: line 1: `lpad(value, 3)'
-bash: cur_symb: command not found

Any idea?

sed, awk, etc. are themselves rather complex languages (if not as complex as Julia). AFAIK their main advantages are speed, availability, and familiarity (where applicable, if you already use these tools).

But for my own purposes, Julia is already available, it is fast, and I am familiar with it. Also, for complex operations, it has the advantage that I can modularize my code and unit test parts. I find this crucial, and this is why I am not that keen on doing complex things with command line tools.
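For example, the per-line formatting logic from the script earlier in this thread can be pulled into a function and unit-tested with the standard Test library (format_line is a hypothetical name):

```julia
using Test

# Hypothetical helper: the per-line logic, extracted so it can be tested
function format_line(line::AbstractString)
    line = uppercase(line)
    cur_name, cur_symb, _, value = split(line)
    cur_symb = strip(cur_symb, ['(', ')'])
    return "$(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)"
end

@test format_line("bitcoin (btc) : 5") == "   BITCOIN   5 BTC"
```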

In fact, the problem was that $ needs to be escaped with \ when used inside double quotes on the command line, because bash otherwise performs its own substitution.

WIP:

jsed.jl

code = join(ARGS, ";")
template = """
for (count, line) in enumerate(readlines(stdin))
    $(code)
end
"""
eval(Meta.parse(template))

Usage:

cat data.txt | julia jsed.jl """
line = uppercase(line)
cur_name, cur_symb, _, value = split(line)
cur_symb = strip(cur_symb, ['(', ')'])
println(\"\$(count) \$(lpad(cur_name, 10)) \$(lpad(value, 3)) \$(cur_symb)\")
"""

or as one-liner

cat data.txt | julia jsed.jl "line = uppercase(line)" "cur_name, cur_symb, _, value = split(line)" "cur_symb = strip(cur_symb, ['(', ')'])" "println(\"\$(count) \$(lpad(cur_name, 10)) \$(lpad(value, 3)) \$(cur_symb)\")"

or

cat data.txt | julia jsed.jl "line = uppercase(line); cur_name, cur_symb, _, value = split(line); cur_symb = strip(cur_symb, ['(', ')']); println(\"\$(count) \$(lpad(cur_name, 10)) \$(lpad(value, 3)) \$(cur_symb)\")"

but jsed now needs to be more “generic”, accepting several kinds of templates.

We also need to have jsed as a system command (i.e. available on the PATH) so we could do:

cat data.txt | jsed "line = uppercase(line)" "cur_name, cur_symb, _, value = split(line)" "cur_symb = strip(cur_symb, ['(', ')'])" "println(\"\$(count) \$(lpad(cur_name, 10)) \$(lpad(value, 3)) \$(cur_symb)\")"
2 Likes

I like your idea. :slight_smile:

But I would probably start with a package as a helper for these things. (You are probably going there with your WIP.)

You could add the following line to your .bashrc file:

alias jsed='julia jsed.jl'

Or you could use single quotes, so $ doesn’t need escaping:

$ echo 'hello ;)' | jsed '''
println("$line")
'''
hello ;)
1 Like

I completely agree with your points regarding data cleaning.
HOWEVER, as a systems guy, we will continue to use bash/sed/awk and friends.
Why? You can usually expect to have these installed in a default Linux installation.
Julia is not a default install.
My argument falls apart for cross-platform Windows and Linux, of course!

Also systems guys are VERY wary of making systems dependent on any “non standard” versions.
I have a genuine war story of a $million SGI system which would not boot following an update of the bash version (not an official SuSE update). Cue me booting from a USB stick and reversing the update.

2 Likes

I use the_silver_searcher which I really quite love.

2 Likes

I kinda feel ripgrep obsoleted the silver searcher.

5 Likes

Cool, I was not aware of this. Looks like there is even a vim plugin which was one of my main uses of silver searcher. Looking forward to experimenting with it.

Hi everyone,

I really like this idea and some of the comments here are very helpful, thanks a ton! But first things first:

I have always considered grep, sed, awk scripts very difficult to read.

Let me briefly introduce you to miller - you will like it a lot. In my head it’s like a tidyverse version of awk and the Unix toolbox, written in modern Go (originally in C). It’s fast and easy to read, with verbs for common operations, but also a custom scripting language for more complicated stuff. And it’s a super-cool project! Miller alone was worth reviving this topic (sorry about that btw :sweat_smile:).

Even though miller is nice, I need more. I work in human population genomics, with mostly TSV data on the order of hundreds of GB. Even after subsetting to particular features of interest, I still typically have files of several to several dozen GB. Just today I reduced my data to a 27 GB file for the analysis itself.

Now, I write code in R and awk. Even with HPC clusters, nobody wants to put a 27 GB file into their R session, especially when it’s for a living. So my awk-fu is getting better, now juggling the faster mawk (must be POSIX) and the more expressive GNU awk (aka gawk). But I’m really starting to push the limits of these languages. I may try playing with dtplyr or dbplyr. Or…

So, obviously, I’m looking at Julia. It looks awesome! But most of the modules and approaches to data revolve around in-memory data frames and such. I have hardly found anything about processing files line by line, or streams like stdin and stdout (this is a good start btw; line parsing is next). You know, like awk, or miller. Parsing, incidentally, was a no-brainer in awk and miller, but it was the main reason I didn’t get far with R. I’ve looked at some approaches in Python and I expect Julia to be similar here; if anyone has a Julia-specific resource on that, throw it my way, please.
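For what it’s worth, plain eachline already gives awk-style constant-memory streaming; a minimal sketch (sum_col2 and the two-column layout are made up for illustration):

```julia
# eachline is lazy, so memory use stays constant regardless of file size,
# and it works on any IO: a file, stdin, or a buffer in a test.
function sum_col2(io)
    total = 0
    for line in eachline(io)
        startswith(line, "#") && continue          # skip comment lines
        total += parse(Int, split(line, '\t')[2])  # assumes column 2 is an Int
    end
    return total
end

println(sum_col2(IOBuffer("a\t1\nb\t2\n#skip\nc\t3\n")))  # prints 6
# On a real file: sum_col2(open("data.tsv")), or open(sum_col2, "data.tsv")
```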

To summarise - I can totally see use for a tool in Julia that would make replacing awk & the Unix toolbox easier. Be it a module or two for easier scripting, or a full-blown data-munging tool like miller. The author of miller even considered a similar approach:

When I was first developing Miller I made a survey of several languages. Using low-level implementation languages like C, Go, Rust, and Nim, I’d need to create my own domain-specific language (DSL) which would always be less featured than a full programming language, but I’d get better performance. Using high-level interpreted languages such as Perl/Python/Ruby I’d get the language’s eval for free and I wouldn’t need a DSL; Miller would have mainly been a set of format-specific I/O hooks. If I’d gotten good enough performance from the latter I’d have done it without question and Miller would be far more flexible. But low-level languages win the performance criteria by a landslide so we have Miller in Go with a custom DSL.

But he didn’t know Julia. Actually, it looks like he still doesn’t know. Maybe I will introduce him, it could be a fun Xmas project… :grin:

3 Likes

If you’re working with delimited data like TSV and CSV, you can use a CSV.Rows object. It iterates through the rows of a file while keeping only a minimal amount of data in memory.
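For example (assuming the CSV.jl package is installed; the file name and column names below are placeholders):

```julia
using CSV

# CSV.Rows streams the file: only the current row is held in memory,
# and values come back as strings unless you request typed parsing.
for row in CSV.Rows("variants.tsv"; delim='\t')   # "variants.tsv" is a placeholder
    println(row.chrom, ":", row.pos)   # column names come from the header line
end
```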

2 Likes