Replacing grep, sed, awk scripts with Julia

Hello,

I have always found grep, sed, and awk scripts very difficult to read.
I wonder whether someone has tried to build a tutorial and tools to help move from these kinds of tools to a proper scripting language such as Julia.

Python has https://github.com/ksamuel/Pyped
see http://sametmax.com/remplacer-sed-awk-cut-et-perl-par-python-orgasme-pour-sysadmin/
see also https://codereview.stackexchange.com/questions/148547/replace-one-liner-sed-awk-with-python

but I’m pretty sure that Julia can have similar tools and could become a nice tool for data cleaning / data preprocessing…

Best regards

Not exactly an answer to your question but WordTokenizers.jl includes functionality to generate julia AST out of sed scripts.

Julia has built-in perl-like regular-expression search and replace, so it is pretty straightforward to implement anything you would have done with a sed script.
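
To make that concrete, here is a minimal sketch of sed-style substitution and grep-style filtering using only Base functions (`replace`, `occursin`, `filter`). The helper name `sed_s` and the sample strings are made up for illustration.

```julia
# Hypothetical helper: rough equivalent of sed 's/pattern/subst/g' on one line.
sed_s(line, pattern::Regex, subst) = replace(line, pattern => subst)

# Plain replacement, similar to: sed -E 's/[0-9]+/N/g'
println(sed_s("order 66 shipped in 3 days", r"[0-9]+", "N"))

# Capture groups go in an s"..." substitution string,
# similar to: sed -E 's/(\w+)@(\w+)/\2 at \1/'
println(sed_s("alice@example", r"(\w+)@(\w+)", s"\2 at \1"))

# grep-like filtering: keep only the lines that match a regex
lines = ["error: disk full", "ok", "error: timeout"]
println(filter(l -> occursin(r"^error", l), lines))
```

Unlike sed, `replace` substitutes every match by default; pass `count=1` to replace only the first one.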

As @stevengj said, it is usually very straightforward. E.g., see the example of replacing multiple strings in my package generator, skeleton.jl.

Usually, Julia code is more verbose than sed, awk, and other specialized tools, but I consider that an advantage: whenever I am using the latter, I search the net furiously for examples and end up with something brittle and cryptic.
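
For illustration, replacing several strings in one pass over a template could look roughly like this. This is only a sketch of the general idea, not the code from skeleton.jl; the placeholder names, the values, and the helper `replace_all` are all invented.

```julia
# Hypothetical placeholder table; names and values are made up for illustration.
replacements = ["{PKGNAME}" => "MyPkg", "{USERNAME}" => "jdoe", "{YEAR}" => "2018"]

# Apply each "old => new" pair in turn to the text.
function replace_all(text::AbstractString, subs)
    for p in subs
        text = replace(text, p)
    end
    return text
end

template = "Copyright {YEAR} {USERNAME}, package {PKGNAME}"
println(replace_all(template, replacements))
```

It is more verbose than a chain of `sed -e` expressions, but each substitution is an ordinary value that can be stored, printed, or tested.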

I wanted for example to do something similar to this Python script.

script.py

import sys
for line in sys.stdin:
    line = line.upper()
    cur_name, cur_symb, _, value = line.split()
    cur_symb = cur_symb.strip('()')
    print(f"{cur_name:>10} {value:>3} {cur_symb}")
    #print(line, end='')

with data.txt

bitcoin (btc) : 5
euros (€) : 100
dollars ($) : 80

which can be called using:

cat data.txt | python script.py

and output:

   BITCOIN   5 BTC
     EUROS 100 €
   DOLLARS  80 $

and I ended up with this Julia script:

script.jl

for line in readlines(stdin)
    line = uppercase(line)
    cur_name, cur_symb, _, value = split(line)
    cur_symb = strip(cur_symb, ['(', ')'])
    println("$(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)")
end

which can be called using

cat data.txt | julia script.jl

Maybe we should write a kind of tutorial about data cleaning with Julia, taking inspiration from the many data-cleaning tutorials that teach iconv, head, tail, tr, wc, split, sort, uniq, cut, paste, join, grep, sed, awk, …, and show that all these tasks can be done with ONE tool: Julia (I know it’s not the Unix philosophy).
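
As a starting point for such a tutorial, here is a rough sketch of how a few of those tools map onto Base Julia. The sample lines are adapted from data.txt above (with one duplicate added), and the shell commands in the comments are approximate equivalents rather than exact ones.

```julia
# Sample input; in a real script this could be `lines = readlines(stdin)`.
# Note the escaped \$ so Julia does not treat it as string interpolation.
lines = ["bitcoin (btc) : 5", "euros (€) : 100", "dollars (\$) : 80", "euros (€) : 100"]

n_lines   = length(lines)                            # wc -l
matching  = filter(l -> occursin(r"^e", l), lines)   # grep '^e'
first_col = [split(l)[1] for l in lines]             # cut -d' ' -f1 (roughly)
deduped   = unique(sort(lines))                      # sort | uniq
shouted   = uppercase.(lines)                        # tr a-z A-Z (roughly; also handles non-ASCII)
total     = sum(parse(Int, split(l)[end]) for l in lines)  # awk '{s += $NF} END {print s}'

println(n_lines)
println(matching)
println(first_col)
println(deduped)
println(total)
```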

We should also probably make a command line tool for that purpose which could be used to write oneliners like

cat data.txt | jsed "println(uppercase(line))"

or for more complex processing on a stream of lines.

Commands could be given like

cat data.txt | jsed "line = uppercase(line)" "cur_name, cur_symb, _, value = split(line)" "cur_symb = strip(cur_symb, ['(', ')'])" "println(\"$(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)\")"

or using multiline

cat data.txt | jsed """
line = uppercase(line)
cur_name, cur_symb, _, value = split(line)
cur_symb = strip(cur_symb, ['(', ')'])
println("$(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)")
"""

A “specialised” Julia template could be selected with a parameter passed to jsed, but in most cases it shouldn’t be necessary.

What is your opinion about such a tool?

PS: maybe we should have something like

for (count, line) in enumerate(readlines(stdin))
    line = uppercase(line)
    cur_name, cur_symb, _, value = split(line)
    cur_symb = strip(cur_symb, ['(', ')'])
    println("$(count) $(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)")
end

so that we can use count to process headers differently from data, or to skip headers. The template would then be

for (count, line) in enumerate(readlines(stdin))
    ...
end
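
For example, treating the first line as a header could be sketched like this. The comma-separated input with a header row is an assumption made for the example (data.txt above has no header), and the function name `process` is hypothetical.

```julia
# Read lines from any IO (stdin by default) and write to `out` (stdout by default).
function process(io::IO=stdin; out::IO=stdout)
    for (count, line) in enumerate(eachline(io))
        if count == 1
            println(out, "# ", line)   # first line is the header: pass it through, marked
            continue
        end
        name, value = split(line, ',')
        # count - 1 is the data row number, since row 1 was the header
        println(out, count - 1, " ", uppercase(name), " ", value)
    end
end

# Demo with an in-memory buffer instead of a real pipe
process(IOBuffer("name,value\nbitcoin,5\neuros,100\n"))
```

This prints:

```
# name,value
1 BITCOIN 5
2 EUROS 100
```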

I don’t see the advantage compared to just writing a script.

For the same reasons that people use sed, awk, and other Bash tools instead of a “complete” language.

I’m trying

cat data.txt | julia -e "
for (count, line) in enumerate(readlines(stdin))
    line = uppercase(line)
    cur_name, cur_symb, _, value = split(line)
    cur_symb = strip(cur_symb, ['(', ')'])
    println(\"$(count) $(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)\")
end
"

but it doesn’t work.

-bash: count: command not found
-bash: command substitution: line 1: syntax error near unexpected token `cur_name,'
-bash: command substitution: line 1: `lpad(cur_name, 10)'
-bash: command substitution: line 1: syntax error near unexpected token `value,'
-bash: command substitution: line 1: `lpad(value, 3)'
-bash: cur_symb: command not found

Any idea?

sed, awk, etc. are themselves rather complex languages (if not as complex as Julia). AFAIK their main advantages are speed, availability, and familiarity (where applicable, i.e., if you already use these tools).

But for my own purposes, Julia is already available, it is fast, and I am familiar with it. Also, for complex operations, it has the advantage that I can modularize my code and unit test parts. I find this crucial, and this is why I am not that keen on doing complex things with command line tools.
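
To illustrate the point about modularizing and testing, the per-line logic of the script earlier in this thread can be pulled into a function and checked with the Test standard library. This is only a sketch; the function name `format_line` is my own.

```julia
using Test

# Turn one "name (symbol) : value" line into right-aligned, upper-cased columns.
function format_line(line::AbstractString)
    name, symb, _, value = split(uppercase(line))
    symb = strip(symb, ['(', ')'])
    return "$(lpad(name, 10)) $(lpad(value, 3)) $(symb)"
end

@testset "format_line" begin
    @test format_line("bitcoin (btc) : 5") == "   BITCOIN   5 BTC"
    @test format_line("euros (€) : 100")   == "     EUROS 100 €"
end
```

The script itself then shrinks to a loop such as `for line in eachline(stdin); println(format_line(line)); end`, and the part that is easy to get wrong is covered by tests.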

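In fact, the problem was that $ needs to be escaped with \ inside double quotes on the command line; otherwise bash tries to run $(...) as a command substitution before Julia ever sees it.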

WIP:

jsed.jl

code = join(ARGS, ";")
template = """
for (count, line) in enumerate(readlines(stdin))
    $(code)
end
"""
eval(Meta.parse(template))

Usage:

cat data.txt | julia jsed.jl """
line = uppercase(line)
cur_name, cur_symb, _, value = split(line)
cur_symb = strip(cur_symb, ['(', ')'])
println(\"\$(count) \$(lpad(cur_name, 10)) \$(lpad(value, 3)) \$(cur_symb)\")
"""

or as one-liner

cat data.txt | julia jsed.jl "line = uppercase(line)" "cur_name, cur_symb, _, value = split(line)" "cur_symb = strip(cur_symb, ['(', ')'])" "println(\"\$(count) \$(lpad(cur_name, 10)) \$(lpad(value, 3)) \$(cur_symb)\")"

or

cat data.txt | julia jsed.jl "line = uppercase(line); cur_name, cur_symb, _, value = split(line); cur_symb = strip(cur_symb, ['(', ')']); println(\"\$(count) \$(lpad(cur_name, 10)) \$(lpad(value, 3)) \$(cur_symb)\")"

but jsed now needs to be more “generic”, accepting several kinds of templates.
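
One possible direction, as a rough sketch only: build the program text from a named template and choose the template with a leading argument. The `--template=NAME` flag, the template names "lines" and "fields", and the function names are all hypothetical; none of them exist in the WIP above.

```julia
# Build the program text for a given template name (hypothetical names).
function build_program(code::AbstractString; template::AbstractString="lines")
    if template == "lines"
        # `count` and `line` are visible to the user code
        return "for (count, line) in enumerate(eachline(stdin))\n$code\nend"
    elseif template == "fields"
        # additionally pre-split each line into whitespace-separated `fields`
        return "for (count, line) in enumerate(eachline(stdin))\nfields = split(line)\n$code\nend"
    else
        error("unknown template: $template")
    end
end

# An optional leading "--template=NAME" argument selects the template;
# the remaining arguments are joined with ";" as in the WIP version.
function main(args)
    template = "lines"
    if !isempty(args) && startswith(args[1], "--template=")
        template = split(args[1], '='; limit=2)[2]
        args = args[2:end]
    end
    program = build_program(join(args, ";"); template=template)
    eval(Meta.parse(program))
end

# main(ARGS)   # uncomment when saving this as jsed.jl
```

With that line uncommented, something like `cat data.txt | julia jsed.jl --template=fields 'println(fields[1])'` should print the first column. As in the WIP version, the user code is passed straight to `eval`, so this is only suitable for commands you type yourself.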

We also need jsed to be available as a system command (i.e., on the PATH) so that we can do:

cat data.txt | jsed "line = uppercase(line)" "cur_name, cur_symb, _, value = split(line)" "cur_symb = strip(cur_symb, ['(', ')'])" "println(\"\$(count) \$(lpad(cur_name, 10)) \$(lpad(value, 3)) \$(cur_symb)\")"

I like your idea. :)

But I would probably start with a package that acts as a helper for these things. (You are probably heading there with your WIP.)

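You could add the following line to your .bashrc file (replace /path/to with the directory where jsed.jl actually lives; a relative path would only work from that directory):

alias jsed='julia /path/to/jsed.jl'

Or you could use single quotes (apostrophes), so that the shell leaves $ alone and no escaping is needed: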

$ echo 'hello ;)' | jsed '''
println("$line")
'''
hello ;)

I completely agree with your points regarding data cleaning.
HOWEVER as a systems guy - we will continue to use bash/sed/awk and friends.
Why? You can usually expect to have these installed in a default Linux installation.
Julia is not a default install.
My argument falls apart for cross platform Windows and Linux of course!

Also systems guys are VERY wary of making systems dependent on any “non standard” versions.
I have a genuine war story of a $million SGI system which would not boot following an update of the bash version (not an official SuSE update). Cue me booting from a USB stick and reversing the update.

I use the_silver_searcher which I really quite love.

I kinda feel ripgrep obsoleted the_silver_searcher.

Cool, I was not aware of this. Looks like there is even a vim plugin which was one of my main uses of silver searcher. Looking forward to experimenting with it.