For example, I wanted to do something similar to this Python script.
script.py
import sys

for line in sys.stdin:
    line = line.upper()
    cur_name, cur_symb, _, value = line.split()
    cur_symb = cur_symb.strip('()')
    print(f"{cur_name:>10} {value:>3} {cur_symb}")
    #print(line, end='')
with data.txt
bitcoin (btc) : 5
euros (€) : 100
dollars ($) : 80
which can be called using:
cat data.txt | python script.py
and output:
   BITCOIN   5 BTC
     EUROS 100 €
   DOLLARS  80 $
and I finally wrote this Julia script:
script.jl
for line in readlines(stdin)
    line = uppercase(line)
    cur_name, cur_symb, _, value = split(line)
    cur_symb = strip(cur_symb, ['(', ')'])
    println("$(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)")
end
which can be called using
cat data.txt | julia script.jl
Maybe we should write a kind of tutorial about data cleaning with Julia, taking inspiration from the many data cleaning tutorials that teach how to use iconv, head, tail, tr, wc, split, sort, uniq, cut, paste, join, grep, sed, awk, … and show that we can do all these tasks with ONE tool: Julia (I know it’s not the Unix philosophy).
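To give an idea of what such a tutorial could cover, here is a rough sketch of Julia counterparts for a few of these filters. The function names (`head_lines`, `grep_lines`, `sort_uniq`, `word_count`) are made up for illustration, and `first(lines, n)` assumes Julia 1.6 or later.

```julia
# Hypothetical helpers: rough Julia counterparts of a few Unix filters.
# Each one works on a vector of lines, e.g. the result of readlines(stdin).

# head -n N: keep at most the first n lines (first(itr, n) needs Julia >= 1.6)
head_lines(lines, n) = first(lines, n)

# grep PATTERN: keep only the lines matching a regular expression
grep_lines(pattern::Regex, lines) = filter(l -> occursin(pattern, l), lines)

# sort | uniq: distinct lines in sorted order
sort_uniq(lines) = sort(unique(lines))

# wc -w: total number of whitespace-separated words
word_count(lines) = mapreduce(l -> length(split(l)), +, lines; init=0)
```

These can be chained like a pipeline, for example `sort_uniq(grep_lines(r"^b", readlines(stdin)))`.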
We should probably also make a command-line tool for that purpose, which could be used to write one-liners like
cat data.txt | jsed "println(uppercase(line))"
or more complex processing on a stream of lines.
Commands could be given like
cat data.txt | jsed "line = uppercase(line)" "cur_name, cur_symb, _, value = split(line)" "cur_symb = strip(cur_symb, ['(', ')'])" "println(\"\$(lpad(cur_name, 10)) \$(lpad(value, 3)) \$(cur_symb)\")"
or using multiline
cat data.txt | jsed """
line = uppercase(line)
cur_name, cur_symb, _, value = split(line)
cur_symb = strip(cur_symb, ['(', ')'])
println("$(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)")
"""
A “specialised” Julia template could be set using a parameter passed to jsed, but in most cases it shouldn’t be necessary.
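To make the idea more concrete, here is a minimal sketch of how jsed might be implemented. This is only one possible approach, not an existing tool: the file name `jsed.jl` and the functions `make_processor` and `run_jsed` are hypothetical. It joins the snippets given on the command line into the body of a function of `line` and calls that function on every line of the input.

```julia
# jsed.jl: hypothetical sketch of the proposed tool (not an existing package)

# Turn the code snippets into an anonymous function `line -> begin ... end`.
function make_processor(snippets::AbstractVector{<:AbstractString})
    body = Meta.parse("begin\n" * join(snippets, "\n") * "\nend")
    return Core.eval(Main, Expr(:->, :line, body))
end

# Apply the generated function to each line read from `input`.
function run_jsed(snippets, input::IO=stdin)
    f = make_processor(snippets)
    for line in eachline(input)
        # invokelatest is needed because `f` was created by eval at run time
        Base.invokelatest(f, line)
    end
end

# Only run when called as a script with at least one snippet.
if abspath(PROGRAM_FILE) == @__FILE__ && !isempty(ARGS)
    run_jsed(ARGS)
end
```

It could then be called with `cat data.txt | julia jsed.jl 'println(uppercase(line))'`. Single quotes (or escaping `$` inside double quotes) keep the shell from interpreting `$(...)` before Julia sees it.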
What is your opinion about such a tool?
PS: maybe we should have something like
for (count, line) in enumerate(readlines(stdin))
    line = uppercase(line)
    cur_name, cur_symb, _, value = split(line)
    cur_symb = strip(cur_symb, ['(', ')'])
    println("$(count) $(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)")
end
so that we can process headers differently from data, or skip headers, using count. The template would then be
for (count, line) in enumerate(readlines(stdin))
    ...
end
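As a sketch of how count could be used to skip headers, assuming the input starts with a header line (data.txt above has none) and using a hypothetical function `format_rows` with a made-up `header_lines` keyword:

```julia
# Hypothetical helper: format every data line, skipping the first
# `header_lines` lines of the input.
function format_rows(io::IO; header_lines::Int=1)
    rows = String[]
    for (count, line) in enumerate(eachline(io))
        count <= header_lines && continue   # skip the header line(s)
        cur_name, cur_symb, _, value = split(uppercase(line))
        cur_symb = strip(cur_symb, ['(', ')'])
        push!(rows, "$(lpad(cur_name, 10)) $(lpad(value, 3)) $(cur_symb)")
    end
    return rows
end
```

Calling `foreach(println, format_rows(stdin))` prints the formatted data lines, and `header_lines=0` gives the behaviour of the original script.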