Parse structured string to dictionary

strings

#1

Let’s say I have a string which has the following template structure:
"a_{a}_b_{b}_c_{c}.{txt}.{ext}"

Example: “a_3_b_4_c_5.whatever.jld”

My question is how to conveniently generate such a string from given (string) values for a, b, c, txt and ext and (more importantly) how to parse it back into a dict with keys a, b, c, txt and ext.

In Python, this is how I would do it:

import parse
templ = "a_{a}_b_{b}_c_{c}.{txt}.{ext}"

# generating
a = "3"
b = "4"
c = "5"
txt = "whatever"
ext = "jld"
s = templ.format(a=a, b=b, c=c, txt=txt, ext=ext)
# s == 'a_3_b_4_c_5.whatever.jld'

# parsing
result = parse.parse(templ, s)  # named `result` to avoid shadowing the built-in `dict`
# result == <Result () {'a': '3', 'b': '4', 'c': '5', 'txt': 'whatever', 'ext': 'jld'}>
result["a"] # == "3"
result["ext"] # == "jld"

In Julia:

# generating
a = "3"
b = "4"
c = "5"
txt = "whatever"
ext = "jld"
s = "a_$(a)_b_$(b)_c_$(c).$(txt).$(ext)"

# parsing
# this is my question

How do I do the parsing part nicely? Of course I could manually use split etc., but this seems rather inconvenient if you have to do it often, and it doesn’t generalize. Is there a package for this?


#2

I think that split is a very convenient solution, e.g.

julia> function parse1(s)
           pairs, txt, ext = split(s, ".")
           dict = Dict(Iterators.partition(split(pairs, "_"), 2))
           dict, txt, ext
       end
parse1 (generic function with 1 method)

julia> s = "a_3_b_4_c_5.whatever.jld"
"a_3_b_4_c_5.whatever.jld"

julia> parse1(s)
(Dict("c"=>"5","b"=>"4","a"=>"3"), "whatever", "jld")

#3

Thanks for your answer. However, you encode the template information in your parse1 function, which I find inconvenient compared to the nice Python version above.

As a consequence, it also doesn’t generalize nicely. Imagine another string with a completely different structure. You would have to write a new version of parse1 every time, while in Python I just define the new template (just one - very natural - line).
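
For the generating direction, a template kept as a plain string can be filled in along these lines. This is only a sketch: `render` is a made-up helper name, not something from a package, and it assumes placeholders look like `{name}` with word characters only.

```julia
# `render` is a hypothetical helper, not part of Base or any package.
function render(tmpl::AbstractString, values::AbstractDict)
    # Each match is the full "{name}" substring; strip the braces to get the key.
    replace(tmpl, r"\{(\w+)\}" => m -> values[String(m[2:end-1])])
end

templ = "a_{a}_b_{b}_c_{c}.{txt}.{ext}"
vals = Dict("a" => "3", "b" => "4", "c" => "5", "txt" => "whatever", "ext" => "jld")
s = render(templ, vals)
# s == "a_3_b_4_c_5.whatever.jld"
```

A placeholder whose name is missing from `vals` raises a `KeyError`, which seems preferable to silently leaving it in the output.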


#4

I see what you want now, sorry I did not get it the first time.

I am not aware of a package that does this. However, you should be able to do the following fairly easily:

  1. write a function that parses a template to a regexp and a vector of keys for each position,
  2. wrap it in a structure,
  3. define a function that uses the regexp to capture the matches, then generate the dictionary.

If you need help, please ask here.
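
The three steps above can be sketched roughly as follows. The names `Template` and `parsetemplate` are invented for this illustration, and the sketch assumes Python-style `{name}` placeholders and that the literal parts of the template do not themselves contain `\E`.

```julia
# Step 2: a small structure holding the compiled regex and the placeholder names.
struct Template
    pattern::Regex
    names::Vector{String}
end

# Quote literal text so that regex metacharacters in it are matched verbatim.
quoteliteral(s) = isempty(s) ? "" : "\\Q" * s * "\\E"

# Step 1: turn a template string into an anchored regex plus the list of keys.
function Template(tmpl::AbstractString)
    names = String[]
    pattern = "^"
    pos = 1
    for m in eachmatch(r"\{(\w+)\}", tmpl)
        # literal text between the previous placeholder and this one
        pattern *= quoteliteral(tmpl[pos:prevind(tmpl, m.offset)])
        pattern *= "(.+?)"              # one lazy capture group per placeholder
        push!(names, m.captures[1])
        pos = m.offset + ncodeunits(m.match)
    end
    pattern *= quoteliteral(tmpl[pos:end])
    Template(Regex(pattern * "\$"), names)
end

# Step 3: match the string and pair the captures with the keys.
function parsetemplate(t::Template, s::AbstractString)
    m = match(t.pattern, s)
    m === nothing && error("string does not match template")
    Dict(zip(t.names, String.(m.captures)))
end

t = Template("a_{a}_b_{b}_c_{c}.{txt}.{ext}")
d = parsetemplate(t, "a_3_b_4_c_5.whatever.jld")
# d["a"] == "3", d["ext"] == "jld"
```

The capture groups are lazy and the pattern is anchored at both ends, so each value extends only up to the next literal separator. Values that contain a separator themselves remain ambiguous.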


#5

This is rudimentary and could use some refinements, but basically works:

struct TemplateParser
    pattern::Regex
    names::Vector{Symbol}
end

function Base.parse(tp::TemplateParser, s)
    m = match(tp.pattern, s)
    m === nothing && error("no match")   # `===` is the idiomatic check against `nothing`
    Dict(zip(tp.names, m.captures))
end

function escape_regex(s)        # NOTE probably could use some work
    e = ""
    for c in s
        if c ∈ ['.', '*', '\\']
            e *= '\\'
        end
        e *= c
    end
    e
end

macro templateparser(s)
    s.head == :string || error("Use a string expression with interpolation")
    names = Vector{Symbol}()
    pattern = ""
    for arg in s.args
        if arg isa String
            pattern *= escape_regex(arg)
        elseif arg isa Symbol
            pattern *= "(.*)"
            push!(names, arg)
        end
    end
    TemplateParser(Regex(pattern), names)
end

t = @templateparser "a_$(a)_b_$(b)_c_$(c).$(txt).$(ext)"

parse(t, "a_3_b_4_c_5.whatever.jld")

No doubt it has horrible corner cases :smile:
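
One such corner case, shown with plain regexes rather than the macro: a greedy, unanchored `(.*)` group can swallow a separator when a value happens to contain one. The patterns below are just illustrations of that behaviour, and making the groups lazy and anchoring the pattern is one possible refinement, not the only one.

```julia
# Greedy groups, as generated above: the first group grabs as much as it can.
greedy = r"a_(.*)_b_(.*)"
g = match(greedy, "a_1_b_2_b_3").captures
# g == ["1_b_2", "3"]

# Lazy groups with anchors: each group stops at the first following separator.
lazy = r"^a_(.*?)_b_(.*?)$"
l = match(lazy, "a_1_b_2_b_3").captures
# l == ["1", "2_b_3"]

# Without anchors, surrounding junk is silently accepted as well.
junk = match(greedy, "xx_a_1_b_2")
# junk !== nothing, while match(lazy, "xx_a_1_b_2") === nothing
```

Neither reading is "right" for an ambiguous input, but the anchored lazy version at least rejects strings that do not follow the template from start to end.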


#6

Thanks! The macro is nice and clever! This was my attempt (I’m really really bad at string parsing/regexp etc.)

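function myparse(tmpl::String, s::String)
    tmp = s
    # placeholder names: everything between "{" and "}"
    # (eachmatch replaces matchall, which was removed in Julia 1.0)
    kwds = [m.match for m in eachmatch(r"(?<=\{).+?(?=\})", tmpl)]
    # literal separators before, between, and after the placeholders
    splits = [split(part, "}")[end] for part in split(tmpl, "{")]
    if splits[end] == ""
        # template ends with a placeholder: append a sentinel separator
        splits[end] = "."
        tmp *= "."
    end

    vals = Vector{String}(undef, length(splits) - 1)
    for k in 1:length(splits)-1
        # drop everything up to and including the current separator
        # (findfirst replaces searchindex, which was removed in Julia 1.0)
        start = first(findfirst(splits[k], tmp)) + ncodeunits(splits[k])
        tmp = tmp[start:end]
        # the value runs up to the next separator
        vals[k] = split(tmp, splits[k+1])[1]
    end

    return Dict(zip(kwds, vals))
end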