Multiway string substitution

I have set an ambitious goal of modeling some Wolfram Physics Project features into Julia. Because actually, Julia works faster and more efficiently, while the subject is very interesting.
A bit later I will be happy to present a beta version of a small package in which everyone can participate.

In the meantime, I’d like to ask a few questions.

One of the key concepts in WPP is the Multiway system. Its essence is simple.

Let there be an initial string "A" and dictionary of rules: ("A" => "BBB", "BB" => "A").
Using them, we first get an obvious result: “A” → “BBB”
But the next step we can do in several ways (greedy method, first occurence method etc), of which I am interested in Multiway method, when we collect all possible results, so: “BBB” → [“AB”, “BA”] (this is like the Multiverse).

Looks like Julia’s built-in features don’t allow this. We will have to make an array from the string and analyze it manually. In this case, probably should abandon the strings at all?
But maybe I don’t know all the possibilities, so I’ll take any advice.

1 Like

Here’s how to collect all possible matches.

julia> matches = collect(eachmatch(r"BB", "BBB", overlap=true))
2-element Vector{RegexMatch}:
 RegexMatch("BB")
 RegexMatch("BB")

julia> matches[1].offset
1

julia> matches[2].offset
2

Don’t know about carrying out the replacements, though.

Thank you, but I’d like to avoid Regex. Because of the difficulties with the replacement and because it is a completely different ideology

Regex are very good and fast way to match strings.

Hmm, ok, may be I should do it with Regex, but what about replacement (and to collect different ways)?
And! What is maximal string length that not overload Regex searching?

I am also not aware of any Julia built-in feature that supports this.

However, you might achieve something quite elegant with a little tinkering and the right data structures.

I suggest taking a look at this.

I would also consider a domain modeling approach where you can create your own data type and maybe implement the AbstractTrees interface over it.

As a general rule, if performance is paramount and you have a limited set of rules, and all the key / replacement values are known at the compile time, then you might want to give up the String anyway and replace it with a more memory friendly type (you might still generate a String output as final result).

1 Like

This package may also provide some inspiration: GitHub - BioJulia/Automa.jl: A julia code generator for regular expressions

I’m not sure it does what you want, but using similar techniques to write a “Rule to Julia” compiler might be a performant (though not necessarily simple) way to do what you want.

It’s not difficult to do the replacements once the matches are found, with a little code:

function replacements_all(a::AbstractString, b::Pair{Regex, <:SubstitutionString})
    (r, s) = b
    matches = eachmatch(r, a, overlap = true)
    results = String[]
    for m in matches
        s1 = @views a[1:(m.offset - 1)]
        s2 = @views a[m.offset:end]
        s2replaced = replace(s2, r => s, count = 1)
        push!(results, s1 * s2replaced)
    end
    results
end

Test:

julia> replacements_all("BBBB", r"BB" => s"A")
3-element Vector{String}:
 "ABB"
 "BAB"
 "BBA"

There is some inefficiency here since replace searches the regex pattern again.

1 Like

Yes, your answer is very close to what I’ve found.
AbstractTrees is musthave for Multiway System, thank you!
Sooner I’ll publish a small “SubstitutionSystem” project and will be glad of your advice!

1 Like

After thinking, I realized that I would have to use Regex, because it is difficult to organize a proper search in Vector

function multisbs(str,sstr, s)
    res=String[]
    fp=findfirst(str, s)
    while !isnothing(fp)
        push!(res,s[1:first(fp)-1]*sstr*s[last(fp)+1:end])
       #push!(res,s[1:fp.start-1]*sstr*s[fp.stop+1:end])
        fp=findnext(str, s,first(fp)+1)
    end
    res
end

julia> multisbs("BBA","A","BBABBA")
2-element Vector{String}:
 "ABBA"
 "BBAA"

julia> multisbs("BB","A","BBABBA")
2-element Vector{String}:
 "AABBA"
 "BBAAA"

julia> multisbs("BB","A","BBBBA")
3-element Vector{String}:
 "ABBA"
 "BABA"
 "BBAA"

julia> multisbs("BB","A","BBBBB")
4-element Vector{String}:
 "ABBB"
 "BABB"
 "BBAB"
 "BBBA"

This is really great! Now I’m trying to adapt it for vectors,
but the way findfirst works for Vector bother me (

what do you mean exactly? vector{UInt8} instead of strings or something?

The problem is just find sub-vector in vector! How to make my_findfirst_vec,
so my_findfirst_vec([1, 1], [0, 1, 0 , 0, 1, 1, 1, 1]) return 5:6 (0_0)
I cannot find such recipe in community

if you damain is contained in UInt8, you can do

findfirst([0x1, 0x1], UInt8[0, 1, 0 , 0, 1, 1, 1, 1]) 

or more generally

using IterTools
findfirst(==((1,1)), collect(partition([0, 1, 0 , 0, 1, 1, 1, 1],2,1))) 
function searchfirstsubvec(v,sv)
    l=length(sv)-1
    for i in eachindex(v)
        sv==v[i:i+l] ? (return i:i+l) : continue
    end
end
1 Like

If you know all replacement rules in advance the data structure you’re looking for is Aho-Corasik automaton (sometimes known as index automaton). It allows e.g. to find all matches in time linear with the length of the input string.

However it is not clear to me how do you derive A[AB, BA]. To me your rules generate an infinite chain (or rather a tree) of replacements which you could in theory traverse breadth-first. The question is what do you want to obtain?

(to avoid infinite chains/branches one usually adds shift-invariant well-order on the free monoid of strings (e.g. shorter-then-lexicographical) – this of course is not trouble free, as choosing the right order for your application is a problem of its own :wink:

I’m going to implement functions SubstitutionSystem and MultiwaySystem from Wolfram ecosystem on Julia.
In some ways they are similar to Aho–Corasick algorithm, but in many ways they are not.

First of all, there are no problems with infinite string expansion and numbers of replacement results.
In the near future I will post a link to Git, and I really hope that Julia’ community will help me to create a good package!

Aho-Corasik is just (just?!) a data structure that allows you to find all occurrences of LHSes in a given input string in optimal time. Here by “string” I mean anything that is linearly ordered. I’m pretty sure Mathematica uses it behind the scenes to describe the “evolution of MultiwaySystem”. After reading their docs it’s still very much unclear to me what does it really mean, but hey, maybe your package is going to do a better job! I’m looking forward to reading it!