Removing characters from String

Hello,

I would like to remove some characters from a String, but that seems to be really complicated in Julia :frowning:
Example:

str = "ž1ž1"
r = r"1"
idx = eachmatch(r, str) .|> o -> o.offset
result = String(deleteat!(collect(str), idx))

I’m used to working with strings as character arrays, without having to worry about encoding bytes. Is there a simple way to achieve this, or should I move my text processing tasks back to C#?

If you already use a regular expression, you may find replace useful.
Your above code doesn’t work for me, but if you want to remove all “1” in your str it would be as easy as:

result = replace(str,r => "" )

Thanks for the suggestion, but unfortunately that wouldn’t work for me, since I want to remove only some parts of each regex match.
Basically, I want to get the code posted above to work with UTF-8 strings.
The problem arises because the RegexMatch offset field is the starting byte index rather than the character index.
So the indices of collect(::String) differ from those of the String itself whenever the string contains multi-byte characters.
For example:

julia> str = "ž1ž1"
"ž1ž1"

julia> str[3]
'1': ASCII/Unicode U+0031 (category Nd: Number, decimal digit)

julia> collect(str)[3]
'ž': Unicode U+017E (category Ll: Letter, lowercase)
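
A minimal sketch of one way to bridge the two index spaces, assuming the goal is still to delete whole characters from collect(str): length(str, 1, i) counts the characters from index 1 through byte index i, which turns a RegexMatch offset into a position in the character array. The name char_idx is only illustrative.

```julia
str = "ž1ž1"
r = r"1"

# m.offset is a byte index into str; length(str, 1, m.offset) is the number of
# characters up to and including that byte, i.e. the index into collect(str).
char_idx = [length(str, 1, m.offset) for m in eachmatch(r, str)]

# deleteat! expects sorted, unique indices; matches come in order, so this holds.
result = String(deleteat!(collect(str), char_idx))
```

This keeps the character-array style of the original code and only changes how the indices are computed.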

Please explain what you are really trying to do; that will get you better suggestions for your real problem.

So far it is still not clear what exactly you are trying to do, so I am guessing. Taking it as given that you need to work on the array of characters collect(str), I propose this:

cs=collect(str)
result=deleteat!(cs,findall(x -> x !== nothing, match.(r,string.(cs))))

But I am also sure, that this is not what you should use, whatever real problem you try to solve.

See Strings · The Julia Language

It is a bit hard to understand the exact requirements you have. You could do something like:

str = "ž1ž1"
r = r"1"
offsets = [m.offset for m in eachmatch(r, str)]
sprint() do io
    for i in setdiff(eachindex(str), offsets)
        print(io, str[i])
    end
end

Real problem:

str = "aa <12> bb <c> ąą <123> dd"
r = r"<.+?>"
mask = BitArray((0, 1, 1))
m = collect(eachmatch(r, str))[mask]
idx = (m .|> 
    o -> [o.offset, o.offset+length(o.match)-1]) |> 
    o -> reduce(vcat, o)
res = String(deleteat!(collect(str), idx))

"aa <12> bb c ąą <13> d"

“<>” is correctly removed from “<c>”, but not from “<123>”

 replace(str, r"<(\w+)>"=>s"\1")

?

replace accounts for the discrepancy you point out and works as it would on normal strings:

julia> s = "ž1ž1"
"ž1ž1"

julia> replace(s, "1" => "2")
"ž2ž2"

Regexes work fine with replace too. For example, to get rid of a single non-word character (here a space) after ž1:

julia> s2 = "ž1 ž1";

julia> rx = r"ž1[\W]?";

julia> replace(s2, rx => "ž1")
"ž1ž1"

I can’t solve this problem only with regex replace since, as shown in my newer example, filtering depends not only on input String but also on BitArray mask. This mask is constructed based on RegexMatch fields and external data.
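
For what it is worth, here is a sketch of one way to apply such a mask without deleting by index at all: walk the matches in order and copy the pieces into a buffer. The function name strip_masked and the capture group added to the regex are assumptions for illustration, and the mask is hard-coded here where the real code would compute it from the match data.

```julia
# Hypothetical helper: for each match whose mask entry is true, keep only the
# text inside the brackets; all other matches are copied through unchanged.
function strip_masked(str::AbstractString, r::Regex, mask)
    io = IOBuffer()
    pos = firstindex(str)                      # next byte index still to copy
    for (strip, m) in zip(mask, eachmatch(r, str))
        # Text between the previous match and this one (may be empty).
        print(io, SubString(str, pos, prevind(str, m.offset)))
        # Either the first capture group or the whole match.
        print(io, strip ? m.captures[1] : m.match)
        # m.offset and ncodeunits are both in bytes, so no conversion is needed.
        pos = m.offset + ncodeunits(m.match)
    end
    print(io, SubString(str, pos))             # tail after the last match
    return String(take!(io))
end

str = "aa <12> bb <c> ąą <123> dd"
r = r"<(.+?)>"                                 # same pattern, plus a capture group
mask = [false, true, true]
res = strip_masked(str, r, mask)
```

Because every index used here comes from the regex engine or from prevind, the code never has to translate between byte positions and character positions.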

It’s not at all clear to me what you are looking for. Can you give an example with correct expected input and output, that cannot be achieved with regex?

You could post your simple C# solution and we can do the translation.

If you just want to work on bytes you should use codeunits instead of collect:

str = "aa <12> bb <c> ąą <123> dd"
r = r"<.+?>"
mask = BitArray((0, 1, 1))
matches = collect(eachmatch(r, str))[mask]
# m.offset is the first byte of the match, so the last byte is offset + sizeof - 1
ranges_to_remove = [m.offset:(m.offset + sizeof(m.match) - 1) for m in matches]
ranges_to_keep = setdiff(1:sizeof(str), reduce(vcat, ranges_to_remove))
String(codeunits(str)[ranges_to_keep])

# "aa <12> bb  ąą  dd"

codeunits solves this problem.
Fixed my example:

str = "aa <12> bb <c> ąą <123ą> dd"
r = r"<.+?>"
mask = BitArray((0, 1, 1))
m = collect(eachmatch(r, str))[mask]
idx = (m .|> 
    o -> [o.offset, o.offset+sizeof(o.match)-1]) |> 
    o -> reduce(vcat, o)
res = String(deleteat!(collect(UInt8, codeunits(str)), idx))

This does introduce a considerable amount of complexity (compared to languages where a string is indexed as a char array and not as a byte array).
It would be nice to be able to write code like:

julia> str = "žž"
"žž"

julia> length(str) >= 2 && str[2] == "ž"
ERROR: StringIndexError("žž", 2)
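
A small sketch of how that check can be written today, assuming the intent is "the second character is ž". Note that the comparison needs a Char literal ('ž'); comparing a Char with the String "ž" is always false. The variable names are only illustrative.

```julia
str = "žž"

# Byte index 2 lies inside the first 'ž', so it is not a valid string index.
second_byte_valid = isvalid(str, 2)

# Step from the first character to the next one instead of adding 1.
i2 = nextind(str, firstindex(str))
second_is_z = length(str) >= 2 && str[i2] == 'ž'

# Alternatively, pay for one copy and index a Vector{Char} by character position.
chars = collect(str)
second_is_z_chars = length(chars) >= 2 && chars[2] == 'ž'
```

The first form avoids allocating; the second matches the char-array style at the cost of a copy.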

No, it is only you who introduces complexity, without explaining why or where it comes from.
E.g., does your BitArray already know how many matches the regex will find? Or is this a special case, and the real thing is much more complex…

Sometimes trying to help gets tedious…

:unamused: Maybe it is just me having a bad day, sorry…

Sorry for my inability to provide a concise explanation. I was trying to extract the relevant parts from a rather large amount of code. The number of matches (and so the BitArray size) is known before matching the regex pattern. Its contents depend on the match data and other variables, so they can’t be computed by the regex engine or before matching.

Potentially relevant context on C#: https://news.ycombinator.com/item?id=6602746
