Removing characters from String

LinasBa · January 26, 2021, 9:42am

Hello,

I would like to remove some characters from String, but that seems to be really complicated in Julia
Example:

str = "ž1ž1"
r = r"1"
idx = eachmatch(r, str) .|> o -> o.offset
result = String(deleteat!(collect(str), idx))

I’m used to working with strings as character arrays, without having to worry about encoding bytes. Is there a simple way to achieve this, or should I move my text processing tasks back to C#?

oheil · January 26, 2021, 9:51am

If you already use a regular expression you may find replace useful.
Your above code doesn’t work for me, but if you want to remove all “1” in your str it would be as easy as:

result = replace(str,r => "" )

LinasBa · January 26, 2021, 10:47am

Thanks for the suggestion, but unfortunately that wouldn’t work for me, since I want to remove only some parts of the regex match.
Basically I want to get the code posted above to work with UTF-8 strings.
The problem arises from the RegexMatch offset field being the starting byte index rather than the character index.
So collect(::String) has different indices than the String itself.
For example:

julia> str = "ž1ž1"
"ž1ž1"

julia> str[3]
'1': ASCII/Unicode U+0031 (category Nd: Number, decimal digit)

julia> collect(str)[3]
'ž': Unicode U+017E (category Ll: Letter, lowercase)
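
One way to bridge the two index spaces shown above is to convert each byte offset into a character index before deleting from the collected array. The following is a minimal sketch, assuming the goal is simply to make the first example work on UTF-8 input; `char_idx` is a name introduced here for illustration. It relies on `length(str, i, j)`, which counts the characters whose indices lie between byte indices `i` and `j`.

```julia
str = "ž1ž1"
r = r"1"

# m.offset is a byte index into str. length(str, 1, m.offset) counts the
# characters from byte index 1 through m.offset, which is the position of
# the match in collect(str).
char_idx = [length(str, 1, m.offset) for m in eachmatch(r, str)]

# Delete by character position from the Vector{Char} and rebuild the String.
result = String(deleteat!(collect(str), char_idx))
```

Each call to `length(str, 1, i)` scans from the start of the string, so for long strings with many matches a single pass over `eachindex(str)` may be preferable.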

oheil · January 26, 2021, 11:04am

You should explain this so we can make better suggestions for your real problem.

Since it is still not clear what exactly you are trying to do, I am guessing and taking it as a given that you need to work on the array of characters collect(str). For that I propose these lines:

cs = collect(str)
result = deleteat!(cs, findall(x -> x !== nothing, match.(r, string.(cs))))

But I am also sure, that this is not what you should use, whatever real problem you try to solve.

kristoffer.carlsson · January 26, 2021, 11:05am

See Strings · The Julia Language

It is a bit hard to understand the exact requirements you have. You could do something like:

str = "ž1ž1"
r = r"1"
offsets = [m.offset for m in eachmatch(r, str)]
sprint() do io
    for i in setdiff(eachindex(str), offsets)
        print(io, str[i])
    end
end

LinasBa · January 26, 2021, 11:13am

Real problem:

str = "aa <12> bb <c> ąą <123> dd"
r = r"<.+?>"
mask = BitArray((0, 1, 1))
m = collect(eachmatch(r, str))[mask]
idx = (m .|> 
    o -> [o.offset, o.offset+length(o.match)-1]) |> 
    o -> reduce(vcat, o)
res = String(deleteat!(collect(str), idx))

"aa <12> bb c ąą <13> d"

“<>” is correctly removed from “<c>”, but not from “<123>”

DNF · January 26, 2021, 11:18am

replace(str, r"<(\w+)>" => s"\1")

?

rikh · January 26, 2021, 11:22am

replace accounts for the discrepancy you point out and works as it would on normal strings:

julia> s = "ž1ž1"
"ž1ž1"

julia> replace(s, "1" => "2")
"ž2ž2"

Regexes work fine with replace too. For example, to get rid of whitespace behind ž1:

julia> s2 = "ž1 ž1";

julia> rx = r"ž1[\W]?";

julia> replace(s2, rx => "ž1")
"ž1ž1"

LinasBa · January 26, 2021, 11:34am

I can’t solve this problem only with regex replace since, as shown in my newer example, filtering depends not only on input String but also on BitArray mask. This mask is constructed based on RegexMatch fields and external data.
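
For a mask that is indexed by match order, one possible approach is to pass a function to `replace` and count the matches as they are visited. This is only a sketch under assumptions: `strip_masked` and `pick` are hypothetical names, the mask is assumed to have one entry per match, and the function only receives the matched substring rather than the full RegexMatch, so it does not cover a mask that has to be built during matching.

```julia
# Hypothetical helper: remove the first and last character of the i-th match
# when mask[i] is true, and leave the match untouched otherwise.
function strip_masked(str::AbstractString, r::Regex, mask)
    i = 0
    function pick(m)
        i += 1  # replace visits matches left to right
        # chop works on characters, so a multi-byte character next to
        # the delimiters is handled correctly.
        return mask[i] ? chop(m, head = 1, tail = 1) : m
    end
    return replace(str, r => pick)
end

res = strip_masked("aa <12> bb <c> ąą <123ą> dd", r"<.+?>", [false, true, true])
```

Because `chop` counts characters instead of bytes, no byte offsets appear in this version.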

DNF · January 26, 2021, 11:42am

It’s not at all clear to me what you are looking for. Can you give an example with correct expected input and output, that cannot be achieved with regex?

oheil · January 26, 2021, 12:31pm

You may provide us with your simple C# solution and we do the translation.

kristoffer.carlsson · January 26, 2021, 12:41pm

If you just want to work on bytes you should use codeunits instead of collect:

str = "aa <12> bb <c> ąą <123> dd"
r = r"<.+?>"
mask = BitArray((0, 1, 1))
matches = collect(eachmatch(r, str))[mask]
ranges_to_remove = [m.offset:(m.offset + sizeof(m.match)) for m in matches]
ranges_to_keep = setdiff(1:sizeof(str), reduce(vcat, ranges_to_remove))
String(codeunits(str)[ranges_to_keep])

# "aa <12> bb ąą dd"

LinasBa · January 26, 2021, 1:17pm

codeunits solves this problem.
Fixed my example:

str = "aa <12> bb <c> ąą <123ą> dd"
r = r"<.+?>"
mask = BitArray((0, 1, 1))
m = collect(eachmatch(r, str))[mask]
idx = (m .|> 
    o -> [o.offset, o.offset+sizeof(o.match)-1]) |> 
    o -> reduce(vcat, o)
res = String(deleteat!(collect(UInt8, codeunits(str)), idx))

This does introduce a considerable amount of complexity (compared to languages where a string is indexed as a char array and not as a byte array).
It would be nice to be able to write code like:

julia> str = "žž"
"žž"

julia> length(str) >= 2 && str[2] == "ž"
ERROR: StringIndexError("žž", 2)
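
For reference, a check like the one above can be written with character-based access. This is a small sketch, assuming the intent is "the second character is ž"; the variable names are introduced here for illustration. Note that the comparison needs a `Char` literal (`'ž'`), since a `Char` never equals a `String` such as `"ž"`.

```julia
str = "žž"

# nextind(str, 0, 2) returns the byte index of the 2nd character.
i2 = nextind(str, 0, 2)
second_is_z = length(str) >= 2 && str[i2] == 'ž'

# Alternatively, collect the characters once and index them by position.
chars = collect(str)
second_is_z_chars = length(chars) >= 2 && chars[2] == 'ž'
```

The `collect` variant allocates a `Vector{Char}`, which is usually fine for short strings and gives constant-time access by character position afterwards.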

oheil · January 26, 2021, 2:05pm

No, it is only you, who introduces complexity without explaining why and where it comes from.
E.g.: your BitArray already knows how many matches the regex finds? Or is this something special, and the real thing is much more complex…

Sometimes trying to help gets tedious…

Maybe it’s just me, having a bad day, sorry…

LinasBa · January 26, 2021, 2:42pm

Sorry for my inability to provide a concise explanation. I was trying to extract the relevant parts from a rather large amount of code. The number of matches (or the BitArray size) is known prior to matching the regex pattern. Its contents depend on match data and other variables, so it can’t be computed by the regex engine or before matching.

johnmyleswhite · January 26, 2021, 3:04pm

Potentially relevant context on C#: https://news.ycombinator.com/item?id=6602746
