I would like to remove some characters from a String, but that seems to be really complicated in Julia.
Example:
str = "ž1ž1"
r = r"1"
idx = eachmatch(r, str) .|> o -> o.offset
result = String(deleteat!(collect(str), idx))
I’m used to working with strings as character arrays, without having to worry about encoding bytes. Is there a simple way to achieve this, or should I move my text processing tasks back to C#?
If you already use a regular expression, you may find replace useful.
Your code above doesn’t work for me, but if you want to remove every “1” from your str, it would be as easy as:
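A minimal sketch of that suggestion, assuming the goal is simply to delete every match of the pattern (it uses the `pattern => replacement` form of Base's `replace`):

```julia
str = "ž1ž1"

# Replace every match of the regex with the empty string.
result = replace(str, r"1" => "")   # "žž"
```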
Thanks for the suggestion, but unfortunately that wouldn’t work for me, since I want to remove only some parts of the regex match.
Basically, I want to get the code posted above to work with UTF-8 strings.
The problem arises because the offset field of RegexMatch is the starting byte index rather than a character index.
So collect(::String) has different indices than the String itself.
For example:
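A small sketch of the mismatch, with one possible way around it. The conversion via `length(str, 1, i)` is an assumption about what would fit here, not something taken from the original code:

```julia
str = "ž1ž1"

# String indices are the byte positions where each character starts.
string_indices = collect(eachindex(str))              # [1, 3, 4, 6]
char_indices = collect(eachindex(collect(str)))       # [1, 2, 3, 4]

# RegexMatch.offset is a string (byte) index, not a position in collect(str).
offsets = [m.offset for m in eachmatch(r"1", str)]     # [3, 6]

# One possible conversion: length(str, 1, i) counts the characters up to index i.
idx = [length(str, 1, o) for o in offsets]             # [2, 4]
result = String(deleteat!(collect(str), idx))          # "žž"
```

Using the raw offsets `[3, 6]` with `deleteat!` on the four-element character array is what fails, since index 6 is out of bounds.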
You should explain this in more detail to get better suggestions for your real problem.
Since it is still not clear what exactly you are trying to do, I can only guess. Taking it as given that you need to work on the character array collect(str), I propose the following:
cs = collect(str)
result = deleteat!(cs, findall(x -> x !== nothing, match.(r, string.(cs))))
But I am also sure that this is not what you should use, whatever real problem you are trying to solve.
It is a bit hard to understand the exact requirements you have. You could do something like:
str = "ž1ž1"
r = r"1"
offsets = [m.offset for m in eachmatch(r, str)]
sprint() do io
    for i in setdiff(eachindex(str), offsets)
        print(io, str[i])
    end
end
I can’t solve this problem with regex replace alone since, as shown in my newer example, the filtering depends not only on the input String but also on a BitArray mask. This mask is constructed from RegexMatch fields and external data.
It’s not at all clear to me what you are looking for. Can you give an example with correct expected input and output, that cannot be achieved with regex?
str = "aa <12> bb <c> ąą <123ą> dd"
r = r"<.+?>"
mask = BitArray((0, 1, 1))
m = collect(eachmatch(r, str))[mask]
idx = (m .|>
    o -> [o.offset, o.offset+sizeof(o.match)-1]) |>
    o -> reduce(vcat, o)
res = String(deleteat!(collect(UInt8, codeunits(str)), idx))
This does introduce a considerable amount of complexity (compared to languages where a string is indexed as a char array rather than as a byte array).
It would be nice to be able to write code like:
No, it is only you who introduces complexity, without explaining why and where it comes from.
E.g., does your BitArray already know how many matches the regex will find? Or is this something special, and the real thing is much more complex…
Sorry for my inability to provide a concise explanation. I was trying to extract the relevant parts from a rather large amount of code. The number of matches (i.e. the BitArray size) is known before matching the regex pattern. Its contents depend on match data and other variables, so they can’t be computed by the regex engine or before matching.
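For what it’s worth, here is a minimal sketch of how such a mask could be applied without collecting the string into an array. It assumes the goal is to drop the whole match wherever the mask is true (the example above may intend something narrower), and `remove_masked` is a hypothetical helper name, not an existing function. It relies on `m.offset` being a valid string index, so `SubString` and `prevind` can be used on it directly.

```julia
# Hypothetical helper: drop the matches selected by `mask`, keep everything else.
function remove_masked(str::AbstractString, r::Regex, mask::AbstractVector{Bool})
    io = IOBuffer()
    pos = firstindex(str)                  # next string index still to be copied
    for (m, drop) in zip(eachmatch(r, str), mask)
        # Copy the text between the previous match and this one.
        print(io, SubString(str, pos, prevind(str, m.offset)))
        # Keep the match itself unless the mask says to drop it.
        drop || print(io, m.match)
        pos = m.offset + ncodeunits(m.match)
    end
    # Copy whatever follows the last processed match.
    print(io, SubString(str, pos, lastindex(str)))
    return String(take!(io))
end

str = "aa <12> bb <c> ąą <123ą> dd"
mask = BitVector([false, true, true])
res = remove_masked(str, r"<.+?>", mask)   # "aa <12> bb  ąą  dd"
```

Because the mask is consumed one match at a time, its entries could just as well be computed inside the loop from the RegexMatch fields and any external data.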