Regex: what is wrong?

Why is the last row wrong?

v=["sp. z o.o. asdas"
"sp. z o.o asdasd" 
"sp. z oo asdas"
"sp. zo.o. asdasd"
"sp. zoo. asddfa"
"sp. z o.o. afdasf"
"sp zoo. afdasf"
"sp.zoo. afdasf"
"spzoo afdasf"]

julia> occursin.(r"sp.+z.+o", v)
9-element BitArray{1}:
 1
 1
 1
 1
 1
 1
 1
 1
 0

Thx Paul

+ means 1 or more times. In the last string there is nothing between sp and z, so the match fails.
You can use * for 0 or more times, like:
occursin.(r"sp.*z.+o", v)
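A minimal sketch of the difference between the two quantifiers, using two of the strings from the question (the variable names are only illustrative):

```julia
# `+` needs at least one character between "sp" and "z"; `*` also accepts none.
samples = ["sp. z o.o. asdas", "spzoo afdasf"]

plus_hits = occursin.(r"sp.+z.+o", samples)  # "spzoo" has nothing between "sp" and "z"
star_hits = occursin.(r"sp.*z.+o", samples)  # `.*` allows an empty gap

println(plus_hits)
println(star_hits)
```

Only the first quantifier was changed here; the second .+ still requires at least one character between z and the o that closes the match.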


Thanks!
But now there is one more row, the last one: “zespol szkol w przem”. With that row the solution gives a wrong result. How can I find only strings similar to sp. z o.o.? I think the gap between the characters must be no longer than 1-2 places. How can I do that?

Thanks, the star (*) works!
But:

julia> v=["sp. z o.o. asdas"
       "sp. z o.o asdasd"
       "sp. z oo asdas"
       "sp. zo.o. asdasd"
       "sp. zoo. asddfa"
       "sp. z o.o. afdasf"
       "sp zoo. afdasf"
       "sp.zoo. afdasf"
       "spzoo afdasf"
       "zespol szkol w przem"]

julia> occursin.(r"sp.*z.+o", v)
10-element BitArray{1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
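One possible way to express "no more than 1-2 places between the characters" is a bounded quantifier. This is only a sketch: the limit of two characters per gap is an assumption taken from the question, and the name bounded is illustrative.

```julia
# Assumption: at most two arbitrary characters may sit between "sp" and "z",
# and between "z" and the first "o".
bounded = r"sp.{0,2}z.{0,2}o"

v = ["sp. z o.o. asdas",
     "sp zoo. afdasf",
     "spzoo afdasf",
     "zespol szkol w przem"]

hits = occursin.(bounded, v)
println(hits)
```

The last string no longer matches because "sp" in "zespol" is not followed by a z within two characters. The dot still accepts any character in the gaps, so this remains looser than a pattern that lists the allowed separators explicitly.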

Have you tested this on Regex101.com? Always super helpful for debugging regex


Thanks, I am still practicing at https://regexr.com/ but it’s not easy :)
Paul

What exactly is your desired outcome?
For example, . in your regex means “any character”, but a literal . also appears in the strings in the array. So it’s not clear what you want to match.

Do you mean that they must start with “sp”? In that case, just use ^ as the first character of the regex. This means the regex must match from the start of the string. “zespol szkol w przem” is currently being matched because the substring “spol szko” matches (i.e., “ze[spol szko]l w przem”).
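To check which substring is responsible for a match, match(...).match can be inspected directly. A small sketch (the names s, m and anchored are illustrative):

```julia
s = "zespol szkol w przem"

# Unanchored: the pattern may start anywhere in the string.
m = match(r"sp.*z.+o", s)
println(m.match)   # the substring that the pattern actually matched

# Anchored with ^: the match has to begin at the first character.
anchored = r"^sp.*z.+o"
println(occursin(anchored, s))
println(occursin(anchored, "spzoo afdasf"))
```

With the anchor, the string starting with "ze" is rejected, while a string that begins with "sp" still matches.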



Thanks, ^ works… but the phrase can be anywhere in the string…
At the moment I have a new pattern: r"sp.+z.+o.{1}", but it gives wrong results for rows 5 and 6… Please help.


julia> v=["sf asd sp. z o.o. asdas "
       "dfs sp. z o.o asdasd "
       "ss sp.zoo. afdasf"
       "ssssp.zoo. afdasf"
       "ss spzoo afdasf"
       "ss  ds zespol szkol w przem"]
6-element Array{String,1}:
 "sf asd sp. z o.o. asdas "
 "dfs sp. z o.o asdasd "
 "ss sp.zoo. afdasf"
 "ssssp.zoo. afdasf"
 "ss spzoo afdasf"
 "ss  ds zespol szkol w przem"

julia>

julia> occursin.(r"sp.+z.+o.{1}", v)
6-element BitArray{1}:
 1
 1
 1
 1
 0
 1

I am not sure what your question is. Are you asking why $ makes the regex not match any of the strings? This happens because you have required the second-to-last character to be an o, which is not true for any of the strings. Did you mean to use r"^sp.*z.+o.*$"? I do not think there is a reason to add a $ if you are going to put a .* (or .+) right before it.
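A sketch of the effect described above. The pattern with $ is not shown in the thread, so strict below is a hypothetical reconstruction of it, and loose is a variant with .* before the $:

```julia
# Hypothetical pattern: after the "o" exactly one more character, then end of string.
strict = r"sp.+z.+o.$"
# With `.*` before `$` the end anchor no longer restricts anything.
loose = r"sp.+z.+o.*$"

s = "dfs sp. z o.o asdasd"

println(occursin(strict, s))             # second-to-last character of s is not "o"
println(occursin(strict, "sp. z o.o."))  # here the second-to-last character is "o"
println(occursin(loose, s))
```

Since .* can absorb the rest of the string, the loose variant accepts the same strings as the pattern without $.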


In my language there is a very important official abbreviation, “sp. z o. o.”, but people make many mistakes ;) I have to find every mistaken variant, like: spzoo, sp. zoo, …
At the moment I have found the solution below.

Why does it give wrong results with these 10 rows? (Row 9 I can remove in a second step.)

julia> v=["sf asd sp. z o.o. asdas "
       " dfs sp. z o.o asdasd "
       " ds sp. .z. oo asdas"
       "dsfs sp. zo.o. asdasd"
       "d sp. zoo. asddfa"
       "s sp. z o.o. afdasf"
       "ss sp.zoo. afdasf"
       "ssssp.zoo. afdasf"
       "ss spzoo afdasf"
       "ss  ds zespol szkol w przem"]


julia> occursin.(r"s?p.+z.+o.{0}",v)
10-element BitArray{1}:
 1
 1
 1
 1
 1
 1
 1
 1
 0
 1

Paul

Maybe try debugging it with one of the online regex visualizers, e.g. https://regex101.com

I think what you are looking for is

julia> spzoo = r"s[.\s]*p[.\s]*z[.\s]*o[.\s]*o\.?";

julia> occursin.(spzoo, v)
10-element BitArray{1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 0

The [.\s]* part allows for optional . and whitespace in between the characters. \.? is not strictly necessary for occursin, but with match it makes the result include the trailing . if it is in the string.
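A short sketch of both points, using strings from the example vector. The pattern no_dot is an illustrative variant without the trailing \.? and is not part of the answer above:

```julia
spzoo  = r"s[.\s]*p[.\s]*z[.\s]*o[.\s]*o\.?"
no_dot = r"s[.\s]*p[.\s]*z[.\s]*o[.\s]*o"   # same pattern without the optional dot

s = "d sp. zoo. asddfa"
with_dot    = match(spzoo, s).match    # includes the trailing "."
without_dot = match(no_dot, s).match   # stops at the last "o"
println(with_dot)
println(without_dot)

# `match` returns `nothing` when the pattern does not occur at all.
m = match(spzoo, "ss  ds zespol szkol w przem")
println(m === nothing)
```

Because only dots and whitespace are allowed between the letters, the "spol" in "zespol" cannot start a match.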

To see what this matches exactly:

julia> [match(spzoo, s).match for s in v if !isnothing(match(spzoo, s))] 
9-element Array{SubString{String},1}:
 "sp. z o.o."
 "sp. z o.o"
 "sp. .z. oo"
 "sp. zo.o."
 "sp. zoo."
 "sp. z o.o."
 "sp.zoo."
 "sp.zoo."
 "spzoo"

EDIT:
You might want to add i after the regex to make it case insensitive

julia> spzoo2 = r"s[.\s]*p[.\s]*z[.\s]*o[.\s]*o\.?"i;

julia> occursin(spzoo2, "s Sp. Z o.O. afdasf")
true
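For comparison, a sketch of the same string checked with and without the i flag (the names sensitive and insensitive are illustrative):

```julia
sensitive   = r"s[.\s]*p[.\s]*z[.\s]*o[.\s]*o\.?"
insensitive = r"s[.\s]*p[.\s]*z[.\s]*o[.\s]*o\.?"i   # `i` flag: ignore case

s = "s Sp. Z o.O. afdasf"

println(occursin(sensitive, s))
println(occursin(insensitive, s))
println(match(insensitive, s).match)   # the matched text keeps its original case
```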

Big help and a good lesson, thanks!

Too many lines… I need only the first 5 lines.

rx=r"\b\d{2}-\d{3}\b"

julia> baza1[occursin.(rx,baza1)]
1077386-element Array{String,1}:
 "45-367"
 "45-367"
 "a 45-367"
 "0 45-367 0"
 "a 45-367 b"
 "tel. 91-321 28 81"
 "58-531 54 52"
 "58-531 54 52"
 "58-531 54 52"
 "58-531 54 52"
 "58-531 54 52"
 "91-321 28 81"
 "12-289 13"
 "12-289 13 31"
 "12-289 13 32"
 "12-289 13 31"
 "12-289 13 31"
 "67-286 24 80"
 "12-289 13 31"
...

The regex below matches only if the dd-ddd is preceded by at most one arbitrary character plus a space, and/or followed by at most a space plus one arbitrary character.

rx=r"^(. )?\d{2}-\d{3}( .)?$"

At least, this is the pattern I have seen. In your initial regex you assumed that the extremities could only contain base-10 digits (this is what \d means), but among the first 5 lines there are lines in which the extremities are letters like a and b (should this be hex?).
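A sketch of how the anchored pattern could be used to filter a vector. The vector sample is a small illustrative subset of the lines shown in the question, not the real baza1:

```julia
rx = r"^(. )?\d{2}-\d{3}( .)?$"

# Illustrative sample taken from the output in the question.
sample = ["45-367",
          "a 45-367",
          "0 45-367 0",
          "a 45-367 b",
          "tel. 91-321 28 81",
          "58-531 54 52",
          "12-289 13"]

# Keep only the strings that consist of the dd-ddd code with at most
# one character and a space on either side.
kept = filter(s -> occursin(rx, s), sample)
println(kept)
```

The phone-number-like lines are rejected because more than a space and one character follows the dd-ddd part.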

On 2020-09-06 at 16:53, Henrique Becker via JuliaLang wrote:

Not hex! All the data are just UTF-8 strings.

Paul

(I moved this to Offtopic since the discussion is about constructing regexes, not Julia code.)


That… was not what I meant. I was asking if you considered a and b to be digits (as you were trying to match them with \d) because the numbers were in base 16 instead of base 10.