What is regex?

string-dot-byte · September 7, 2020, 6:25pm

Hello,

I have come across many programming terms involving regex, and I am unsure what it is.

I have searched this forum and Google but have not found a good explanation of the subject.

What I understood

Regex is some sort of string manipulation.

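using Printf  # standard library module that provides the @printf macro

# The arguments follow the order of the format tokens: %s, %f, %d
@printf("%s\nThis is a float: %f\nThis is an integer: %d", "This is a string", 9.4, 4)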

Although some say I am incorrect and regex is something way different than what I understood.

If you could point me to useful documentation on regex, that would be great, preferably something explaining its wide use in programming.

Henrique_Becker · September 7, 2020, 6:43pm

Regex is an abbreviation of “Regular Expression”; the term comes from the theory of formal languages.

The example you give with printf has nothing to do with regex as far as I know.

Formally, a regular expression defines a language, that is, a finite or infinite set of “words”: all the strings that belong to that language or, as it is often said, “match” the regex.

One example is:

julia> regex = r"[1-9]+"
r"[1-9]+"

julia> m = match(regex, "abc 123 cde")
RegexMatch("123")

julia> m.match
"123"

The regex defines the language of all “words”/strings made of one or more digits from one to nine (i.e., the positive base-10 integers that contain no zero digit; note this is an infinite set). To allow zeros as well, you would write [0-9]+ instead. You can then use it to find the first (or all) parts of a string that match that regular expression.
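To illustrate the "first (or all)" part, here is a minimal sketch using match and eachmatch from Julia's Base. The sample text and the variable names are made up for illustration.

```julia
# Sketch: first match vs. all matches (the sample text is invented).
text = "abc 123 cde 4056"

# match returns only the first hit (or nothing if there is none).
first_hit = match(r"[1-9]+", text).match

# eachmatch iterates over every non-overlapping hit.
nonzero_runs = [m.match for m in eachmatch(r"[1-9]+", text)]

# [0-9]+ also accepts the digit zero, so "4056" stays in one piece.
digit_runs = [m.match for m in eachmatch(r"[0-9]+", text)]
```

With this text, [1-9]+ finds 123, 4, and 56 (the zero splits the last number), while [0-9]+ finds 123 and 4056.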

Regular expressions are often used to parse text that was written to be understood by humans rather than parsed by computers, but that still follows a strict enough pattern for some information to be extracted from it automatically. For example, a script can search the text output of another program/script and use a value it finds there in its own computations.
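A rough sketch of that last use case, assuming some other program prints a line like "Elapsed time: 12.5 seconds". The output format and the function name elapsed_seconds are hypothetical, chosen only for the example.

```julia
# Sketch: pull a number out of human-oriented program output.
# The "Elapsed time: ... seconds" format is an assumption for this example.
function elapsed_seconds(output::AbstractString)
    # The parentheses capture the number; (?:\.[0-9]+)? makes the decimals optional.
    m = match(r"Elapsed time: ([0-9]+(?:\.[0-9]+)?) seconds", output)
    m === nothing && return nothing   # the pattern was not found
    return parse(Float64, m.captures[1])
end

report = "Solver finished.\nElapsed time: 12.5 seconds\nIterations: 42"
seconds = elapsed_seconds(report)
```

Returning nothing when the line is missing lets the calling script decide what to do, rather than failing inside the parsing step.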

Henrique_Becker · September 7, 2020, 6:45pm

I have recategorized your question as off-topic because Regex is not a Julia-exclusive concept.

This is the Julia manual section about regular expressions.

string-dot-byte · September 7, 2020, 6:54pm

The example you give with printf has nothing to do with regex as far as I know.

Well, I thought something like match("^%d+$") would fall in the category.

So that means it’s used for matching strings, or finding strings within another, in a simpler way? If I had a long script and were trying to find a certain keyword placed next to another, regex would come in useful?

Example trying to compare a digit with another?

I have recategorized your question as off-topic because Regex is not a Julia-exclusive concept.

Which means regular expressions are not present in scripting languages? As far as I know most scripting languages do not have regex.

Henrique_Becker · September 7, 2020, 7:09pm

Did you mean the opposite? Because Python, Ruby, JavaScript, and Perl all have RegEx in their standard libraries. Could you point to a mainstream scripting language of the last 20 years that does not have RegEx in its standard library?

Yes, you could use regular expressions in such a case. However, the number of different formats for printf is relatively limited, their structure is very simple, and each of them basically needs to be treated in a different way, so I am not sure whether that code uses RegEx internally. In any case, it is not an example of the use of RegEx, but of a standard library method that may be using RegEx internally.

It is a compact way and, after you learn/master them, a very easy way to search for certain patterns. However, they cannot represent every possible pattern (the theory of formal languages specifies very clearly what their limitations are), and it is common to find people criticizing their readability (I often write comments above them explaining them because, otherwise, months later I will have a hard time remembering what they were supposed to parse).
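As an illustration of the commenting habit, here is a small sketch of a regex with an explanation written above it. The date format and the group names year, month, and day are my own choices for the example; named groups are just one way to make the pattern easier to read later.

```julia
# Matches an ISO-style date such as 2020-09-07:
#   (?<year>\d{4})   four digits, captured as "year"
#   -                a literal hyphen
#   (?<month>\d{2})  two digits, captured as "month"
#   -                a literal hyphen
#   (?<day>\d{2})    two digits, captured as "day"
date_re = r"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})"

m = match(date_re, "posted on 2020-09-07 at 18:25")

# Named groups can be looked up by name instead of by position.
year = parse(Int, m[:year])
month = parse(Int, m[:month])
day = parse(Int, m[:day])
```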

Sorry, I am not sure I understand which example you are asking me to do.

string-dot-byte · September 7, 2020, 7:55pm

As far as I know most scripting languages do not have regex.

There’s a difference between programming and scripting languages. That is why I said scripting languages. I also do not think JavaScript or Ruby fall in that category.

I’ll try and explain this simply

Scripting    |   Programming
----------------------------
Interpreted  |   Compiled
Lightweight  |   Bloated
Relies on C  |   Relies on asm
Weakly Typed |   Strongly Typed
Runs With    |   Runs Alone

Code written in scripting languages can often be compiled into a native executable (either JIT-compiled or AOT-compiled)

JIT is another story, though; I don’t want to get into it because I don’t know much about it.

I might be wrong, feel free to correct me.

An example would be Lua: it does not have regex but uses another system, similar to Julia’s if I’m correct. I’m still new to Julia and unsure how it works.

Anyway, I appreciate the reply, and thanks for clarifying my uncertainty. Forget that last statement of mine that you quoted in your post.

Henrique_Becker · September 7, 2020, 8:26pm

I do not think your definition of scripting languages is widely used.

I opened the first five Google results for “programming languages vs scripting languages”, and they agree that the main distinction is that scripting languages are either interpreted or “do not require an explicit compilation step” (Julia would end up in this camp). One result criticized the use of the terms and said they were a mess, but basically made the same point as the others. I also do not like the terms. It makes more sense to me to talk about dynamically typed or interpreted languages; these terms are better defined.

By this definition JavaScript and Ruby are scripting languages; by your own definition JavaScript is a scripting language and Ruby is somewhere in the middle.

Lua has patterns that are basically regular expressions, but they do not implement all of the POSIX standard (to keep the language lightweight), use their own syntax at some points, and insist on using the term “pattern” everywhere, probably just to avoid people pointing out that they have not implemented this or that.

Julia uses a flavor of regular expressions that comes from Perl, which is also not exactly the POSIX standard but is very well known. If I am not wrong, it is more powerful than regular expressions in the formal sense (or the standard POSIX ones at least), and Julia does not shy away from calling them regular expressions.
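One concrete case of that extra power is the backreference, which lets a pattern refer to whatever an earlier group matched; formal regular expressions have no such construct. A minimal sketch, with a made-up sentence:

```julia
# Sketch: a backreference, a Perl-style feature that goes beyond
# regular expressions in the formal-language sense.
# \b     word boundary
# (\w+)  a word, captured as group 1
# \1     exactly the same text that group 1 matched
repeated_word = r"\b(\w+) \1\b"

m = match(repeated_word, "this is is a typo")
doubled = m.captures[1]   # the word that appears twice in a row
```

Here the match is "is is", because \1 must repeat the captured word exactly.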

Tamas_Papp · September 8, 2020, 6:59am

I am pretty surprised by this, as the Wikipedia page seems to be in the top 3 hits on most search engines for regex. In any case, just read

I think that this categorization is not very meaningful these days.

In any case, regular expressions are supported by all modern languages in some form (either in the core language, or the library), so it is also not very relevant to this topic.

xiaodai · September 8, 2020, 7:02am

Yeah, so regex is definitely not a Julia-exclusive concept, so Google would be a better option here.

Including a “Let me google that for you” like this is usually considered rude, so I won’t do it officially.

apo383 · September 9, 2020, 7:34pm

I believe I see where OP is coming from, remarking on matching. I think it’s quite understandable to wonder how printf and a regexp are similar or different. The answers given already are correct, but I hope to provide a more basic overview.

It might help to distinguish between substitution patterns and regexp. A substitution pattern is a way to communicate a context switch. Your example

@printf("%s\nThis is a float: %f\nThis is an integer: %d", "This is a string", 9.4, 4)

has a string that should mostly be printed literally, but also has some tokens like %s, %f, %d that mean “don’t print this literally, but instead substitute each token with the arguments following this string,” which should be a string, a float, and an integer. The tokens represent a switch from literal text to something else.
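One way to see the difference is that the two work in opposite directions: a format string builds text out of values, while a regexp takes text apart again. A minimal sketch, assuming a made-up sentence layout; @sprintf comes from the Printf standard library and returns the string instead of printing it.

```julia
using Printf

# Substitution: values go in, text comes out.
line = @sprintf("%s has %d items costing %.2f", "cart", 3, 9.5)

# Regexp: text goes in, the pieces come out as captured substrings.
m = match(r"(\w+) has (\d+) items costing (\d+\.\d+)", line)

name = m.captures[1]
count = parse(Int, m.captures[2])
price = parse(Float64, m.captures[3])
```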

A regexp has some similar features, in that there are tokens that switch context to something else. A close relative is the shell wildcard: the command ls *.txt or dir *.txt says, list all the file (or directory) names that end with literal .txt, but start with anything. The * means “match anything here” (strictly speaking this is a shell glob rather than a regexp, where the equivalent would be .*, but the idea is the same). There is a whole language of fancy things that could be matched, as indicated by @Henrique_Becker.
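To make the ls *.txt idea concrete, here is a minimal Julia sketch that does the same filtering with a regexp. The file names are invented, and note that in regexp syntax the "anything" part is written .* and the dot before txt has to be escaped.

```julia
# Sketch: the regexp counterpart of the wildcard *.txt (file names are made up).
files = ["notes.txt", "data.csv", "todo.txt", "txt_backup.zip"]

# ^      start of the name
# .*     any characters, any number of them
# \.txt  a literal ".txt"
# $      end of the name
txt_files = filter(name -> occursin(r"^.*\.txt$", name), files)
```

Only notes.txt and todo.txt pass; txt_backup.zip contains "txt" but does not end with ".txt".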

The question of scripting vs. “programming” language is interesting. There is no hard distinction to be made, but one could over-generalize that regexps are somewhat more applicable to scripting, in that scripts are often written quickly and casually, and are often meant to interact with files and such where there may be a pattern to their names. Many a scripter will quickly whip up a regexp to get something done.

In “real programming,” one might be more cautious with regexps, because there is a lot of opportunity for things to go horribly wrong and create security holes and such. One would either avoid regexps or test them carefully. But I am sure there are many counterexamples to this generalization, plus there is no definition of real vs. casual programming.

pixel27 · September 10, 2020, 1:02pm

I find this kind of interesting. I have found that regex provides a nice way of validating input provided by a user. You can ensure that the input is the correct length (min/max) and that it contains the correct characters or even the correct mix of characters.
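A small sketch of what I mean, in Julia. The rules themselves (3 to 16 characters for a user name, at least 8 characters with a letter and a digit for a password) and the function names are made up for the example, not any kind of standard.

```julia
# Sketch: input validation with regex (the rules are illustrative assumptions).

# Length and allowed characters: 3 to 16 letters, digits, or underscores.
is_valid_username(s) = occursin(r"^[A-Za-z0-9_]{3,16}$", s)

# Mix of characters: at least 8 characters, with at least one letter
# and at least one digit. Each (?=...) is a lookahead that checks a
# condition without consuming any input.
is_valid_password(s) = occursin(r"^(?=.*[A-Za-z])(?=.*[0-9]).{8,}$", s)

username_checks = [is_valid_username(s) for s in ["pixel27", "ab", "has space"]]
password_checks = [is_valid_password(s) for s in ["abc12345", "abcdefgh", "a1"]]
```

The ^ and $ anchors matter here: without them the pattern would only need to match somewhere inside the input, so longer or partly invalid strings would pass.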
