Technically not a regular expression, as (intuitively) the backreference has to “count” the number of 1’s that were matched. Regular expressions are equivalent to finite state machines, which cannot count arbitrarily high (any state machine that can count arbitrarily high needs an infinite number of states, since there are infinitely many natural numbers).
The 2022 discussion has a more formal proof using the pumping lemma.
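For the curious, here is a sketch of the usual pumping-lemma argument (not necessarily the exact proof given in that thread): suppose L = { 1^p : p prime } were regular, with pumping length n. Pick a prime p >= n and split 1^p = xyz with |xy| <= n and |y| = k >= 1. The lemma says x y^i z is in L for every i >= 0; taking i = p + 1 yields a string of length p + p*k = p*(k + 1), which is composite because both factors are at least 2. So the pumped string cannot be in L, a contradiction, and L cannot be regular.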
Fun fact about the pumping lemma - my best friend's grandparent worked on it (Eli Shamir). When we were taught it during my BSc, the TA, who knew he was the grandson of one of the inventors, asked during the lecture where he was and whether he could say something about his grandfather, and it turned out he had played hockey from the class to work on something else. Embarrassing!
The pumping lemma is a great milestone in this field, so simple, yet so powerful.
> The TA [...] asked during the lecture where he was, if he could say something about his grandfather, it turns out he played [hooky] from the class to work on something else, embarrassing!
If the "working on something else" means "working on some hard research problem", this is not embarrassing, but stylish. :-)
When programmers say regular expression they usually mean Perl Compatible Regular Expressions (PCRE). Or I guess PCRE-compatible expressions, since there are other implementations of the same syntax out there. If you go back far enough in time they were once regular expressions in the computer science sense, but then features got added, and now they are incredibly powerful but can take basically infinite memory and time.
Although general PCRE syntax has clearly won, that’s distinct from the infinite-memory-and-time matter. Engines that don’t use backtracking and can search in linear time (O(regex-size × search-text-size)) have been making a serious comeback in recent years, at only the cost of not supporting lookaround assertions or backreferences. For example, the re2 library, or Rust’s regex crate, both of which are very popular.
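For anyone curious what that linear-time approach looks like concretely, here is a toy sketch (mine, in Python rather than re2 or the regex crate) of the state-set simulation such engines are built on. The NFA is hand-built for the pattern (a|b)*abb; each input character is processed once against the current set of states, so the total work is O(number of states × text length) and no backtracking ever happens.

```python
# Hand-built NFA for (a|b)*abb (a toy example, not how re2 parses patterns).
# Transitions: state 0 loops on a/b, 0 -a-> 1, 1 -b-> 2, 2 -b-> 3 (accepting).
NFA = {
    (0, 'a'): {0, 1},
    (0, 'b'): {0},
    (1, 'b'): {2},
    (2, 'b'): {3},
}
START, ACCEPTING = {0}, {3}

def matches(text: str) -> bool:
    """Track the set of states the NFA could be in after each character."""
    states = set(START)
    for ch in text:
        states = {nxt for s in states for nxt in NFA.get((s, ch), ())}
        if not states:               # no state can continue: reject early
            return False
    return bool(states & ACCEPTING)

print(matches("ababb"), matches("abba"))   # True False
```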
Alas, that approach only works well for basic regular expressions that support the Kleene star, concatenation, and union.
Regular languages are also closed under intersection, taking prefixes, etc. (see https://en.wikipedia.org/wiki/Regular_language#Closure_prope...), and if you want to expose those operators in your regular expressions, Russ Cox's technique doesn't really work.
I suspect it should be possible to transform two intersected (K*,C,U) expressions into a new (K*,C,U) expression by walking over the two, dropping incompatible terms, reducing sets, and redistributing and expanding sub-expressions to combine the two, though I do not have time to take a serious look at this at the moment.
I also suspect that this might result in a combinatorial explosion of fused terms, but that's irrelevant as we're only worried about whether intersection can be implemented in Russ Cox's method, not whether a regular expression might require several earths or so worth of RAM to hold :)
Nothing in the wiki article mentions taking a prefix, but that just sounds like `(prefix|)`?
You can still explicitly build the finite automaton for your closure-extended regular language. That one might be big, but matching is still linear in the length of the text you are searching through.
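To make that concrete, here is a small sketch of the textbook product construction (my own toy code, unrelated to any particular engine's internals): given DFAs for two regular languages, the DFA for their intersection runs both in lockstep, and matching against the result is still a single linear pass over the text.

```python
# A DFA here is (start_state, transitions, accepting_states) over the
# alphabet {'a', 'b'}.  Example languages (mine): "even number of a's"
# and "ends with b".
even_as = (0, {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}, {0})
ends_b  = (0, {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}, {1})

def intersect(dfa1, dfa2):
    """Product construction: each product state is a pair of original states."""
    (s1, t1, f1), (s2, t2, f2) = dfa1, dfa2
    trans = {}
    for (p, ch), q in t1.items():
        for (r, ch2), u in t2.items():
            if ch == ch2:
                trans[((p, r), ch)] = (q, u)
    return ((s1, s2), trans, {(p, r) for p in f1 for r in f2})

def accepts(dfa, text):
    """One pass over the text: linear time, no backtracking."""
    state, trans, accepting = dfa
    for ch in text:
        state = trans[(state, ch)]
    return state in accepting

both = intersect(even_as, ends_b)
print(accepts(both, "abab"), accepts(both, "aba"))   # True False
```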
> Nothing in the wiki article mentions taking a prefix, but that just sounds like `(prefix|)`?
They implicitly mention prefixes, when they talk about 'the trio operations: string homomorphism, inverse string homomorphism, and intersection with regular languages.' If I remember right, the prefix operation is:
Take a language L, restrict it to the strings that have the right prefix: `L' := L & (prefix.*)` then chop off that prefix from every member of L'.
> and now they are incredibly powerful but can take basically infinite memory and time.
Which is why the "RE" in the article is excruciatingly slow, given that it needs to perform insane amounts of backtracking. In contrast, "real" regular expression checkers run in linear time.
Also, nitpicking, but they take unbounded time. It's still finite.
The inflationary use of "infinite" especially in tech crowds that should know better deserves pushback.
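To see the backtracking blow-up mentioned above for yourself, here is a quick toy demonstration (mine, not the article's regex): Python's re module is a backtracking engine, and the nested quantifier below forces it to try a number of splits that grows roughly exponentially with the input length before it can report a failed match, whereas a linear-time engine rejects the same string in one pass.

```python
import re
import time

pattern = re.compile(r'^(a+)+b$')   # classic pathological nested quantifier

for n in range(16, 25, 2):
    text = 'a' * n + 'c'            # the trailing 'c' guarantees the match fails
    start = time.perf_counter()
    pattern.match(text)
    print(n, round(time.perf_counter() - start, 3))   # watch the time climb
```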
There are models of computation that actually model infinite computation. A well-studied one is Büchi automata, basically a generalization of finite state automata to infinite sequences.
> And I’m not sure about unbounded; surely it is bounded by a suitable exponential function?
Nope. You can nest them, and for any function you give, you can produce an RE that takes longer than that function to evaluate. (That's literally what "unbounded" means: there is no bound. That doesn't mean it's "infinite". All those computations terminate, which is literally what "finite" means in this context.)
Funnily enough, the original POSIX regular expressions (BRE) were strictly less powerful than regular expressions (in the CS sense), lacking a general alternation operator (though one was provided in ERE).
Come now, every time a colleague says they parsed a file with a regex in JavaScript, do you turn to them and tell them "well, you didn't actually use a _regular_ expression..."?
It's true, the article doesn't make that claim. We think of primes as being mysterious and out of reach, and we're conditioned to think "mathematical / computational breakthrough" when we read the words "prime number" in a headline, but this is only because we've found all of them that aren't huge enough to be mysterious. Computing prime numbers is trivial as long as you don't mind them being the same ones everybody else already found.
Not sure what you mean by "trivial". A simple prime-finding algorithm is easy to write down, but it will be very inefficient, especially for large primes (the very ones you use in cryptography).
Coming up with an efficient prime-finding algorithm like Miller-Rabin* is far from trivial.
* Technically it's just a prime-checking algorithm, but you can just generate random numbers until you've verified one of them is prime.
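For reference, a compact sketch of that approach (my own toy Python, not production crypto code; the function names and the round count are arbitrary choices):

```python
import random

def is_probably_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin probabilistic primality test (sketch)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:               # write n - 1 as d * 2^s with d odd
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False            # a is a witness: n is definitely composite
    return True                     # no witness found: n is almost surely prime

def random_prime(bits: int = 512) -> int:
    """Generate random odd candidates of the right size until one passes."""
    while True:
        n = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probably_prime(n):
            return n
```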
I do find it useful to know that regular expressions in JS (and many other languages, really) are not actually regular, and why - it allows one to decide whether a given problem can be solved at all using them.
> Regular expressions are equivalent to finite state machines, which cannot count arbitrarily high (any state machine that can count arbitrarily high needs an infinite number of states, since there are infinitely many natural numbers).
The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high.
Informally, regular expressions like in Perl are a kind of string rewriting. And many string rewriting rules, if they can be iteratively applied, are Turing-complete (with the usual caveat about infinite storage and time). The usual definition of a regular expression excludes that kind of iterative application, though. It's just a matching pattern; rewriting is technically outside of the concept of a regular expression.
> The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high.
Right. And if you really want to push that argument further then there are numbers for which your machine can't decide whether they are prime or not. Most of them, in fact, i.e., for all but a finite number of primes and non-primes, your machine can't tell.
For the theory of computability, this is not a useful model. Neither is it for complexity theory. Statements like "my machine can sort in constant time" are arguably true but not useful either: they only hold up to a certain collection size, which in a sense makes them both nonsensical and useless. Even concepts like the Chomsky hierarchy, with context-free languages as one of its prominent levels, make little sense as long as your computational model is finite-state.
Instead, we model them as unbounded, to be able to differentiate between classes of computation problems as well as reason about complexity. The Turing Machine is one of the simple and elegant models, but there are others for other purposes.
I wish that basic education about the theory of computation would stress more why theory treats things like this, to avoid this kind of confusion. (And then people would need to take those classes as well, and not just claim they can write React apps just fine without a degree.)
> And then people would need to take those classes as well, and not just claim they can write React apps just fine without a degree.
Weird tangent. They can write React apps without a degree, can't they? What's wrong with that?
The amount of complexity reasoning you need for something like that is around "this part seems slow, maybe there's a better algorithm somewhere" or "that's a lot of nested loops, maybe I can do with less". Knowing more might help in some pretty rare cases but doesn't seem like a requirement. On the other hand, knowing the complexity of all algorithms and blindly assuming "good complexity = good performance" while never checking is a recipe for disaster.
> On the other hand, knowing the complexity of all algorithms and blindly assuming "good complexity = good performance" while never checking is a recipe for disaster.
Exactly. That's the difference between an actual education and having looked up some Big-O notation on Wikipedia.
> The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high
Come on, can we drop this useless trivia that floats around? No, computers are Turing machines for all practical purposes, and no, the memory is not the tape, unless your computer is sandboxed from everything. As soon as it has some side-effect channels, its “tape” can be as large as needed. For example, it can use the internet to store petabytes of state. But if we want to be more pedantic, it can control and observe physical things, making the whole universe its state space.
Which is still finite, so yeah, even more pedantically you are right, due to mathematical infinity not existing; but the defining resource of a Turing machine is “lazy”: if you don’t hit the limit, you may as well reason as if it were a Turing machine.
> The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high.
I'm not sure how that's relevant. Even given infinite space, there still wouldn't exist a Regular Expression (the CS kind) you could write down in finite time that would match only primes.
At least IIRC, it's been a 20+ years since I last took a Discrete Math course :)
> At least IIRC, it's been a 20+ years since I last took a Discrete Math course :)
Your memory is correct, and there is a very simple intuition for this. Regular languages are those that can be matched by a machine with bounded memory. So, let's say that bound is n bits. Then the machine has at most 2^n distinct states, so by pigeonhole it cannot tell apart unary strings longer than 2^n: two different lengths must leave it in the same state, after which it behaves identically for both, so it cannot classify both correctly.
> Even given infinite space, there still wouldn't exist a Regular Expression (the CS kind) you could write down in finite time that would match only primes.
The point is, neither would there exist a non-(Regular Expression) method that could do such, if you're running it on a finite state machine (your computer).
But that argument is not interesting. "We can't count arbitrarily high because the universe is finite." Yeah, duh. (Observable universe, for the pedants.)
It's more interesting to ask: suppose you did have a magical stick of RAM that had infinite storage, and suppose you augmented your computer to store and query that memory. Then -- yes -- your computer could determine whether any number, however large, is prime. An FSM could not.
If you have an infinite stick of RAM, regular expressions wouldn't be FSMs either; they could have an infinite state space. They would terminate in finite time that is exponential in the size of the prime, but would nevertheless still halt.
A finite state machine can’t have an infinite state space. Also, you can write a finite Turing-machine program that does exactly that, but you can’t write a finite description of an “infinite” state space, which is the whole point.
Ad absurdum, everything would be computable (recognizable) by FSMs, simply by listing all the infinitely many possibilities: listOfAllPrimes.concat(“|”), here you are.
The point being made is that “regular expressions”, in the original CS meaning, do not have stack space or anything analogous. “Regular expressions” in modern programming practice are often different and more powerful, of course.
There is; the property of being a Turing machine would only be thrown out when you actually hit the limit. Besides computers being able to modify state not only in their own memory but also in, e.g., outside memory, I can also look at my computer as a lazily-taped Turing machine that chugs along writing only to its memory, and when it is about to run off the end it pauses and waits for me to add more memory.
This is just useless pedantry that diminishes Chomsky's very important categorization.
Personally, I use "regex" for the informal string-matching/rewriting engines found in most languages and "regular expression" for the formal definition.
I don't think the GP is the original poster, so them claiming how they "personally" differentiate between the two does not make the submission title a lie.
A finite, self-referential algorithm that checks all numbers for being prime does exist. Running it on a finite computer will eventually hit the bounds of the computer, but that is a limitation outside of the algorithm.
A finite regular expression that checks all numbers for being prime cannot exist already in principle.
Technically, a computer is also just a large finite state machine. Which suggests computability theory is unable to accurately account for the intuitive difference between computers and simpler automatons. I suspect it has something to do with the computational time/space complexity of the algorithms they are able to execute.
Very cool!
The top answer has a 12,731-character regular expression that matches base-10 numbers that are divisible by 7. (There is an answer with only 105 characters, but that uses .NET "regular expressions", which have some pretty gnarly extensions and are definitely not regular languages).
The regex itself is generated from a relatively simple DFA based on mod 7 arithmetic on digits using a program (JFLAP apparently, not familiar).
Just from looking at the start of the regex you can see some cool things. For instance, TIL that all numbers of the form 1555....55554 and 85555....5554 are multiples of 7. Fun exercise to prove it.
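Both observations are easy to check in a few lines of Python (a quick sketch of the same mod-7 idea, not the JFLAP-generated automaton): the DFA's state is just the running remainder mod 7, and the 15…54 / 85…54 patterns can be spot-checked directly.

```python
def divisible_by_7(digits: str) -> bool:
    """DFA with states 0..6: read decimal digits left to right, track the remainder."""
    state = 0
    for d in digits:
        state = (state * 10 + int(d)) % 7   # the DFA's transition function
    return state == 0                       # accepting state: remainder 0

for k in range(12):
    assert divisible_by_7('1' + '5' * k + '4')   # 14, 154, 1554, 15554, ...
    assert divisible_by_7('8' + '5' * k + '4')   # 84, 854, 8554, 85554, ...
# e.g. 1554 = 7 * 222 and 8554 = 7 * 1222
```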
I'm the guy who wrote the original implementation that the question mentions. I have no clue who 'Charles' who asked the question is, or how he found my toy implementation.
Could the AKS primality test, which showed that PRIMES is in P, be implemented as an honest regular expression? Presumably not, because then we'd know that PRIMES is not just in P but is a regular language, which would be too good to be true.
Sorry, but how is this different from pointing to three apples and explaining to someone that this is what three is? Would you say it's not three because they are apples? I think it does not really matter how you represent the number. After all, all you see on your screen is "fake"; it's just binary under the hood.
The problem with the unary numeral system is that numerals grow linearly with value, while for binary (ternary, etc.) they only grow logarithmically. So they need much less space and can be read and written much faster. An algorithm which relies on unary numerals probably has worse computational complexity than one that uses binary or higher.
Actually, an algorithm working on unary input tends to have better computational complexity than an algorithm working on binary input: an algorithm that is polynomial in the numeric value will be a polynomial algorithm on unary input, but an exponential algorithm on binary input. Such algorithms are usually called pseudo-polynomial[0]; indeed, this kind of primality testing is pseudo-polynomial.
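A concrete example of the distinction (a sketch under the usual textbook framing, with trial division as a stand-in): the loop below does work proportional to sqrt(n), i.e. polynomial in the numeric value of n, which is polynomial in the length of a unary encoding of n but exponential in the length of its binary encoding.

```python
def is_prime_trial_division(n: int) -> bool:
    """Pseudo-polynomial primality check: ~sqrt(n) iterations in the worst case."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:   # sqrt(n) steps: poly in the value, exp in the bit length
        if n % d == 0:
            return False
        d += 1
    return True
```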
> an algorithm that is polynomial in the numeric value will be a polynomial algorithm on unary input, but an exponential algorithm on binary input.
That's comparing apples to oranges, because the two input types have vastly different lengths relative to their value. Unary input length itself grows exponentially with binary input length, for the same numeric value, cancelling out the unary advantage. So unary isn't faster than binary.
In fact, I think it's pretty clear that unary representation is slower and takes more space. Just think about the rough number of steps you would need to add or even multiply two very large numbers, e.g. in a Turing machine. It would obviously be vastly more if the numbers are given in unary rather than in binary.
> That's comparing apples to oranges, because the two input types have vastly different lengths relative to their value.
It really is not, because complexity is fundamentally a function of _the length of the input_ (specifically, it measures how the runtime (or space usage) grows with growing input). If you have longer input, your Turing machine can spend more time to compute its answer. Also note that this assumes that _the input_ is encoded in unary; if you get binary input and need to spend time and space to convert it to unary representation, sure, that will lead to an exponential blow-up.
Edit:
> Just think about the rough number of steps you would need to add or even multiply two very large numbers, e.g. in a Turing machine. It would be obviously vastly more if the numbers are given in unary rather than in binary.
First, note that those are not actually pseudo-polynomial, as the number of steps needed for addition (or multiplication) depends on the number of digits, not the numeric values involved. Yet, even here unary encoding _does not have worse complexity_, since you still take polynomially many steps in the length of the input (i.e., the length of the unary encodings of the input values). Yes, all the inputs will be exponentially larger, so in practice it's certainly not the better algorithm, but _the complexity_ is not worse, since that's only concerned with the asymptotic behaviour.
When we deal with algorithms involving numbers, we are interested in actual numbers, not in their numeral representation. It would be absurd to compare unary and binary addition relative to the same numeral length. We are ultimately interested in numbers, not numerals. Adding one million to one million is simply much faster for binary addition than for unary addition, even if unary addition has better complexity relative to its own encoding than binary addition has to its encoding. We are interested in complexity relative to the numerical value, not in the length of their respective encodings.
Edit: By the way, as the Wikipedia piece notes, the binary addition algorithm is O(log(n)) in time _relative to the value_, while unary addition is presumably O(n) relative to the value (just writing the numeral out alone takes O(n) steps). So the time complexity is in fact better. Probably the same holds for space complexity. Moreover, only in this case does a comparison even make sense, since we are comparing the same values in both cases, while for the numeral case we would be comparing the lengths of two different types of numerals, apples to oranges.
> It would be absurd to compare unary and binary addition relative to the same numeral length.
Why? The Turing machine has no concept of numeric values, it only knows about the length of whatever the input is.
> Adding one million to one million is simply much faster for binary addition than for unary addition, even if unary addition has better complexity relative to its own encoding than binary addition has to its encoding.
From a complexity standpoint, adding one million to one million is O(1), regardless of the encoding.
> We are interested in complexity relative to the numerical value, not in the length of their respective encodings.
But the complexity relative to the numerical value is not even well-defined, since it _depends_ on the choice of the encoding.
> By the way, as the Wikipedia piece notices, the binary addition algorithm is O(log(n)) in time _relative to the value_, while unary addition is presumably O(n) relative to the value (just writing the numeral out alone takes O(n) steps). So the time complexity is in fact better.
It really is not better. Time complexity is _always_ measured relative to the length of the input, and for binary encoding, the length of the input is O(log(n)), so, taking O(log(n)) steps for the addition is linear, same as the linear time needed for adding numbers in unary encoding (which is basically just copying the input to the output).
> Moreover, only in this case makes a comparison even sense, since we are comparing the same values in both cases, while for the numeral case we would be comparing the lengths of two different types of numerals, apples to oranges.
But _the numeral is not the input_, its _encoding_ is. This is really the whole point, and it is _precisely_ why you get different complexities for different encodings.
> Why? The Turing machine has no concept of numeric values, it only knows about the length of whatever the input is.
That's irrelevant. We are interested in addition, or multiplication, or whatever, which is an operation between numbers (values), and it can be faster or slower depending on which numerals (encoding) are used.
> But the complexity relative to the numerical value is not even well-defined, since it _depends_ on the choice of the encoding.
Yes, it depends on the encoding; no, it is of course well-defined. You can measure two different algorithms which use two different encodings relative to the same thing, the value.
> complexity is _always_ measured relative to the length of the input
That's simply not true. There is even a Wikipedia article about it. Even if it calls it "pseudo".
> But _the numeral is not the input_, its _encoding_ is
But indeed, that isn't what three is; three is the concept that comes from saying "these are three apples", then "these are three pears", then "these are three cars", and then saying "three" is the conceptual thing they have in common.
And here, validating a string of ones is simply not what anyone would mean if you said "check whether this number is a prime number" without any qualification, so I agree that whilst still cute, the title is a bit click-baity and not at all what I expected it to be about either.
EDIT: it seems, though, that the original post, shared via archive.org elsewhere in the thread, does do this for what one would normally assume -- and does so by using `(1 x shift) !~` -- so if anyone else was confused, read the original post.
Think about it a different way: the regex thing can only deal with unary numbers, like your computer can only deal with binary. It's really no different. I'm not sure if you can write a decimal-to-unary converter in regex but I reckon it's possible. At least using the non-regular regex like the original prime program did.
> Regular expression matching is a crucial task in several networking applications. Current implementations are based on one of two types of finite state machines. Non-deterministic finite automata (NFAs) have minimal storage demand but have high memory bandwidth requirements. Deterministic finite automata (DFAs) exhibit low and deterministic memory bandwidth requirements at the cost of increased memory space. It has already been shown how the presence of wildcards and repetitions of large character classes can render DFAs and NFAs impractical. Additionally, recent security-oriented rule-sets include patterns with advanced features, namely back-references, which add to the expressive power of traditional regular expressions and cannot therefore be supported through classical finite automata.
> In this work, we propose and evaluate an extended finite automaton designed to address these shortcomings. First, the automaton provides an alternative approach to handle character repetitions that limits memory space and bandwidth requirements. Second, it supports back-references without the need for back-tracking in the input string. In our discussion of this proposal, we address practical implementation issues and evaluate the automaton on real-world rule-sets. To our knowledge, this is the first high-speed automaton that can accommodate all the Perl-compatible regular expressions present in the Snort network intrusion and detection system.
----
The awkward part is that a PCRE is demonstrably more powerful than a regular language, but not as powerful as a CFG (you can write CFGs that can't be matched by a PCRE - ([{}]) matching, a^nb^n and so on). So the question that gets interesting is "can every PCRE be matched by a CFG?"
> Note that the set of grammars corresponding to recursive languages is not a member of this hierarchy; these would be properly between Type-0 and Type-1.
So there's a class of languages between those of the Turing machine and those of the linear bounded Turing machine (the machine that matches the context-sensitive languages).
Is the PCRE something between Type-2 and Type-3? or is it a Type-3+I?
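One standard data point here (my example, not from the paper quoted above): a single backreference already matches the copy language { ww : w over {a, b} }, which no finite automaton recognizes and which is also a textbook example of a language that is not context-free.

```python
import re

# Matches exactly the strings that consist of some word w over {a, b}
# immediately followed by a second copy of the same w.
copy = re.compile(r'^([ab]+)\1$')

print(bool(copy.match('abab')),      # True  (w = "ab")
      bool(copy.match('aabbaabb')),  # True  (w = "aabb")
      bool(copy.match('abba')))      # False (not of the form ww)
```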
> Many extensions and variations on Chomsky's original hierarchy of formal grammars have been developed, both by linguists and by computer scientists, usually either in order to increase their expressive power or in order to make them easier to analyze or parse. ...
I encourage you to re-read the article: you check that the number does not match the regex:
> Is 7 prime?
>
> To know this, the function first generates “1111111” (from “1” * 7) and tries to
> see if that string does not match /^1?$|^(11+?)\1+$/. If there is no match, then
> the number is prime.
>
> Notice that the regular expression has two parts (separated with the vertical bar |).
>
> The first part, /^1?$/, is trivial and matches the beginning of line (^), an
> optional 1 (1?) and the end of line ($), which implies that it matches either the
> empty string or “1”. This simply indicates that calling that function with n==0
> or n==1 will correctly return false (as the “1” * n will match the first
> part of the regular expression)
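For anyone who wants to poke at it, the whole check also fits in a couple of lines of Python (a direct transliteration of the Perl one-liner; Python's re supports the same backreference syntax):

```python
import re

def is_prime(n: int) -> bool:
    # Prime iff the unary representation does NOT match:
    #   ^1?$         rejects 0 and 1
    #   ^(11+?)\1+$  matches any length that splits into two or more equal
    #                chunks of length >= 2, i.e. any composite length
    return not re.match(r'^1?$|^(11+?)\1+$', '1' * n)

print([n for n in range(30) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```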
I agree that the article formatting is a bit misleading.
In case anyone is wondering, GPT-4 provides the same solution. It takes some prodding, though, and you get treated to a lecture about how inefficient it is before it provides the solution.
"Apologies for any confusion before, and I understand the nature of your request now. It's a fun challenge to think about! While it's not practically useful or efficient, and it is actually quite limited, a creative theoretical solution exists that involves regular expressions. This specific solution only works for very small numbers due to the inherent limitations of regex. It's based on the principle that a number is prime if it cannot be expressed as a product of two smaller natural numbers. We represent the number as a repeated string of 1s, then check for repeating sub-patterns.
For instance, this Perl regex checks if a number, represented as unary (a string of 1s), is prime:
/^1?$|^(11+?)\1+$/
In this regex:
^1?$ checks for the case where you have zero or one '1', which represent the numbers 0 and 1, neither of which are prime.
^(11+?)\1+$ attempts to match the input with a repeating group of '1's. If a match is found, the number is not prime, because a repeating subgroup of '1's represents a divisor.
This pattern should be used in a negative sense: if a string of 1s (representing your number in unary form) matches this pattern, it's not prime. If it doesn't match, then it is prime.
Again, this is quite impractical for real use cases as the unary representation is extremely inefficient, and the computation grows quickly with larger numbers, but it's a neat way of showing how flexible and powerful regular expressions can be."