Technically not a regular expression, as (intuitively) the backreference has to “count” the number of 1’s that were matched. Regular expressions are equivalent to finite state machines, which cannot count arbitrarily high (any state machine that can count arbitrarily high needs an infinite number of states, since there are infinitely many natural numbers).
The 2022 discussion has a more formal proof using the pumping lemma.
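For the curious, here is a sketch of the usual pumping-lemma argument (not necessarily the exact proof given in that thread): suppose L = { 1^p : p prime } were regular, with pumping length n. Pick a prime p >= n and split 1^p = xyz with |xy| <= n and |y| = k >= 1. The lemma says x y^i z is in L for every i >= 0; taking i = p + 1 yields a string of length p + p*k = p*(k + 1), which is composite because both factors are at least 2. So the pumped string cannot be in L, a contradiction, and L cannot be regular.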
Fun fact about the pumping lemma - my best friend's grandparent worked on it (Eli Shamir). When we were taught it during my BSc, the TA, who knew he was the grandson of one of the inventors, asked during the lecture where he was and whether he could say something about his grandfather, and it turned out he had played hockey from the class to work on something else. Embarrassing!
The pumping lemma is a great milestone in this field, so simple, yet so powerful.
> The TA [...] asked during the lecture where he was, if he could say something about his grandfather, it turns out he played [hooky] from the class to work on something else, embarrassing!
If the "working on something else" means "working on some hard research problem", this is not embarrassing, but stylish. :-)
When programmers say regular expression they usually mean Perl Compatible Regular Expressions (PCRE). Or I guess PCRE-compatible expressions, since there are other implementations of the same syntax out there. If you go back far enough in time they were once regular expressions in the computer science sense, but then features got added, and now they are incredibly powerful but can take basically infinite memory and time.
Although general PCRE syntax has clearly won, that’s distinct from the infinite-memory-and-time matter. Engines that don’t use backtracking and can search in linear time (O(regex-size × search-text-size)) have been making a serious comeback in recent years, at only the cost of not supporting lookaround assertions or backreferences. For example, the re2 library, or Rust’s regex crate, both of which are very popular.
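For anyone curious what that linear-time approach looks like concretely, here is a toy sketch (mine, in Python rather than re2 or the regex crate) of the state-set simulation such engines are built on. The NFA is hand-built for the pattern (a|b)*abb; each input character is processed once against the current set of states, so the total work is O(number of states × text length) and no backtracking ever happens.

```python
# Hand-built NFA for (a|b)*abb (a toy example, not how re2 parses patterns).
# Transitions: state 0 loops on a/b, 0 -a-> 1, 1 -b-> 2, 2 -b-> 3 (accepting).
NFA = {
    (0, 'a'): {0, 1},
    (0, 'b'): {0},
    (1, 'b'): {2},
    (2, 'b'): {3},
}
START, ACCEPTING = {0}, {3}

def matches(text: str) -> bool:
    """Track the set of states the NFA could be in after each character."""
    states = set(START)
    for ch in text:
        states = {nxt for s in states for nxt in NFA.get((s, ch), ())}
        if not states:               # no state can continue: reject early
            return False
    return bool(states & ACCEPTING)

print(matches("ababb"), matches("abba"))   # True False
```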
Alas, that approach only works well for basic regular expressions that support the Kleene star, concatenation, and union.
Regular languages are also closed under intersection, taking prefixes, etc. (see https://en.wikipedia.org/wiki/Regular_language#Closure_prope...), and if you want to expose those operators in your regular expressions, Russ Cox's technique doesn't really work.
I suspect it should be possible to transform two intersected (K*,C,U) expressions into a new (K*,C,U) expression by walking over the two, dropping incompatible terms, reducing sets, and redistributing and expanding sub-expressions to combine the two, though I do not have time to take a serious look at this at the moment.
I also suspect that this might result in a combinatorial explosion of fused terms, but that's irrelevant as we're only worried about whether intersection can be implemented in Russ Cox's method, not whether a regular expression might require several earths or so worth of RAM to hold :)
Nothing in the wiki article mentions taking a prefix, but that just sounds like `(prefix|)`?
You can still explicitly build the finite automaton for your closure-extended regular language. That one might be big, but matching is still linear in the length of the text you are searching through.
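To make that concrete, here is a small sketch of the textbook product construction (my own toy code, unrelated to any particular engine's internals): given DFAs for two regular languages, the DFA for their intersection runs both in lockstep, and matching against the result is still a single linear pass over the text.

```python
# A DFA here is (start_state, transitions, accepting_states) over the
# alphabet {'a', 'b'}.  Example languages (mine): "even number of a's"
# and "ends with b".
even_as = (0, {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}, {0})
ends_b  = (0, {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}, {1})

def intersect(dfa1, dfa2):
    """Product construction: each product state is a pair of original states."""
    (s1, t1, f1), (s2, t2, f2) = dfa1, dfa2
    trans = {}
    for (p, ch), q in t1.items():
        for (r, ch2), u in t2.items():
            if ch == ch2:
                trans[((p, r), ch)] = (q, u)
    return ((s1, s2), trans, {(p, r) for p in f1 for r in f2})

def accepts(dfa, text):
    """One pass over the text: linear time, no backtracking."""
    state, trans, accepting = dfa
    for ch in text:
        state = trans[(state, ch)]
    return state in accepting

both = intersect(even_as, ends_b)
print(accepts(both, "abab"), accepts(both, "aba"))   # True False
```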
> Nothing in the wiki article mentions taking a prefix, but that just sounds like `(prefix|)`?
They implicitly mention prefixes, when they talk about 'the trio operations: string homomorphism, inverse string homomorphism, and intersection with regular languages.' If I remember right, the prefix operation is:
Take a language L, restrict it to the strings that have the right prefix: `L' := L & (prefix.*)` then chop off that prefix from every member of L'.
> and now they are incredibly powerful but can take basically infinite memory and time.
Which is why the "RE" in the article is excruciatingly slow, given that it needs to perform insane amounts of backtracking. In contrast, "real" regular expression checkers run in linear time.
Also, nitpicking, but they take unbounded time. It's still finite.
The inflationary use of "infinite" especially in tech crowds that should know better deserves pushback.
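To see the backtracking blow-up mentioned above for yourself, here is a quick toy demonstration (mine, not the article's regex): Python's re module is a backtracking engine, and the nested quantifier below forces it to try a number of splits that grows roughly exponentially with the input length before it can report a failed match, whereas a linear-time engine rejects the same string in one pass.

```python
import re
import time

pattern = re.compile(r'^(a+)+b$')   # classic pathological nested quantifier

for n in range(16, 25, 2):
    text = 'a' * n + 'c'            # the trailing 'c' guarantees the match fails
    start = time.perf_counter()
    pattern.match(text)
    print(n, round(time.perf_counter() - start, 3))   # watch the time climb
```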
There are models of computation that actually model infinite computation. A well-studied one is Büchi automata, basically a generalization of finite state automata to infinite sequences.
> And I’m not sure about unbounded; surely it is bounded by a suitable exponential function?
Nope. You can nest them, and for any function you give, you can produce an RE that takes longer than that function to evaluate. (That's literally what "unbounded" means: there is no bound. That doesn't mean it's "infinite". All those computations terminate, which is literally what "finite" means in this context.)
Funnily enough, the original POSIX regular expressions (BRE) were strictly less powerful than regular expressions (in the CS sense), lacking a general alternation operator (though one was provided in ERE).
Come now, every time a colleague says they parsed a file with a regex in JavaScript, do you turn to them and tell them "well, you didn't actually use a _regular_ expression..."?
It's true, the article doesn't make that claim. We think of primes as being mysterious and out of reach, and we're conditioned to think "mathematical / computational breakthrough" when we read the words "prime number" in a headline, but this is only because we've found all of them that aren't huge enough to be mysterious. Computing prime numbers is trivial as long as you don't mind them being the same ones everybody else already found.
Not sure what you mean by "trivial". A simple prime-finding algorithm is easy to write down, but it will be very inefficient, especially for large primes (the very ones you use in cryptography).
Coming up with an efficient prime-finding algorithm like Miller-Rabin* is far from trivial.
* Technically it's just a prime-checking algorithm, but you can just generate random numbers until you've verified one of them is prime.
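For reference, a compact sketch of that approach (my own toy Python, not production crypto code; the function names and the round count are arbitrary choices):

```python
import random

def is_probably_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin probabilistic primality test (sketch)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:               # write n - 1 as d * 2^s with d odd
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False            # a is a witness: n is definitely composite
    return True                     # no witness found: n is almost surely prime

def random_prime(bits: int = 512) -> int:
    """Generate random odd candidates of the right size until one passes."""
    while True:
        n = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probably_prime(n):
            return n
```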
I do find it useful to know that regular expressions in JS (and many other languages, really) are not actually regular, and why - it allows one to decide whether a given problem can be solved at all using them.
> Regular expressions are equivalent to finite state machines, which cannot count arbitrarily high (any state machine that can count arbitrarily high needs an infinite number of states, since there are infinitely many natural numbers).
The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high.
Informally, regular expressions like in Perl are a kind of string rewriting. And many string rewriting rules, if they can be iteratively applied, are Turing-complete (with the usual caveat about infinite storage and time). The usual definition of a regular expression excludes that kind of iterative application, though. It's just a matching pattern; rewriting is technically outside of the concept of a regular expression.
> The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high.
Right. And if you really want to push that argument further then there are numbers for which your machine can't decide whether they are prime or not. Most of them, in fact, i.e., for all but a finite number of primes and non-primes, your machine can't tell.
For the theory of computability, this is not a useful model. Neither is it for complexity theory. Statements like "my machine can sort in constant time" are arguably true but not useful either: they only hold up to a certain collection size, which in a sense makes them both nonsensical and useless. Even concepts like the Chomsky hierarchy, with context-free languages as one of its prominent levels, make little sense as long as your computational model is finite-state.
Instead, we model them as unbounded, to be able to differentiate between classes of computation problems as well as reason about complexity. The Turing Machine is one of the simple and elegant models, but there are others for other purposes.
I wish that basic education about the theory of computation would stress more why theory treats things like this, to avoid this kind of confusion. (And then people would need to take those classes as well, and not just claim they can write React apps just fine without a degree.)
> And then people would need to take those classes as well, and not just claim they can write React apps just fine without a degree.
Weird tangent. They can write React apps without a degree, can't they? What's wrong with that?
The amount of complexity reasoning you need for something like that is around "this part seems slow, maybe there's a better algorithm somewhere" or "that's a lot of nested loops, maybe I can do with less". Knowing more might help in some pretty rare cases but doesn't seem like a requirement. On the other hand, knowing the complexity of all algorithms and blindly assuming "good complexity = good performance" while never checking is a recipe for disaster.
> On the other hand, knowing the complexity of all algorithms and blindly assuming "good complexity = good performance" while never checking is a recipe for disaster.
Exactly. That's the difference between an actual education and having looked up some Big-O notation on Wikipedia.
> The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high
Come on, can we drop this useless trivia that floats around? No, computers are Turing machines for all practical purposes, and no, the memory is not the tape, unless your computer is sandboxed from everything. As soon as it has some side-effect channels, its “tape” can be as large as needed. For example, it can use the internet to store petabytes of state. But if we want to be more pedantic, it can control and observe physical things, making the whole universe its state space.
Which is still finite, so yeah, even more pedantically you are right, due to mathematical infinity not existing; but the defining resource of a Turing machine is “lazy”: if you don’t hit the limit, you may as well reason as if it were a Turing machine.
> The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high.
I'm not sure how that's relevant. Even given infinite space, there still wouldn't exist a Regular Expression (the CS kind) you could write down in finite time that would match only primes.
At least IIRC, it's been a 20+ years since I last took a Discrete Math course :)
> At least IIRC, it's been a 20+ years since I last took a Discrete Math course :)
Your memory is correct, and there is a very simple intuition for this. Regular languages are those that can be matched by a machine with bounded memory. So, let's say that bound is n bits. Then the machine has at most 2^n distinct states, so by pigeonhole it cannot tell apart unary strings longer than 2^n: two different lengths must leave it in the same state, after which it behaves identically for both, so it cannot classify both correctly.
> Even given infinite space, there still wouldn't exist a Regular Expression (the CS kind) you could write down in finite time that would match only primes.
The point is, neither would there exist a non-(Regular Expression) method that could do such, if you're running it on a finite state machine (your computer).
But that argument is not interesting. "We can't count arbitrarily high because the universe is finite." Yeah, duh. (Observable universe, for the pedants.)
It's more interesting to ask: suppose you did have a magical stick of RAM that had infinite storage, and suppose you augmented your computer to store and query that memory. Then -- yes -- your computer could determine whether any number, however large, is prime. An FSM could not.
If you have an infinite stick of RAM, regular expressions wouldn't be FSMs either; they could have an infinite state space. They would terminate in finite time that is exponential in the size of the prime, but would nevertheless still halt.
A finite state machine can’t have an infinite state space. Also, you can write a finite Turing-machine program that does exactly that, but you can’t write a finite description of an “infinite” state space, which is the whole point.
Ad absurdum, everything would be computable (recognizable) by FSMs, simply by listing all the infinitely many possibilities: listOfAllPrimes.concat(“|”), here you are.
The point being made is that “regular expressions”, in the original CS meaning, do not have stack space or anything analogous. “Regular expressions” in modern programming practice are often different and more powerful, of course.
There is; the property of being a Turing machine would only be thrown out when you actually hit the limit. Besides computers being able to modify state not only in their own memory but also in, e.g., outside memory, I can also look at my computer as a lazily-taped Turing machine that chugs along writing only to its memory, and when it is about to run off the end it pauses and waits for me to add more memory.
This is just useless pedantry that diminishes Chomsky's very important categorization.
Personally, I use "regex" for the informal string-matching/rewriting engines found in most languages and "regular expression" for the formal definition.
I don't think the GP is the original poster, so them claiming how they "personally" differentiate between the two does not make the submission title a lie.
A finite, self-referential algorithm that checks all numbers for being prime does exist. Running it on a finite computer will eventually hit the bounds of the computer, but that is a limitation outside of the algorithm.
A finite regular expression that checks all numbers for being prime cannot exist already in principle.
Technically, a computer is also just a large finite state machine. Which suggests computability theory is unable to accurately account for the intuitive difference between computers and simpler automatons. I suspect it has something to do with the computational time/space complexity of the algorithms they are able to execute.
Very cool!
The top answer has a 12,731-character regular expression that matches base-10 numbers that are divisible by 7. (There is an answer with only 105 characters, but that uses .NET "regular expressions", which have some pretty gnarly extensions and are definitely not regular languages).
The regex itself is generated from a relatively simple DFA based on mod 7 arithmetic on digits using a program (JFLAP apparently, not familiar).
Just from looking at the start of the regex you can see some cool things. For instance, TIL that all numbers of the form 1555....55554 and 85555....5554 are multiples of 7. Fun exercise to prove it.
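Both observations are easy to check in a few lines of Python (a quick sketch of the same mod-7 idea, not the JFLAP-generated automaton): the DFA's state is just the running remainder mod 7, and the 15…54 / 85…54 patterns can be spot-checked directly.

```python
def divisible_by_7(digits: str) -> bool:
    """DFA with states 0..6: read decimal digits left to right, track the remainder."""
    state = 0
    for d in digits:
        state = (state * 10 + int(d)) % 7   # the DFA's transition function
    return state == 0                       # accepting state: remainder 0

for k in range(12):
    assert divisible_by_7('1' + '5' * k + '4')   # 14, 154, 1554, 15554, ...
    assert divisible_by_7('8' + '5' * k + '4')   # 84, 854, 8554, 85554, ...
# e.g. 1554 = 7 * 222 and 8554 = 7 * 1222
```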
I'm the guy who wrote the original implementation that the question mentions. I have no clue who 'Charles' who asked the question is, or how he found my toy implementation.
Could the AKS primality test, which showed that PRIMES is in P, be implemented as an honest regular expression? Presumably not, because then we'd know that PRIMES is not just in P but is a regular language, which would be too good to be true.
Sorry, but how is this different from pointing to three apples and explaining to someone that this is what three is? Would you say it's not three because they are apples? I think it does not really matter how you represent the number. After all, all you see on your screen is "fake"; it's just binary under the hood.
The problem with the unary numeral system is that numerals grow linearly with value, while for binary (ternary, etc.) they only grow logarithmically. So they need much less space and can be read and written much faster. An algorithm which relies on unary numerals probably has worse computational complexity than one that uses binary or higher.
Actually, an algorithm working on unary input tends to have better computational complexity than an algorithm working on binary input: an algorithm that is polynomial in the numeric value will be a polynomial algorithm on unary input, but an exponential algorithm on binary input. Such algorithms are usually called pseudo-polynomial[0]; indeed, this kind of primality testing is pseudo-polynomial.
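A concrete example of the distinction (a sketch under the usual textbook framing, with trial division as a stand-in): the loop below does work proportional to sqrt(n), i.e. polynomial in the numeric value of n, which is polynomial in the length of a unary encoding of n but exponential in the length of its binary encoding.

```python
def is_prime_trial_division(n: int) -> bool:
    """Pseudo-polynomial primality check: ~sqrt(n) iterations in the worst case."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:   # sqrt(n) steps: poly in the value, exp in the bit length
        if n % d == 0:
            return False
        d += 1
    return True
```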
> an algorithm that is polynomial in the numeric value will be a polynomial algorithm on unary input, but an exponential algorithm on binary input.
That's comparing apples to oranges, because the two input types have vastly different lengths relative to their value. Unary input length itself grows exponentially with binary input length, for the same numeric value, cancelling out the unary advantage. So unary isn't faster than binary.
In fact, I think it's pretty clear that unary representation is slower and takes more space. Just think about the rough number of steps you would need to add or even multiply two very large numbers, e.g. in a Turing machine. It would obviously be vastly more if the numbers are given in unary rather than in binary.
> That's comparing apples to oranges, because the two input types have vastly different lengths relative to their value.
It really is not, because complexity is fundamentally a function of _the length of the input_ (specifically, it measures how the runtime (or space usage) grows with growing input). If you have longer input, your Turing machine can spend more time to compute its answer. Also note that this assumes that _the input_ is encoded in unary; if you get binary input and need to spend time and space to convert it to unary representation, sure, that will lead to an exponential blow-up.
Edit:
> Just think about the rough number of steps you would need to add or even multiply two very large numbers, e.g. in a Turing machine. It would be obviously vastly more if the numbers are given in unary rather than in binary.
First, note that those are not actually pseudo-polynomial, as the number of steps needed for addition (or multiplication) depends on the number of digits, not the numeric values involved. Yet, even here unary encoding _does not have worse complexity_, since you still take polynomially many steps in the length of the input (i.e., the length of the unary encodings of the input values). Yes, all the inputs will be exponentially larger, so in practice it's certainly not the better algorithm, but _the complexity_ is not worse, since that's only concerned with the asymptotic behaviour.
When we deal with algorithms involving numbers, we are interested in actual numbers, not in their numeral representation. It would be absurd to compare unary and binary addition relative to the same numeral length. We are ultimately interested in numbers, not numerals. Adding one million to one million is simply much faster for binary addition than for unary addition, even if unary addition has better complexity relative to its own encoding than binary addition has to its encoding. We are interested in complexity relative to the numerical value, not in the length of their respective encodings.
Edit: By the way, as the Wikipedia piece notes, the binary addition algorithm is O(log(n)) in time _relative to the value_, while unary addition is presumably O(n) relative to the value (just writing the numeral out alone takes O(n) steps). So the time complexity is in fact better. Probably the same holds for space complexity. Moreover, only in this case does a comparison even make sense, since we are comparing the same values in both cases, while for the numeral case we would be comparing the lengths of two different types of numerals, apples to oranges.
> It would be absurd to compare unary and binary addition relative to the same numeral length.
Why? The Turing machine has no concept of numeric values, it only knows about the length of whatever the input is.
> Adding one million to one million is simply much faster for binary addition than for unary addition, even if unary addition has better complexity relative to its own encoding than binary addition has to its encoding.
From a complexity standpoint, adding one million to one million is O(1), regardless of the encoding.
> We are interested in complexity relative to the numerical value, not in the length of their respective encodings.
But the complexity relative to the numerical value is not even well-defined, since it _depends_ on the choice of the encoding.
> By the way, as the Wikipedia piece notices, the binary addition algorithm is O(log(n)) in time _relative to the value_, while unary addition is presumably O(n) relative to the value (just writing the numeral out alone takes O(n) steps). So the time complexity is in fact better.
It really is not better. Time complexity is _always_ measured relative to the length of the input, and for binary encoding, the length of the input is O(log(n)), so, taking O(log(n)) steps for the addition is linear, same as the linear time needed for adding numbers in unary encoding (which is basically just copying the input to the output).
> Moreover, only in this case makes a comparison even sense, since we are comparing the same values in both cases, while for the numeral case we would be comparing the lengths of two different types of numerals, apples to oranges.
But _the numeral is not the input_, its _encoding_ is. This is really the whole point, and it is _precisely_ why you get different complexities for different encodings.
> Why? The Turing machine has no concept of numeric values, it only knows about the length of whatever the input is.
That's irrelevant. We are interested in addition, or multiplication, or whatever, which is an operation between numbers (values), and it can be faster or slower depending on which numerals (encoding) are used.
> But the complexity relative to the numerical value is not even well-defined, since it _depends_ on the choice of the encoding.
Yes, it depends on the encoding; no, it is of course well-defined. You can measure two different algorithms which use two different encodings relative to the same thing, the value.
> complexity is _always_ measured relative to the length of the input
That's simply not true. There is even a Wikipedia article about it. Even if it calls it "pseudo".
> But _the numeral is not the input_, its _encoding_ is
But indeed, that isn't what three is; three is the concept that comes from saying "these are three apples", then "these are three pears", then "these are three cars", and then saying "three" is the conceptual thing they have in common.
And here, validating a string of ones is simply not what anyone would mean if you said "check whether this number is a prime number" without any qualification, so I agree that whilst still cute, the title is a bit click-baity and not at all what I expected it to be about either.
EDIT: it seems, though, that the original post, shared via archive.org elsewhere in the thread, does do this for what one would normally assume -- and does so by using `(1 x shift) !~` -- so if anyone else was confused, read the original post.
Think about it a different way: the regex thing can only deal with unary numbers, like your computer can only deal with binary. It's really no different. I'm not sure if you can write a decimal-to-unary converter in regex but I reckon it's possible. At least using the non-regular regex like the original prime program did.
> Regular expression matching is a crucial task in several networking applications. Current implementations are based on one of two types of finite state machines. Non-deterministic finite automata (NFAs) have minimal storage demand but have high memory bandwidth requirements. Deterministic finite automata (DFAs) exhibit low and deterministic memory bandwidth requirements at the cost of increased memory space. It has already been shown how the presence of wildcards and repetitions of large character classes can render DFAs and NFAs impractical. Additionally, recent security-oriented rule-sets include patterns with advanced features, namely back-references, which add to the expressive power of traditional regular expressions and cannot therefore be supported through classical finite automata.
> In this work, we propose and evaluate an extended finite automaton designed to address these shortcomings. First, the automaton provides an alternative approach to handle character repetitions that limits memory space and bandwidth requirements. Second, it supports back-references without the need for back-tracking in the input string. In our discussion of this proposal, we address practical implementation issues and evaluate the automaton on real-world rule-sets. To our knowledge, this is the first high-speed automaton that can accommodate all the Perl-compatible regular expressions present in the Snort network intrusion and detection system.
----
The awkward part is that a PCRE is demonstrably more powerful than a regular language, but not as powerful as a CFG (you can write CFGs that can't be matched by a PCRE - ([{}]) matching, a^nb^n and so on). So the question that gets interesting is "can every PCRE be matched by a CFG?"
> Note that the set of grammars corresponding to recursive languages is not a member of this hierarchy; these would be properly between Type-0 and Type-1.
So there's a class of languages between those of the Turing machine and those of the linear bounded Turing machine (the machine that matches the context-sensitive languages).
Is the PCRE something between Type-2 and Type-3? or is it a Type-3+I?
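One standard data point here (my example, not from the paper quoted above): a single backreference already matches the copy language { ww : w over {a, b} }, which no finite automaton recognizes and which is also a textbook example of a language that is not context-free.

```python
import re

# Matches exactly the strings that consist of some word w over {a, b}
# immediately followed by a second copy of the same w.
copy = re.compile(r'^([ab]+)\1$')

print(bool(copy.match('abab')),      # True  (w = "ab")
      bool(copy.match('aabbaabb')),  # True  (w = "aabb")
      bool(copy.match('abba')))      # False (not of the form ww)
```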
> Many extensions and variations on Chomsky's original hierarchy of formal grammars have been developed, both by linguists and by computer scientists, usually either in order to increase their expressive power or in order to make them easier to analyze or parse. ...
I encourage you to re-read the article: you check that the number does not match the regex:
> Is 7 prime?
>
> To know this, the function first generates “1111111” (from “1” * 7) and tries to
> see if that string does not match /^1?$|^(11+?)\1+$/. If there is no match, then
> the number is prime.
>
> Notice that the regular expression has two parts (separated with the vertical bar |).
>
> The first part, /^1?$/, is trivial and matches the beginning of line (^), an
> optional 1 (1?) and the end of line ($), which implies that it matches either the
> empty string or “1”. This simply indicates that calling that function with n==0
> or n==1 will correctly return false (as the “1” * n will match the first
> part of the regular expression)
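For anyone who wants to poke at it, the whole check also fits in a couple of lines of Python (a direct transliteration of the Perl one-liner; Python's re supports the same backreference syntax):

```python
import re

def is_prime(n: int) -> bool:
    # Prime iff the unary representation does NOT match:
    #   ^1?$         rejects 0 and 1
    #   ^(11+?)\1+$  matches any length that splits into two or more equal
    #                chunks of length >= 2, i.e. any composite length
    return not re.match(r'^1?$|^(11+?)\1+$', '1' * n)

print([n for n in range(30) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```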
I agree that the article formatting is a bit misleading.
In case anyone is wondering, GPT-4 provides the same solution. It takes some prodding, though, and you get treated to a lecture about how inefficient it is before it provides the solution.
"Apologies for any confusion before, and I understand the nature of your request now. It's a fun challenge to think about! While it's not practically useful or efficient, and it is actually quite limited, a creative theoretical solution exists that involves regular expressions. This specific solution only works for very small numbers due to the inherent limitations of regex. It's based on the principle that a number is prime if it cannot be expressed as a product of two smaller natural numbers. We represent the number as a repeated string of 1s, then check for repeating sub-patterns.
For instance, this Perl regex checks if a number, represented as unary (a string of 1s), is prime:
/^1?$|^(11+?)\1+$/
In this regex:
^1?$ checks for the case where you have zero or one '1', which represent the numbers 0 and 1, neither of which are prime.
^(11+?)\1+$ attempts to match the input with a repeating group of '1's. If a match is found, the number is not prime, because a repeating subgroup of '1's represents a divisor.
This pattern should be used in a negative sense: if a string of 1s (representing your number in unary form) matches this pattern, it's not prime. If it doesn't match, then it is prime.
Again, this is quite impractical for real use cases as the unary representation is extremely inefficient, and the computation grows quickly with larger numbers, but it's a neat way of showing how flexible and powerful regular expressions can be."