Now you had me hoping for an implementation of fitting Needle behind the existing Pattern/Matcher API, and a compiler plugin that substitutes Pattern.compile accordingly wherever the inputs are known at compile time.
Author here: one thing I didn't mention in the post, but which is mentioned in the issues/readme on GitHub is that while I've been working on this for awhile, it's still a pretty new project in many ways. So what you're describing is interesting, but probably premature. The generated class files can still be very large, and the compiler itself has comments like "TODO: O(n^2)" that make throwing arbitrary input at it a dodgy idea.
Beyond those factors, there's the matter of just proving out that my testing is good enough, and I haven't missed edge cases.
By looking at the benchmarks, it is interesting to see how much an advantage have non-backtracking engines versus the default Java one. It is also interesting that most of the regular expressions I happen to write are not backtracking, so it’s likely I’m paying a ticket every time for a feature I’m not going to see.
It would be interesting if the default Java class the implements regular expressions would detect whether backtracking is actually used in the regular expression when it is compiled and decide whether the full blown engine is really necessary or a lightweight one would be enough.
The backtracking engines ahead of are pcre2/jit, javascript/v8, d/ldc/std-regex (technically a hybrid I believe) and regress. Java's engine is about on par with Python's and Perl's (which are both written in C).
I’d definitely like to get needle into rebar (or a branch of it) and be able to run a broader set of comparisons.
Also, I’ve learned a lot from your code. A lot of this project was stubbornly banging my head against the problem, but whenever I wanted to see how another engine did things, your library was the one I looked at.
C# does this as well. Its OOB Regex engine can compile expressions to automata at runtime, or there’s an alternate newer back-end to generate the source code at compile time, which is useful in AOT as you can’t JIT the IL there.
looks very interesting - coming from the perl (ie the original PCRE) it’s rather a pity that this is focused on the JVM since Java is not the most regex friendly authoring environment (sorry I do not know if other JVM denizens like Kotlin / Clojure have more natural regex syntax)
Larry Walls version 2 of PCRE is doing some interesting innovation which, while continuing the model of a character level syntax where punctuation <[.+*]> and so on are the “verbs” as the regex Sub-Language (one of the Slangs of raku), now provides deep support for unicode (eg. a digit test looks at the unicode properties, not just <[0..9]>), and can be gradually unpacked using <token> and <rule> methods all the way up to a full blown composable class based Grammar … so you can still write regex one-liners perl style but then evolve that to a small Grammar to reduce line noise and drive up code clarity
Are there any plans to provide your library on the MoarVM?
Author here. I don't know much about the MoarVM, so no plans.
I have very idly thought about being able to output native code for use in other systems, but realistically, I have enough other ideas to keep my nights and weekends busy for a long time.
[1] https://github.com/humio/jitrex [1] https://github.com/humio/jitrex?tab=readme-ov-file#performan...