Needle: A DFA Based Regex Library That Compiles to JVM ByteCode

svilenivanov · 2024-05-08T10:58:58 1715165938

FWIW, there is a similar project [1] which compiles a regex to bytecode. The benchmarks [2] are impressive.

[1] https://github.com/humio/jitrex [1] https://github.com/humio/jitrex?tab=readme-ov-file#performan...

hyperpape · 2024-05-08T11:34:15 1715168055

Interesting—I had seen several different regex libraries for the JVM, but not this one.

BoingBoomTschak · 2024-05-08T09:48:45 1715161725

A related project that might interest some people: https://github.com/telekons/one-more-re-nightmare

And the pretty hard to find blog post about it: https://applied-langua.ge/posts/omrn-compiler.html

usrusr · 2024-05-08T10:18:32 1715163512

Now you had me hoping for an implementation of fitting Needle behind the existing Pattern/Matcher API, and a compiler plugin that substitutes Pattern.compile accordingly wherever the inputs are known at compile time.

hyperpape · 2024-05-08T10:56:27 1715165787

Author here: one thing I didn't mention in the post, but which is mentioned in the issues/readme on GitHub is that while I've been working on this for awhile, it's still a pretty new project in many ways. So what you're describing is interesting, but probably premature. The generated class files can still be very large, and the compiler itself has comments like "TODO: O(n^2)" that make throwing arbitrary input at it a dodgy idea.

Beyond those factors, there's the matter of just proving out that my testing is good enough, and I haven't missed edge cases.

kamma4434 · 2024-05-08T10:43:03 1715164983

By looking at the benchmarks, it is interesting to see how much an advantage have non-backtracking engines versus the default Java one. It is also interesting that most of the regular expressions I happen to write are not backtracking, so it’s likely I’m paying a ticket every time for a feature I’m not going to see.

It would be interesting if the default Java class the implements regular expressions would detect whether backtracking is actually used in the regular expression when it is compiled and decide whether the full blown engine is really necessary or a lightweight one would be enough.

burntsushi · 2024-05-08T12:37:52 1715171872

The set of regex engines being compared here is pretty small, and even among backtracking regex engines, Java's is pretty slow. See: https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa...

The backtracking engines ahead of are pcre2/jit, javascript/v8, d/ldc/std-regex (technically a hybrid I believe) and regress. Java's engine is about on par with Python's and Perl's (which are both written in C).

hyperpape · 2024-05-08T13:19:09 1715174349

I’d definitely like to get needle into rebar (or a branch of it) and be able to run a broader set of comparisons.

Also, I’ve learned a lot from your code. A lot of this project was stubbornly banging my head against the problem, but whenever I wanted to see how another engine did things, your library was the one I looked at.

burntsushi · 2024-05-08T14:58:25 1715180305

Yeah! Please file issues or ask questions if you need help. :-)

Hopefully it's easy enough to follow the `java` example to add your own.

neonsunset · 2024-05-08T08:48:23 1715158103

C# does this as well. Its OOB Regex engine can compile expressions to automata at runtime, or there’s an alternate newer back-end to generate the source code at compile time, which is useful in AOT as you can’t JIT the IL there.

Similar to the blog post, this allows it to be one of the fastest: https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa...

librasteve · 2024-05-08T08:12:03 1715155923

looks very interesting - coming from the perl (ie the original PCRE) it’s rather a pity that this is focused on the JVM since Java is not the most regex friendly authoring environment (sorry I do not know if other JVM denizens like Kotlin / Clojure have more natural regex syntax)

Larry Walls version 2 of PCRE is doing some interesting innovation which, while continuing the model of a character level syntax where punctuation <[.+*]> and so on are the “verbs” as the regex Sub-Language (one of the Slangs of raku), now provides deep support for unicode (eg. a digit test looks at the unicode properties, not just <[0..9]>), and can be gradually unpacked using <token> and <rule> methods all the way up to a full blown composable class based Grammar … so you can still write regex one-liners perl style but then evolve that to a small Grammar to reduce line noise and drive up code clarity

Are there any plans to provide your library on the MoarVM?

hyperpape · 2024-05-08T10:46:09 1715165169

Author here. I don't know much about the MoarVM, so no plans.

I have very idly thought about being able to output native code for use in other systems, but realistically, I have enough other ideas to keep my nights and weekends busy for a long time.

MrBuddyCasino · 2024-05-08T13:08:03 1715173683

Kotlin isn't too bad, you can avoid double-quoting and use .toRegex() on strings:

    val regex = """([\w\s]+) is (\d+) years old""".toRegex()

CoastalCoder · 2024-05-08T11:08:42 1715166522

> the Slangs of raku

Sounds like the name of some race in a sci-fi story :)

exabrial · 2024-05-08T22:27:07 1715207227

> Version 0.0.1

Thank you for semver-ing!