Post by Stefan MonnierPost by Colin FraizerIn 2009, Tomohiro Matsuyama sent a message to this list with a patch
to add lookahead/lookbehind assertions to Emacs regular expressions
(regexps). Is there any plan to incorporate this useful feature into
an official release?
[IMHO, it is incredibly useful. The only responses to Matsuyama-san's
message were positive.]
I'd like to replace the current regexp engine with one that does not
suffer from exponential blowup (i.e. using "Thompson's NFA").
Every feature added to the current regexp engine will make it more
difficult to switch, so I'm not too thrilled at the idea of adding
lookaround operators.
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.
Implementing a fully general NFA-based regular expression matching
engine that support submatches is hard. The only two useful
implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
both of which are two-clause BSD licensed. Laurikari wrote his thesis
[2] on the latter. TRE is the better of the two libraries, IMHO,
because it's single-pass and can work over arbitrary kinds of input
stream (like characters in a gap buffer). TRE's approximate matching
support is occasionally useful as well.
That said, I'd actually prefer to head in the other direction and
allow users to express arbitrarily rich grammars using an extended rx
syntax. The idea is basically to support parser combinator grammars,
which can be efficiently matched using a scannerless GRL parser, which
is O(N^3) time, or a memozing and backtracking "packrat" parser, which
is O(N) time and O(n) space. The end result would look a bit like Perl
6's rules.
I have some ultra-preliminary work at
https://github.com/dcolascione/jezebel.
Basically, instead of matching regular expressions more efficiently,
we should add support for efficient parsing using
declaratively-constructed grammars, of which regular expressions are
simply a special case. Such support would be incredibly useful for
processing languages like Perl and C++.
[1] http://laurikari.net/tre/about/
[2] http://laurikari.net/ville/regex-submatch.pdf