Patch for lookaround assertion in regexp

Post by Colin Fraizer
In 2009, Tomohiro Matsuyama sent a message to this list with a patch
to add lookahead/lookbehind assertions to Emacs regular expressions
(regexps). Is there any plan to incorporate this useful feature into
an official release?
[IMHO, it is incredibly useful. The only responses to Matsuyama-san's
message were positive.]

I'd like to replace the current regexp engine with one that does not
suffer from exponential blowup (i.e. using "Thompson's NFA").
Every feature added to the current regexp engine will make it more
difficult to switch, so I'm not too thrilled at the idea of adding
lookaround operators.
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.

Stefan

Tom

2012-01-23 14:44:21 UTC

Post by Stefan Monnier
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.

Why not use simply PCRE with a bridge layer which translates from emacs
regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
of PCRE, so the translation shouldn't be very difficult.

Why reinvent the wheel when PCRE is already there?

Andreas Schwab

2012-01-23 14:50:47 UTC

Post by Tom
Why not use simply PCRE with a bridge layer which translates from emacs
regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
of PCRE, so the translation shouldn't be very difficult.

Does PCRE implement \c and \s? Does PCRE provide an interface for
searching a memory region with a gap?

Andreas.

--
Andreas Schwab, ***@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Tom

2012-01-23 15:19:58 UTC

Does PCRE implement \c and \s?

If it doesn't then it's a job for the translation layer. Char syntaxes
and categories could be converted into the standard [...] format.

I don't know how efficient it would be, it should be tested, but
pcre can compile regexps too. I don't know if emacs uses some precompiled
format internally if the same regexp is used again and again.

Post by Andreas Schwab
Does PCRE provide an interface for searching a memory region with a gap?

I don't know it should be checked, but other editors use PCRE as their
regex search engine, so there may be some support for that.

Andreas Schwab

2012-01-23 16:14:49 UTC

Post by Tom

Does PCRE implement \c and \s?

If it doesn't then it's a job for the translation layer. Char syntaxes
and categories could be converted into the standard [...] format.

Enumerating the syntax/category members is not an option. There is no
easy way to do that.

Andreas.

--
Andreas Schwab, ***@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Stefan Monnier

2012-01-23 17:11:26 UTC

Post by Tom
If it doesn't then it's a job for the translation layer. Char syntaxes
and categories could be converted into the standard [...] format.

Enumerating the syntax/category members is not an option.

Indeed.

Post by Andreas Schwab
There is no easy way to do that.

For `categories', there is a way, but the result is a *very* large [...]
chunk, so it's impractical. For `syntax' there is indeed no way, since
the syntax of a char doesn't only depend on the char itself but also of
the `char-table' text-property that might be applied to that particular
character position (and of course, if we ignore this problem, we're
still back to the same problem of enormous [...] expressions, as is the
case for categories).
These entities really need to be implemented inside the regexp-engine
(but they're usually pretty easy to implement there).

Stefan

Štěpán Němec

2012-01-23 18:45:00 UTC

On Mon, 23 Jan 2012 18:11:26 +0100

Post by Tom
If it doesn't then it's a job for the translation layer. Char syntaxes
and categories could be converted into the standard [...] format.

Enumerating the syntax/category members is not an option.

Indeed.

Post by Andreas Schwab
There is no easy way to do that.

OTOH using something like PCRE would finally fix the currently erroneous
implementation of classes like [:space:], which now is the same as \s-.
(And personally I would gladly forgo the syntax categories for standard
[:classes:], although I imagine the former might be used by the
font-locking or somewhere... I never felt the need for them.)

--
Štěpán

Juri Linkov

2012-01-30 00:31:56 UTC

Post by Å tÄpÃ¡n NÄmec
OTOH using something like PCRE would finally fix the currently erroneous
implementation of classes like [:space:], which now is the same as \s-.
(And personally I would gladly forgo the syntax categories for standard
[:classes:], although I imagine the former might be used by the
font-locking or somewhere... I never felt the need for them.)

`pcrecallout' could help to translate \c and \s to PCRE.

Stefan Monnier

2012-01-23 15:31:52 UTC

Post by Tom

Post by Stefan Monnier
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.

AFAIK PCRE uses a backtracking implementation, so suffers from the same
exponential blowup as the current code.

Post by Tom
Why reinvent the wheel when PCRE is already there?

I didn't mean to say that I want a fresh new engine written
from scratch. But adding pcre to the list of libraries linked with
Emacs is not nearly enough: someone has to write the bridge layer and it
may turn out not to be as simple as it seems.

Stefan

Nikolai Weibull

2012-01-24 08:41:11 UTC

I'd like to replace the current regexp engine with one that does not
suffer from exponential blowup (i.e. using "Thompson's NFA").
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.

As an alternative to PCRE, which, as has already been pointed out,
doesn’t match any of these requirements, how about RE2?

http://code.google.com/p/re2/

It’s written in C++, which is a minus, but it should be simple enough
to extend it with \c and \s.

Stefan Monnier

2012-01-24 14:40:38 UTC

Post by Nikolai Weibull

As an alternative to PCRE, which, as has already been pointed out,
doesn’t match any of these requirements, how about RE2?
http://code.google.com/p/re2/
It’s written in C++, which is a minus, but it should be simple enough
to extend it with \c and \s.

That might work, indeed (tho someone still has to write the
corresponding code).
Note that it does not support lookaround assertions.

Stefan

Nikolai Weibull

2012-01-24 15:09:36 UTC

Post by Nikolai Weibull
As an alternative to PCRE, which, as has already been pointed out,
doesn’t match any of these requirements, how about RE2?
http://code.google.com/p/re2/
It’s written in C++, which is a minus, but it should be simple enough
to extend it with \c and \s.

That might work, indeed (tho someone still has to write the
corresponding code).
Note that it does not support lookaround assertions.

True, but you can, as far as I know, not do so without (allowing for)
exponential behavior.

I don’t want to detract from the merits of lookaround assertions (or
start a discussion on the subject), but I’ve always found them to be a
sign of improper use of (no longer) regular expressions.

Stefan Monnier

2012-01-24 17:34:58 UTC

Post by Nikolai Weibull

That might work, indeed (tho someone still has to write the
corresponding code).
Note that it does not support lookaround assertions.

True, but you can, as far as I know, not do so without (allowing for)
exponential behavior.

Actually, no. Contrary to backreferences (which are outside of the
mathematical notion of regular expressions, and can't be matched in
linear time), lookahead assertions are "normal". So RE2 may get support
for lookahead assertions in the future (maybe for lookbehind as well,
tho that's more difficult).

Post by Nikolai Weibull
I don’t want to detract from the merits of lookaround assertions (or
start a discussion on the subject), but I’ve always found them to be a
sign of improper use of (no longer) regular expressions.

I just pointed it out as supporting my argument that I'd rather not add
lookaround assertions since it may make it more difficult to change to
an linear-time matcher later on.

Stefan

David De La Harpe Golden

2012-01-24 23:27:01 UTC

If we're mentioning engines, CL-PPCRE by Dr. Edi Weitz is _in_ (common)
lisp and generally nice (also 2-clause BSD-style licensed):
http://weitz.de/cl-ppcre/

Nikolai Weibull

2012-01-25 06:07:22 UTC

On Wed, Jan 25, 2012 at 00:27, David De La Harpe Golden

If we're mentioning engines, CL-PPCRE by Dr. Edi Weitz is _in_ (common) lisp
http://weitz.de/cl-ppcre/

CL-PPCRE is implemented using a backtracing NFA, so it’s not suited to
Stefan’s requirements. It would, had it not been for this fact, been
a nice alternative.

Daniel Colascione

2012-02-14 17:53:07 UTC

Implementing a fully general NFA-based regular expression matching
engine that support submatches is hard. The only two useful
implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
both of which are two-clause BSD licensed. Laurikari wrote his thesis
[2] on the latter. TRE is the better of the two libraries, IMHO,
because it's single-pass and can work over arbitrary kinds of input
stream (like characters in a gap buffer). TRE's approximate matching
support is occasionally useful as well.

That said, I'd actually prefer to head in the other direction and
allow users to express arbitrarily rich grammars using an extended rx
syntax. The idea is basically to support parser combinator grammars,
which can be efficiently matched using a scannerless GRL parser, which
is O(N^3) time, or a memozing and backtracking "packrat" parser, which
is O(N) time and O(n) space. The end result would look a bit like Perl
6's rules.

I have some ultra-preliminary work at
https://github.com/dcolascione/jezebel.

Basically, instead of matching regular expressions more efficiently,
we should add support for efficient parsing using
declaratively-constructed grammars, of which regular expressions are
simply a special case. Such support would be incredibly useful for
processing languages like Perl and C++.

[1] http://laurikari.net/tre/about/
[2] http://laurikari.net/ville/regex-submatch.pdf

Stefan Monnier

2012-02-14 18:36:32 UTC

Post by Daniel Colascione
Implementing a fully general NFA-based regular expression matching
engine that support submatches is hard. The only two useful
implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
both of which are two-clause BSD licensed. Laurikari wrote his thesis
[2] on the latter. TRE is the better of the two libraries, IMHO,
because it's single-pass and can work over arbitrary kinds of input
stream (like characters in a gap buffer). TRE's approximate matching
support is occasionally useful as well.

I'm familiar with the work, yes. TRE seemed like the best option last
time I looked around.

Post by Daniel Colascione
That said, I'd actually prefer to head in the other direction and
allow users to express arbitrarily rich grammars using an extended rx
syntax.

I think that would be orthogonal: we want regexp support because it's
efficient (yes, our current implementation is super slow in some cases,
but it's also efficient in many important cases).

I also would like a new regexp engine to fix the "backward matching"
problem so that looking-back can work the way most people would expect,
and doesn't need a `greedy' hack. The fact that regexps are symmetric
is a very neat property (operator precedence grammars enjoy the same
property, which is one of the reasons why I chose them as the basis for
SMIE).

Post by Daniel Colascione
The idea is basically to support parser combinator grammars,
which can be efficiently matched using a scannerless GRL parser, which
is O(N^3) time, or a memozing and backtracking "packrat" parser, which
is O(N) time and O(n) space. The end result would look a bit like Perl
6's rules.

While these are algorithmically reasonably efficient, it can be
difficult to make them as efficient as a regexp-only matcher for many
typical regexps. Also it can be difficult to make them work backwards.
IOW I don't think that can replace regexps given the amount of regexps
out there we have to support.

Stefan

Dimitri Fontaine

2012-02-20 16:19:46 UTC

There's also the Henry Spencer regexp code in use in TCL and PostgreSQL,
where providing an independent library for it is currently being
discussed. See this email and its thread:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg00782.php

Regards,

--
dim

Tomohiro Matsuyama

2012-09-26 06:55:41 UTC