Discussion:
Patch for lookaround assertion in regexp
Colin Fraizer
2012-01-23 11:17:34 UTC
Permalink
In 2009, Tomohiro Matsuyama sent a message to this list with a patch to add
lookahead/lookbehind assertions to Emacs regular expressions (regexps). Is
there any plan to incorporate this useful feature into an official release?

[IMHO, it is incredibly useful. The only responses to Matsuyama-san's
message were positive.]

Thanks,
--Colin
Stefan Monnier
2012-01-23 14:11:09 UTC
Permalink
Post by Colin Fraizer
In 2009, Tomohiro Matsuyama sent a message to this list with a patch
to add lookahead/lookbehind assertions to Emacs regular expressions
(regexps). Is there any plan to incorporate this useful feature into
an official release?
[IMHO, it is incredibly useful. The only responses to Matsuyama-san's
message were positive.]
I'd like to replace the current regexp engine with one that does not
suffer from exponential blowup (i.e. using "Thompson's NFA").
Every feature added to the current regexp engine will make it more
difficult to switch, so I'm not too thrilled at the idea of adding
lookaround operators.
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.


Stefan
Tom
2012-01-23 14:44:21 UTC
Permalink
Post by Stefan Monnier
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.
Why not use simply PCRE with a bridge layer which translates from emacs
regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
of PCRE, so the translation shouldn't be very difficult.

Why reinvent the wheel when PCRE is already there?
Andreas Schwab
2012-01-23 14:50:47 UTC
Permalink
Post by Tom
Why not use simply PCRE with a bridge layer which translates from emacs
regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
of PCRE, so the translation shouldn't be very difficult.
Does PCRE implement \c and \s? Does PCRE provide an interface for
searching a memory region with a gap?

Andreas.
--
Andreas Schwab, ***@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
Tom
2012-01-23 15:19:58 UTC
Permalink
Post by Andreas Schwab
Post by Tom
Why not use simply PCRE with a bridge layer which translates from emacs
regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
of PCRE, so the translation shouldn't be very difficult.
Does PCRE implement \c and \s?
If it doesn't then it's a job for the translation layer. Char syntaxes
and categories could be converted into the standard [...] format.

I don't know how efficient it would be, it should be tested, but
pcre can compile regexps too. I don't know if emacs uses some precompiled
format internally if the same regexp is used again and again.
Post by Andreas Schwab
Does PCRE provide an interface for searching a memory region with a gap?
I don't know it should be checked, but other editors use PCRE as their
regex search engine, so there may be some support for that.
Andreas Schwab
2012-01-23 16:14:49 UTC
Permalink
Post by Tom
Post by Andreas Schwab
Post by Tom
Why not use simply PCRE with a bridge layer which translates from emacs
regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
of PCRE, so the translation shouldn't be very difficult.
Does PCRE implement \c and \s?
If it doesn't then it's a job for the translation layer. Char syntaxes
and categories could be converted into the standard [...] format.
Enumerating the syntax/category members is not an option. There is no
easy way to do that.

Andreas.
--
Andreas Schwab, ***@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
Stefan Monnier
2012-01-23 17:11:26 UTC
Permalink
Post by Andreas Schwab
Post by Tom
If it doesn't then it's a job for the translation layer. Char syntaxes
and categories could be converted into the standard [...] format.
Enumerating the syntax/category members is not an option.
Indeed.
Post by Andreas Schwab
There is no easy way to do that.
For `categories', there is a way, but the result is a *very* large [...]
chunk, so it's impractical. For `syntax' there is indeed no way, since
the syntax of a char doesn't only depend on the char itself but also of
the `char-table' text-property that might be applied to that particular
character position (and of course, if we ignore this problem, we're
still back to the same problem of enormous [...] expressions, as is the
case for categories).
These entities really need to be implemented inside the regexp-engine
(but they're usually pretty easy to implement there).


Stefan
Štěpán Němec
2012-01-23 18:45:00 UTC
Permalink
On Mon, 23 Jan 2012 18:11:26 +0100
Post by Stefan Monnier
Post by Andreas Schwab
Post by Tom
If it doesn't then it's a job for the translation layer. Char syntaxes
and categories could be converted into the standard [...] format.
Enumerating the syntax/category members is not an option.
Indeed.
Post by Andreas Schwab
There is no easy way to do that.
For `categories', there is a way, but the result is a *very* large [...]
chunk, so it's impractical. For `syntax' there is indeed no way, since
the syntax of a char doesn't only depend on the char itself but also of
the `char-table' text-property that might be applied to that particular
character position (and of course, if we ignore this problem, we're
still back to the same problem of enormous [...] expressions, as is the
case for categories).
These entities really need to be implemented inside the regexp-engine
(but they're usually pretty easy to implement there).
OTOH using something like PCRE would finally fix the currently erroneous
implementation of classes like [:space:], which now is the same as \s-.
(And personally I would gladly forgo the syntax categories for standard
[:classes:], although I imagine the former might be used by the
font-locking or somewhere... I never felt the need for them.)
--
Štěpán
Juri Linkov
2012-01-30 00:31:56 UTC
Permalink
Post by Štěpán Němec
OTOH using something like PCRE would finally fix the currently erroneous
implementation of classes like [:space:], which now is the same as \s-.
(And personally I would gladly forgo the syntax categories for standard
[:classes:], although I imagine the former might be used by the
font-locking or somewhere... I never felt the need for them.)
`pcrecallout' could help to translate \c and \s to PCRE.
Stefan Monnier
2012-01-23 15:31:52 UTC
Permalink
Post by Tom
Post by Stefan Monnier
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.
Why not use simply PCRE with a bridge layer which translates from emacs
regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
of PCRE, so the translation shouldn't be very difficult.
AFAIK PCRE uses a backtracking implementation, so suffers from the same
exponential blowup as the current code.
Post by Tom
Why reinvent the wheel when PCRE is already there?
I didn't mean to say that I want a fresh new engine written
from scratch. But adding pcre to the list of libraries linked with
Emacs is not nearly enough: someone has to write the bridge layer and it
may turn out not to be as simple as it seems.


Stefan
Nikolai Weibull
2012-01-24 08:41:11 UTC
Permalink
Post by Stefan Monnier
Post by Colin Fraizer
In 2009, Tomohiro Matsuyama sent a message to this list with a patch
to add lookahead/lookbehind assertions to Emacs regular expressions
(regexps).  Is there any plan to incorporate this useful feature into
an official release?
I'd like to replace the current regexp engine with one that does not
suffer from exponential blowup (i.e. using "Thompson's NFA").
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.
As an alternative to PCRE, which, as has already been pointed out,
doesn’t match any of these requirements, how about RE2?

http://code.google.com/p/re2/

It’s written in C++, which is a minus, but it should be simple enough
to extend it with \c and \s.
Stefan Monnier
2012-01-24 14:40:38 UTC
Permalink
Post by Nikolai Weibull
Post by Stefan Monnier
Post by Colin Fraizer
In 2009, Tomohiro Matsuyama sent a message to this list with a patch
to add lookahead/lookbehind assertions to Emacs regular expressions
(regexps).  Is there any plan to incorporate this useful feature into
an official release?
I'd like to replace the current regexp engine with one that does not
suffer from exponential blowup (i.e. using "Thompson's NFA").
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.
As an alternative to PCRE, which, as has already been pointed out,
doesn’t match any of these requirements, how about RE2?
http://code.google.com/p/re2/
It’s written in C++, which is a minus, but it should be simple enough
to extend it with \c and \s.
That might work, indeed (tho someone still has to write the
corresponding code).
Note that it does not support lookaround assertions.


Stefan
Nikolai Weibull
2012-01-24 15:09:36 UTC
Permalink
Post by Stefan Monnier
Post by Nikolai Weibull
As an alternative to PCRE, which, as has already been pointed out,
doesn’t match any of these requirements, how about RE2?
http://code.google.com/p/re2/
It’s written in C++, which is a minus, but it should be simple enough
to extend it with \c and \s.
That might work, indeed (tho someone still has to write the
corresponding code).
Note that it does not support lookaround assertions.
True, but you can, as far as I know, not do so without (allowing for)
exponential behavior.

I don’t want to detract from the merits of lookaround assertions (or
start a discussion on the subject), but I’ve always found them to be a
sign of improper use of (no longer) regular expressions.
Stefan Monnier
2012-01-24 17:34:58 UTC
Permalink
Post by Nikolai Weibull
Post by Stefan Monnier
Post by Nikolai Weibull
As an alternative to PCRE, which, as has already been pointed out,
doesn’t match any of these requirements, how about RE2?
http://code.google.com/p/re2/
It’s written in C++, which is a minus, but it should be simple enough
to extend it with \c and \s.
That might work, indeed (tho someone still has to write the
corresponding code).
Note that it does not support lookaround assertions.
True, but you can, as far as I know, not do so without (allowing for)
exponential behavior.
Actually, no. Contrary to backreferences (which are outside of the
mathematical notion of regular expressions, and can't be matched in
linear time), lookahead assertions are "normal". So RE2 may get support
for lookahead assertions in the future (maybe for lookbehind as well,
tho that's more difficult).
Post by Nikolai Weibull
I don’t want to detract from the merits of lookaround assertions (or
start a discussion on the subject), but I’ve always found them to be a
sign of improper use of (no longer) regular expressions.
I just pointed it out as supporting my argument that I'd rather not add
lookaround assertions since it may make it more difficult to change to
an linear-time matcher later on.


Stefan
David De La Harpe Golden
2012-01-24 23:27:01 UTC
Permalink
Post by Nikolai Weibull
As an alternative to PCRE, which, as has already been pointed out,
doesn’t match any of these requirements, how about RE2?
http://code.google.com/p/re2/
It’s written in C++, which is a minus, but it should be simple enough
to extend it with \c and \s.
If we're mentioning engines, CL-PPCRE by Dr. Edi Weitz is _in_ (common)
lisp and generally nice (also 2-clause BSD-style licensed):
http://weitz.de/cl-ppcre/
Nikolai Weibull
2012-01-25 06:07:22 UTC
Permalink
On Wed, Jan 25, 2012 at 00:27, David De La Harpe Golden
Post by Nikolai Weibull
As an alternative to PCRE, which, as has already been pointed out,
doesn’t match any of these requirements, how about RE2?
http://code.google.com/p/re2/
It’s written in C++, which is a minus, but it should be simple enough
to extend it with \c and \s.
If we're mentioning engines, CL-PPCRE by Dr. Edi Weitz is _in_ (common) lisp
http://weitz.de/cl-ppcre/
CL-PPCRE is implemented using a backtracing NFA, so it’s not suited to
Stefan’s requirements. It would, had it not been for this fact, been
a nice alternative.
Daniel Colascione
2012-02-14 17:53:07 UTC
Permalink
Post by Stefan Monnier
Post by Colin Fraizer
In 2009, Tomohiro Matsuyama sent a message to this list with a patch
to add lookahead/lookbehind assertions to Emacs regular expressions
(regexps). Is there any plan to incorporate this useful feature into
an official release?
[IMHO, it is incredibly useful. The only responses to Matsuyama-san's
message were positive.]
I'd like to replace the current regexp engine with one that does not
suffer from exponential blowup (i.e. using "Thompson's NFA").
Every feature added to the current regexp engine will make it more
difficult to switch, so I'm not too thrilled at the idea of adding
lookaround operators.
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.
Implementing a fully general NFA-based regular expression matching
engine that support submatches is hard. The only two useful
implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
both of which are two-clause BSD licensed. Laurikari wrote his thesis
[2] on the latter. TRE is the better of the two libraries, IMHO,
because it's single-pass and can work over arbitrary kinds of input
stream (like characters in a gap buffer). TRE's approximate matching
support is occasionally useful as well.

That said, I'd actually prefer to head in the other direction and
allow users to express arbitrarily rich grammars using an extended rx
syntax. The idea is basically to support parser combinator grammars,
which can be efficiently matched using a scannerless GRL parser, which
is O(N^3) time, or a memozing and backtracking "packrat" parser, which
is O(N) time and O(n) space. The end result would look a bit like Perl
6's rules.

I have some ultra-preliminary work at
https://github.com/dcolascione/jezebel.

Basically, instead of matching regular expressions more efficiently,
we should add support for efficient parsing using
declaratively-constructed grammars, of which regular expressions are
simply a special case. Such support would be incredibly useful for
processing languages like Perl and C++.

[1] http://laurikari.net/tre/about/
[2] http://laurikari.net/ville/regex-submatch.pdf
Stefan Monnier
2012-02-14 18:36:32 UTC
Permalink
This post might be inappropriate. Click to display it.
Dimitri Fontaine
2012-02-20 16:19:46 UTC
Permalink
Post by Daniel Colascione
Implementing a fully general NFA-based regular expression matching
engine that support submatches is hard. The only two useful
implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
both of which are two-clause BSD licensed. Laurikari wrote his thesis
[2] on the latter. TRE is the better of the two libraries, IMHO,
There's also the Henry Spencer regexp code in use in TCL and PostgreSQL,
where providing an independent library for it is currently being
discussed. See this email and its thread:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg00782.php

Regards,
--
dim
Tomohiro Matsuyama
2012-09-26 06:55:41 UTC
Permalink
Just reminder.

I'd like to see soon the regexp lookaround assertion feature in trunk.
As Stefan said, the feature could be a bottleneck of the regexp engine replacement.
So, when such the replacement would happen? I'm feeling I have to wait more 10 years.

I'm also skeptical in technical that Tompson's NFA is really good for Emacs.
In my understand, Tompson's NFA is an algorithm that reduces backtracks with restricting dynamism, and may require cpu and memory overhead in many cases.
Is that really what we want for regexp engines to improve scalability?

Tomohiro

Loading...