Discussion:
[PATCH] add 'string-distance' to calculate Levenshtein distance
Chen Bin
2018-04-14 02:35:08 UTC
Permalink
Hello,
The attached patch implements 'string-distance', which is much faster than
existing Lisp implementations such as 'org-babel-edit-distance'.

I benchmarked it with the snippet below:

(setq s1 "projs1/emacs/etc/imeees/icon/emacs-document.svg")
(setq s2 "projs2/test/etc/images/icons/emacs-document23.svg")
(let* ((gc-cons-threshold most-positive-fixnum))
  (message "%S vs %S"
           (benchmark-run-compiled 100
             (org-babel-edit-distance s1 s2))
           (benchmark-run-compiled 100
             (string-distance s1 s2))))

Got:
(0.506494223 0 0.0) vs (0.001317414 0 0.0)
Eli Zaretskii
2018-04-14 07:05:10 UTC
Permalink
Date: Sat, 14 Apr 2018 12:35:08 +1000
The attached patch implements 'string-distance', which is much faster than
existing Lisp implementations such as 'org-babel-edit-distance'.
Thanks, please see a few comments below.
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0,
+ doc: /* Return Levenshtein distance of two strings.
Please mention the arguments in the first line of the doc string.
+Case is significant, but text properties are ignored. */)
+ (register Lisp_Object string1, Lisp_Object string2)
We don't like to use 'register' in new code; we prefer leaving this to
the compiler.
+ CHECK_STRING (string1);
+ CHECK_STRING (string2);
+
+ char *s1 = SSDATA (string1);
+ char *s2 = SSDATA (string2);
+ unsigned int s1len, s2len, x, y, lastdiag, olddiag;
+ s1len = strlen(s1);
+ s2len = strlen(s2);
I presume this function is supposed to work only on strings in their
internal representation (a.k.a. "multibyte strings"), so the function
should check that and signal an error if that's not so.
Alternatively, check that either both strings are unibyte or both are
multibyte, and signal an error if not.
+ unsigned int column[s1len+1];
I'm not sure this is portable enough, but even if it is, it's not a
good idea to unconditionally use an automatic array in this case,
because s1len and s2len could be very large, in which case you will
get a stack overflow. Please use SAFE_ALLOCA instead.
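Outside Emacs, the same fix can be sketched with plain malloc (a hedged illustration: SAFE_ALLOCA is an Emacs-internal macro, and `make_column` is a helper name made up here):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Heap-allocate and initialize the DP column instead of declaring
   `unsigned int column[s1len + 1]` as a variable-length array on the
   stack: a VLA of unbounded size can silently overflow the stack,
   while a heap allocation fails detectably.  (Inside Emacs, the idiom
   is SAFE_ALLOCA, which uses alloca for small sizes and falls back to
   the heap for large ones; plain malloc stands in for it here.)  */
unsigned int *
make_column (size_t len)
{
  unsigned int *col = malloc ((len + 1) * sizeof *col);
  if (!col)
    {
      fprintf (stderr, "column allocation failed\n");
      exit (EXIT_FAILURE);
    }
  for (size_t y = 0; y <= len; y++)
    col[y] = (unsigned int) y;	/* column[y] = y, as in the patch */
  return col;
}
```

The caller frees the column when done, which the SAFE_FREE counterpart would handle in the Emacs idiom.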
+ for (y = 1; y <= s1len; y++)
+ column[y] = y;
+ for (x = 1; x <= s2len; x++) {
+ column[0] = x;
+ for (y = 1, lastdiag = x-1; y <= s1len; y++) {
+ olddiag = column[y];
+ column[y] = min3(column[y] + 1, column[y-1] + 1, lastdiag + (s1[y-1] == s2[x-1] ? 0 : 1));
+ lastdiag = olddiag;
Is this algorithm really better than allocating just the diag matrix
(or part thereof), and accessing the string data directly, without
copying it into local arrays?
+ return make_number(column[s1len]);
Style: we leave a single blank between the function's name and the
following opening parenthesis (you have a few of these scattered
through the code).

Finally, this new primitive needs documentation (NEWS and the ELisp
manual), and also a couple of tests in the test suite.

Thanks again for working on this.
Eli Zaretskii
2018-04-14 13:24:41 UTC
Permalink
[Please CC the mailing list when you respond, so others could see your
messages.]
Date: Sat, 14 Apr 2018 21:01:46 +1000
Hi, Eli,
Thanks for the review.
I fixed most issues except two things.
1. In Asia, it's common to calculate the distance between one unibyte and
one multibyte string. As a Chinese speaker, I might create a document whose
file name contains Hanzi characters and is therefore a multibyte string. I
may need to get the distance from this document to a file named "README.txt".
If you mean unibyte pure-ASCII strings, then I agree. But that
doesn't mean we should avoid the test altogether, because we might
compute a non-zero distance between a string and its encoded unibyte
variant, which will confuse users. At the very least the doc string
should say something about this.
2. Algorithm is based on https://en.wikibooks.org/w/index.php?title=Algorithm_Implementation/Strings/Levenshtein_distance&stable=0#C
It's optimized to use O(min(m,n)) space instead of O(mn).
Say you compare two strings each 512 bytes long: you only need to
allocate 512 entries instead of 262K (512*512) in memory.
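The space optimization described above can be sketched in standalone C (a byte-wise illustration of the single-column trick, not the Emacs patch itself; `edit_distance` and `min3` are names chosen for this sketch):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static unsigned int
min3 (unsigned int a, unsigned int b, unsigned int c)
{
  unsigned int m = a < b ? a : b;
  return m < c ? m : c;
}

/* Levenshtein distance over bytes, using a single column of
   min(len1, len2) + 1 entries instead of a full len1 * len2 matrix.  */
unsigned int
edit_distance (const char *s1, const char *s2)
{
  size_t len1 = strlen (s1), len2 = strlen (s2);
  /* Keep the shorter string on the column axis for O(min(m,n)) space.  */
  if (len1 > len2)
    {
      const char *ts = s1; s1 = s2; s2 = ts;
      size_t tl = len1; len1 = len2; len2 = tl;
    }
  unsigned int *column = malloc ((len1 + 1) * sizeof *column);
  for (size_t y = 0; y <= len1; y++)
    column[y] = (unsigned int) y;
  for (size_t x = 1; x <= len2; x++)
    {
      column[0] = (unsigned int) x;
      unsigned int lastdiag = (unsigned int) (x - 1);
      for (size_t y = 1; y <= len1; y++)
	{
	  unsigned int olddiag = column[y];
	  column[y] = min3 (column[y] + 1, column[y - 1] + 1,
			    lastdiag + (s1[y - 1] == s2[x - 1] ? 0 : 1));
	  lastdiag = olddiag;
	}
    }
  unsigned int result = column[len1];
  free (column);
  return result;
}
```

For example, `edit_distance ("heelo", "hello")` is 1: a single substitution.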
Please check the attached patch for the latest code.
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -463,6 +463,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
+++
** New function assoc-delete-all.
+** New function string-distance.
This should mention Levenshtein distance.
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0,
+ doc: /* Return Levenshtein distance of STRING1 and STRING2.
^^^^^^^^^^^^^^^^^^^^^^
"between STRING1 and STRING2"
+ unsigned int s1len, s2len, x, y, lastdiag, olddiag;
These variables should be declared EMACS_INT, not unsigned int,
because the size of Emacs strings can be larger than UINT_MAX,
especially on 64-bit systems.
+ unsigned int *column = SAFE_ALLOCA ((s1len + 1) * sizeof (unsigned int));
Likewise here.
+ char *s1 = SSDATA (string1);
+ char *s2 = SSDATA (string2);
+
+ unsigned int s1len, s2len, x, y, lastdiag, olddiag;
+ s1len = strlen(s1);
+ s2len = strlen(s2);
You could optimize the code by using SCHARS and SBYTES, instead of
calling strlen.
+(ert-deftest subr-tests--string-distance ()
+ "Test `string-distance' behavior."
+ (should (equal 1 (string-distance "heelo" "hello")))
+ (should (equal 2 (string-distance "aeelo" "hello")))
+ (should (equal 0 (string-distance "ab" "ab")))
+ (should (equal 1 (string-distance "ab" "abc"))))
Could you please add a test or two with non-ASCII characters?

Thanks.
Chen Bin
2018-04-14 16:40:18 UTC
Permalink
Correct me if I'm wrong.

I read the code and found the definition of Lisp_String:
struct GCALIGNED Lisp_String
{
  ptrdiff_t size;
  ptrdiff_t size_byte;
  INTERVAL intervals;	/* Text properties in this string. */
  unsigned char *data;
};

I understand that the string text is encoded in UTF-8 format and stored
in 'Lisp_String::data'. There is actually no difference between unibyte
and multibyte text, since UTF-8 is compatible with ASCII and we only deal
with the 'data' field.

I attached the latest patch.
Eli Zaretskii
2018-04-14 17:08:51 UTC
Permalink
Date: Sun, 15 Apr 2018 02:40:18 +1000
Correct me if I'm wrong.
struct GCALIGNED Lisp_String
{
  ptrdiff_t size;
  ptrdiff_t size_byte;
  INTERVAL intervals;	/* Text properties in this string. */
  unsigned char *data;
};
I understand that the string text is encoded in UTF-8 format and stored
in 'Lisp_String::data'. There is actually no difference between unibyte
and multibyte text, since UTF-8 is compatible with ASCII and we only deal
with the 'data' field.
No, that's incorrect. The difference does exist; it just all but
disappears for unibyte strings encoded in UTF-8. But if you encode a
string in some other encoding, like Latin-1, you will see a very
different stream of bytes.
I attached the latest patch.
Thanks.
+ ;; string containing unicode character (Hanzi)
+ (should (equal 6 (string-distance "ab" "ab我她")))
+ (should (equal 3 (string-distance "我" "她"))))
Should the distance be measured in bytes or in characters? I think
it's the latter, in which case the implementation should work in
characters, not bytes.
Chen Bin
2018-04-15 07:15:49 UTC
Permalink
I attached a patch with the latest code.

As required, we now compare the strings by character by default, so
multibyte characters are handled correctly.

'string-distance' has a third optional parameter, 'bytecompare', which
uses byte comparison instead of the default character comparison.

I read this thread carefully, and I know that different algorithms
exist for calculating string distance.

But I feel only my code is the right solution to our original problem.

Our original problem is to provide a faster C solution for calculating
*Levenshtein distance*.

It will replace third-party Lisp solutions like 'org-babel-edit-distance'.

As 'org-babel-edit-distance' is documented, it will "Return the edit
(levenshtein) distance between strings S1 S2". So the problem here is to
calculate *Levenshtein distance*.

Other algorithms have their own definitions of "string distance", so
they are not for *Levenshtein distance*.

Here is a link to "Comparison of String Distance Algorithms":

https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
Eli Zaretskii
2018-04-15 14:47:00 UTC
Permalink
Date: Sun, 15 Apr 2018 17:15:49 +1000
I attached a patch with the latest code.
Thanks, we are getting close.
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
+ doc: /* Return Levenshtein distance between STRING1 and STRING2.
+If BYTECOMPARE is nil, we compare character of strings.
+If BYTECOMPARE is t, we compare byte of strings.
+Case is significant, but text properties are ignored. */)
+ (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare)
I question the need for the BYTECOMPARE flag. Emacs's editing
operations work in characters, not in bytes. There's insert-byte, but
no delete-byte or replace-byte (although applications which for some
reason need that could implement it, albeit not conveniently).
The byte-level operations are not exposed to Lisp for a good reason:
Emacs is primarily a text-processing environment, and text is
processed in character units.

So I think you should remove that option, unless you can explain why
you think it's needed and in what situations.
+ unsigned short *ws1 = 0; /* 16 bit unicode character */
+ unsigned short *ws2 = 0; /* 16 bit unicode character */
+ if(!use_bytecompare)
+ {
+ /* convert utf-8 byte stream to 16 bit unicode array */
+ string1 = code_convert_string_norecord (string1, Qutf_16le, 1);
+ ws1 = (unsigned short *) SDATA (string1);
+ string2 = code_convert_string_norecord (string2, Qutf_16le, 1);
+ ws2 = (unsigned short *) SDATA (string2);
+ }
Conversion to UTF-16 burns cycles, and the function will do the wrong
thing for characters beyond the BMP as a result, because you compare
two 16-bit words instead of full characters.

Instead, please use the macros defined on character.h, either
CHAR_STRING_ADVANCE or FETCH_STRING_CHAR_ADVANCE (or their non-ADVANCE
counterparts), whichever better suits your coding style and needs.
These macros produce the full Unicode codepoint of the string
characters, and you can then compare them without bumping into the
problem with UTF-8 or UTF-16 encoding. The *-ADVANCE variants also
advance the pointer or index to the next character as side effect,
which is handy when examining successive characters in a loop.
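For readers without the Emacs sources at hand, the advance pattern these macros implement can be sketched as a standalone UTF-8 decoder (illustrative only; the real macros also handle unibyte strings, raw bytes, and paired character/byte indices):

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 character starting at s[*i] and advance *i past it,
   returning the full Unicode codepoint.  Each call yields a codepoint
   and moves the index to the next character, so a comparison loop can
   work in characters rather than bytes.  This sketch assumes valid
   UTF-8 and does no error checking.  */
unsigned int
decode_utf8_advance (const unsigned char *s, size_t *i)
{
  unsigned char b = s[(*i)++];
  if (b < 0x80)			/* ASCII fast path */
    return b;
  int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
  unsigned int cp = b & (0x3F >> extra);	/* payload bits of the lead byte */
  while (extra-- > 0)
    cp = (cp << 6) | (s[(*i)++] & 0x3F);
  return cp;
}
```

Decoding "我" (bytes E6 88 91) this way yields the single codepoint U+6211, so a loop comparing codepoints sees one character, not three bytes.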
chen bin
2018-04-17 02:43:33 UTC
Permalink
'bytecompare' only applies the original Levenshtein distance algorithm
to ASCII strings. It's not enabled by default.

As we agreed, it has a pleasant side effect for East Asians. It's not a
matter of right and wrong; it's just a side effect of UTF-8 encoding.

Weighted distance is NOT doable, for one reason: you can't increase the
character weight of a certain race or religion in Emacs. It's
politically wrong.

So I suggest keeping both the Unicode and the ASCII algorithms.

I will study the macros you mentioned tonight and figure out a new
solution by the end of the day.
Date: Tue, 17 Apr 2018 00:29:39 +1000
To be honest, I'm 100% confident about my current solution. I'm
actually an expert in Unicode-related C programming.
I appreciate your expertise, that's not the issue here. The main
issue is how to support all the range of characters that Emacs wants
to support.
It's standard practice to convert UTF-8 characters to UTF-16
before any further string operation.
To answer your questions, I have to introduce East Asian culture first.
China/Japan/Korea use the same Hanzi (Chinese characters). China's
national standard GBK defines 21886 characters used by East Asians.
UTF-16 is compatible with GBK.
Even the most educated Chinese know only 6000~8000 characters.
UTF-32 exists because there are another 50000 ancient Hanzi which no
Chinese uses.
Among the 99089 characters of Unicode v5.0, 71226 characters are Hanzi.
In short, you get about 22000 non-Hanzi characters and 21000 Hanzi
characters. UTF-16 is totally fine for dealing with them.
UTF-16 is fine for characters within the BMP, yes. But Emacs supports
much more, see below.
Say my name is 陈斌 and my colleague's name is 张斌. Only one Hanzi of
difference, right? While reading a file named "陈斌's document0", I want
to list other documents with similar file names.
Using my byte comparison algorithm, "陈斌's document1" and "陈斌's
document2" are at the top of the sorted document list. Makes sense.
Without byte comparison, my family name "陈" is no more important than a
Latin character, so files like "张斌's document1" and "张斌's document2"
are at the top of the list instead.
You are saying that a difference of 1 character has more "weight" for
some characters than for others. That might be so, but using the
number of bytes in the UTF-8 representation as that weight makes very
little sense to me, because the length of the UTF-8 sequence for a
character depends on when that particular character was introduced into
Unicode. It has nothing to do with the "importance" of a character.
Thus, your example might produce a sensible result when comparing
ASCII characters vs non-ASCII characters, but will make much less
sense when comparing one non-ASCII character with another non-ASCII
character. For example, emoji characters, which need 4 bytes in UTF-8
sequences, will be considered twice as "important" as the above Hanzi
characters of your example (and 4 times as important as ASCII). The
results will be totally unexpected and unpredictable (unless the user
knows the entire Unicode database by heart).
So I think features such as what you want to achieve with the above
example will have to use some other additional feature, like "weighted
difference", and using the byte length as a kind of surrogate "weight"
for this purpose is IMO not a good idea, even though it could look
like that at first glance.
2. As mentioned, China has a national standard, GBK, which is
compatible with UTF-16. Every character is a 16-bit integer. It's
common practice to convert a UTF-8 string (or other encoded string) to
UTF-16 format before any further string operation. Otherwise, it's
impossible to display Japanese, Korean, and Chinese in one document
(considering typical user operations: text searching/replacing ...).
I've been doing this since day one of my career because I'm Chinese.
As I said, this is okay for characters in the BMP. But Emacs must
support much more.
First, we support the latest Unicode 10.0, which added a lot of
characters outside the BMP. You have the emoji there, you have many
new symbols there (the Alchemical Symbols, Geometric Shapes, and
Supplemental Arrows-C blocks), Domino Tiles, Mathematical
Alphanumerics, Musical symbols, and many scripts that there's no
reason to tell people we don't support in Emacs.
Moreover, Emacs's internal representation of raw 8-bit bytes uses the
area #x3FFF00..#x3FFFFF, whose characters need 5 bytes in their UTF-8
sequences. We
cannot exclude the possibility that strings we are comparing have
embedded raw bytes in them, this happens in practice and must work as
expected.
The integer is called a wide character (wchar_t in standard C, or WCHAR
in the Windows GUI SDK),
Windows indeed uses a 16-bit wchar_t type, which is why its support
for characters beyond the BMP is problematic and in some cases simply
non-existent, because standard C library functions that accept a
single wide character argument are unable to cope with UTF-16 encoding
that requires 2 16-bit words to represent a character. So, for
example, you don't have a way of looking for a character outside the
BMP in a wide character string by calling the standard function
wcschr, because its 2nd argument is a single wchar_t wide character,
which can only express characters inside the BMP.
Emacs cannot tolerate such limitations, not in the year 2018.
So using implementations that only support the BMP is really a
non-starter for Emacs. We cannot accept such code.
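The surrogate-pair arithmetic behind this limitation takes only a few lines (this is the standard UTF-16 encoding rule; `utf16_surrogates` is a name made up for the sketch):

```c
#include <assert.h>

/* Split a codepoint outside the BMP (cp >= 0x10000) into its UTF-16
   surrogate pair.  U+1F600, an emoji, becomes the two 16-bit units
   D83D DE00, so any code that treats 16-bit units as characters sees
   two "characters" where there is one.  */
void
utf16_surrogates (unsigned int cp, unsigned int *high, unsigned int *low)
{
  cp -= 0x10000;
  *high = 0xD800 + (cp >> 10);	/* top 10 bits */
  *low = 0xDC00 + (cp & 0x3FF);	/* bottom 10 bits */
}
```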
Now allow me to explain why I can't use CHAR_STRING_ADVANCE. Its first
parameter 'c' is ALREADY a wide character.
I apologize for my stupid typo. I meant STRING_CHAR_ADVANCE, not
CHAR_STRING_ADVANCE. At least I didn't make such a mistake in the 2nd
macro I mentioned, FETCH_STRING_CHAR_ADVANCE...
It's efficient to extract the characters into one unsigned short array
in one shot.
It is indeed quite efficient, but it requires one more pass of the
entire string, which is unnecessary. And anyway, the main reason not
to convert to UTF-16 is what I wrote above about supporting characters
beyond the BMP, not the extra loop.
Or else, I need convert character index to byte offset first. So I
need get byte offset by calling 'string_char_to_byte'
twice to get offset of two elements for comparing, The element coould
could be 1, 2, 3 bytes. So I still need convert bytes
into a 4 byte integer for comparing.
The macros I mentioned already do that for you, they manage both
character offsets and byte offsets efficiently (and they are much more
efficient than string_char_to_byte because they advance sequentially,
whereas string_char_to_byte is for random access to a string).
Thanks.
P.S. Please, let's return this discussion to the list, where it
belongs. I'm disappointed that this was again a private email, and
the list isn't CC'ed. This discussion should be public, not private.
--
help me, help you.
Eli Zaretskii
2018-04-17 15:44:21 UTC
Permalink
Date: Tue, 17 Apr 2018 12:43:33 +1000
You can't increase character weight of certain race or religion in Emacs.
It's politically wrong.
If the weight comes from the user/Lisp program which uses this
primitive, then whatever politics is involved is up to the caller.
chen bin
2018-04-18 07:11:49 UTC
Permalink
Sure.

What's your opinion on my latest patch?
--
help me, help you.
chen bin
2018-04-17 12:31:20 UTC
Permalink
Hi, Eli,
As you suggested, I rewrote the code using 'FETCH_STRING_CHAR_ADVANCE'.

I did some performance testing.

My original wide-character-array solution is twice as fast as the
current solution. But I understand that we now avoid repeated memory
allocation for the wide character array, so the overall performance
looks OK to me.

I also implemented the byte comparison version. It's 4 times as fast,
and I do need it to compare file paths in my package 'counsel-etags'.
That's the major reason why I started this task.

The file paths couldn't contain any funny characters (emoji), so it's
perfectly fine to use the byte comparison version.

Regards,
Chen
Date: Tue, 17 Apr 2018 11:32:23 +1000
I intentionally avoided CCing `emacs-devel` in my previous mail because
I feel `culture/race` is a sensitive thing.
Please don't. There's no sensitivity about this stuff on emacs-devel,
in my long experience with Emacs development.
I will study your mail and figure out a better solution.
Thanks. Feel free to ask more questions, and sorry again for my
stupid typo.
--
help me, help you.
Eli Zaretskii
2018-04-19 08:05:52 UTC
Permalink
Date: Tue, 17 Apr 2018 22:31:20 +1000
As you suggested, I rewrote the code using 'FETCH_STRING_CHAR_ADVANCE'.
Thanks.
I also implemented the byte comparison version. It's 4 times as fast,
and I do need it to compare file paths in my package 'counsel-etags'.
File names are just strings for this purpose, and they can potentially
include any non-zero characters. So I don't see why they are special.
The file paths couldn't contain any funny characters (emoji), so it's
perfectly fine to use the byte comparison version.
File names can very well include emoji and other "funny" characters,
Emacs supports that on all modern systems (including even MS-Windows).
diff --git a/etc/NEWS b/etc/NEWS
index 5aa92e2991..3cce2c48c7 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -490,6 +490,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
+++
** New function assoc-delete-all.
+** New function string-distance to calculate Levenshtein distance between two strings.
This long line should be filled using the fill-column setting we use
in NEWS. Even better, make the header a short summary, like

** New function 'string-distance'

and then describe its functionality in a separate sentence that starts
immediately below that header.
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
+ doc: /* Return Levenshtein distance between STRING1 and STRING2.
+If BYTECOMPARE is nil, we compare character of strings.
+If BYTECOMPARE is t, we compare byte of strings.
Please lose the "we" part, it's inappropriate in documentation,
because it describes what Emacs does.
+Comparing by byte is faster and non-ascii characters has weighted distance.
I would delete this sentence, it is IMO confusing more than anything
else. (And I still think the bytewise comparison is not needed.)
+ bool use_bytecompare = !NILP(bytecompare);
^^
Space between these 2 characters.
+ else
+ {
+ int c1, c2;
+ ptrdiff_t i1, i1_byte, i2, i2_byte;
+ i2 = i2_byte = 0;
+ for (x = 1; x <= len2; x++)
Please move the initialization of i2 and i2_byte into the for-loop
initializer (using the comma operator).
+ i1 = i1_byte = 0;
+ for (y = 1, lastdiag = x - 1; y <= len1; y++)
Likewise here with i1 and i1_byte.

Thanks.
chen bin
2018-04-19 14:55:15 UTC
Permalink
Hi, Eli,
I attached the new patch.

A software project usually contains normal directory and file names.
The Linux kernel source, for example.

My package `counsel-etags` will use the byte comparison version.

Regards,
Chen
--
help me, help you.
chen bin
2018-04-20 04:37:26 UTC
Permalink
Please hold on, I will send you a new patch tonight.
--
help me, help you.
chen bin
2018-04-20 10:47:23 UTC
Permalink
Hi, Eli,

Attached is the latest patch. It will use the byte comparison
algorithm if both strings are unibyte (not multibyte).

Regards,
Chen
--
help me, help you.
Eli Zaretskii
2018-04-21 07:22:54 UTC
Permalink
Date: Fri, 20 Apr 2018 20:47:23 +1000
Attached is the latest patch. It will use the byte comparison
algorithm if both strings are unibyte (not multibyte).
Thanks, let's give people a few days to comment.
Juri Linkov
2018-04-21 20:47:30 UTC
Permalink
Post by Eli Zaretskii
Post by chen bin
Attached is the latest patch. It will use the byte comparison
algorithm if both strings are unibyte (not multibyte).
Thanks, let's give people a few days to comment.
I have no comments on this implementation, but I wonder if it's possible
to adapt this algorithm to implement fuzzy search with the distance?
This supposes there is a customizable option that defines a preferred
default distance, and then the search functions search the text in the
buffer to find strings similar to the search string. I don't know if
this is the same as how “Similarity search” is implemented in
LibreOffice Writer. But it seems this could find the same text as
char-fold search, matching equivalent characters and ignoring
diacritics or glyphless characters like “word joiner”.
Eli Zaretskii
2018-04-28 07:36:20 UTC
Permalink
Date: Sat, 21 Apr 2018 10:22:54 +0300
Date: Fri, 20 Apr 2018 20:47:23 +1000
Attached is the latest patch. It will use the byte comparison
algorithm if both strings are unibyte (not multibyte).
Thanks, let's give people a few days to comment.
No further comments, so I pushed your changes.

A few comments for your future contributions:

. Please include with the changes a commit log message, formatted as
described in CONTRIBUTE.

. We encourage including changes for the relevant manuals with the
code changes; please try doing that in the future. If you do
include the relevant manual changes, mark the NEWS entry with
"+++" to indicate that.

. The tests should be in the file named after the file where the
relevant feature(s) is/are defined. If the code is in src/FOO.c,
the tests should be in test/src/FOO-tests.el; if the code is in
lisp/foo/BAR.el, the tests should be in test/lisp/foo/BAR-tests.el.

Thanks.
chen bin
2018-05-06 09:53:30 UTC
Permalink
Noted. Thanks
--
help me, help you.
Thien-Thi Nguyen
2018-04-20 06:01:52 UTC
Permalink
() chen bin <***@gmail.com>
() Fri, 20 Apr 2018 00:55:15 +1000

A software project usually contains normal directory and file
names.

What you say is perfectly true.

However, i think that truth tends to lead programmers astray, so
that they grow "inwards" (to cater for the "norm", which over
time becomes ever more refined (restricted) in its definition),
and not "upwards" (to stretch the mind (and thus code that is
after all merely mindstuff writ stylishly) to consider what is
similar and what is different, in many (ideally, all) cases).

Of course, "perfect is the enemy of good" so they say, and one
line of code in the repo is worth two in *scratch*, so as always
it's a question of balance.

Anyway, glad to see this functionality added to Emacs.

friends see eye to eye.
always? no. but heart to heart:
character distance.
--
Thien-Thi Nguyen -----------------------------------------------
(defun responsep (query)
(pcase (context query)
(`(technical ,ml) (correctp ml))
...)) 748E A0E8 1CB8 A748 9BFA
--------------------------------------- 6CE4 6703 2224 4C80 7502
Paul Eggert
2018-04-15 18:53:05 UTC
Permalink
Post by Chen Bin
As 'org-babel-edit-distance' is documented, it will "Return the edit
(levenshtein) distance between strings S1 S2". So the problem here is to
calculate *Levenshtein distance*.
First, I doubt whether the callers care whether the code computes
Levenshtein distance, LCS distance, or some other reasonable
string-distance measure. Second, the Myers-Ukkonen algorithm does
compute Levenshtein distance; see, for example:

Papamichail D, Papamichail G. Improved algorithms for approximate string
matching. BMC Bioinformatics. 2009; 10(Suppl 1): S10.
https://dx.doi.org/10.1186/1471-2105-10-S1-S10
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648743/

I don't offhand know whether diffseq.h uses the original Myers-Ukkonen
algorithm or one of Myers's variations with a different distance
measure, but if it's the latter and if users really care then we should
be able to change the algorithm to match the requirements.
Nathan Moreau
2018-04-14 17:18:08 UTC
Permalink
Hi,

What is the difference with the code present in lib/diffseq.h?
Paul Eggert
2018-04-14 17:36:49 UTC
Permalink
Post by Nathan Moreau
What is the difference with the code present in lib/diffseq.h?
lib/diffseq.h uses the Myers-Ukkonen algorithm that scales better for the common
case where strings are closely related. If the two strings are length N and
their Levenshtein distance is D (where D is much less than N), then
lib/diffseq.h is O(N*D) whereas the proposed algorithm is O(N**2).

So yes, it'd be better if the code used lib/diffseq.h rather than
rolling its own algorithm.
Andreas Politz
2018-04-15 18:17:13 UTC
Permalink
Post by Paul Eggert
lib/diffseq.h uses the Myers-Ukkonen algorithm that scales better for
the common case where strings are closely related. If the two strings
are length N and their Levenshtein distance is D (where D is much less
than N), then lib/diffseq.h is O(N*D) whereas the proposed algorithm
is O(N**2).
The Ukkonen algorithm also allows D to be an input parameter, in the
form of a maximal distance it should search for. This makes the
observation even more important, since most callers, I presume, are only
interested in string pairs with distances below some threshold.
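The cutoff idea can be sketched as a plain Levenshtein loop with an early exit (a simple threshold check over the DP column, not the banded O(N*D) algorithm itself; the names here are made up for this sketch):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Byte-wise edit distance with a cutoff: the minimum over a DP column
   never decreases in later columns, so once every entry of the current
   column exceeds MAXDIST, no alignment can come back under the
   threshold and we can return MAXDIST + 1 ("too far") immediately.  */
unsigned int
bounded_distance (const char *s1, const char *s2, unsigned int maxdist)
{
  size_t len1 = strlen (s1), len2 = strlen (s2);
  unsigned int *col = malloc ((len1 + 1) * sizeof *col);
  for (size_t y = 0; y <= len1; y++)
    col[y] = (unsigned int) y;
  for (size_t x = 1; x <= len2; x++)
    {
      col[0] = (unsigned int) x;
      unsigned int lastdiag = (unsigned int) (x - 1), best = col[0];
      for (size_t y = 1; y <= len1; y++)
	{
	  unsigned int olddiag = col[y];
	  unsigned int sub = lastdiag + (s1[y - 1] != s2[x - 1]);
	  unsigned int del = col[y] + 1, ins = col[y - 1] + 1;
	  unsigned int m = sub < del ? sub : del;
	  col[y] = m < ins ? m : ins;
	  lastdiag = olddiag;
	  if (col[y] < best)
	    best = col[y];
	}
      if (best > maxdist)	/* whole column above the threshold */
	{
	  free (col);
	  return maxdist + 1;
	}
    }
  unsigned int result = col[len1];
  free (col);
  return result > maxdist ? maxdist + 1 : result;
}
```

With a small threshold such as D=2, comparisons of dissimilar strings abandon the matrix after only a handful of columns.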

Incidentally the (only) application of the mentioned Org function
(org-babel-edit-distance) uses D=2 (in Emacs 25.3.1).

Andreas